JavaScript - regex - word boundary (\b) issue

I'm having difficulty using \b with Greek characters in a regex.
In this example, [a-zA-ZΆΈ-ώἀ-ῼ]* successfully matches all the words I want (both Greek and English). Now suppose I want to find words with 2 letters. For English I use something like this:
\b[a-zA-Z]{2}\b. Can you help me write a regex that matches Greek words with 2 letters? (Why? My final goal is to remove them.)
text used:
Greek MONOTONIC:
Το γάρ ούν και παρ' υμίν λεγόμενον, ώς ποτε Φαέθων Ηλίου παίς το του πατρός άρμα ζεύξας δια το μή δυνατός είναι κατά την του πατρός οδόν ελαύνειν τα τ' επί της γής ξυνέκαυσε και αυτός κεραυνωθείς διεφθάρη, τούτο μύθου μέν σχήμα έχον λέγεται, το δέ αληθές εστι των περί γήν και κατ' ουρανόν ιόντων παράλλαξις και διά μακρόν χρόνον γιγνομένη των επί γής πυρί πολλώ φθορά.
Greek POLYTONIC:
Τὸ γὰρ οὖν καὶ παρ' ὑμῖν λεγόμενον, ὥς ποτε Φαέθων Ἡλίου παῖς τὸ τοῦ πατρὸς ἅρμα ζεύξας διὰ τὸ μὴ δυνατὸς εἶναι κατὰ τὴν τοῦ πατρὸς ὁδὸν ἐλαύνειν τὰ τ' ἐπὶ τῆς γῆς ξυνέκαυσε καὶ αὐτὸς κεραυνωθεὶς διεφθάρη, τοῦτο μύθου μὲν σχῆμα ἔχον λέγεται, τὸ δὲ ἀληθές ἐστι τῶν περὶ γῆν καὶ κατ' οὐρανὸν ἰόντων παράλλαξις καὶ διὰ μακρὸν χρόνον γιγνομένη τῶν ἐπὶ τῆς γῆς πυρὶ πολλῷ φθορά.
ENGLISH:
For in truth the story that is told in your country as well as ours, how once upon a time Phaethon, son of Helios, yoked his father's chariot, and, because he was unable to drive it along the course taken by his father, burnt up all that was upon the earth and himself perished by a thunderbolt,—that story, as it is told, has the fashion of a legend, but the truth of it lies in the occurrence of a shifting of the bodies in the heavens which move round the earth, and a destruction of the things on the earth by fierce fire, which recurs at long intervals.
what I've tried so far:
// 1
txt = txt.replace(/\b[a-zA-ZΆΈ-ώἀ-ῼ]{2}\b/g, '');
// 2
tokens = txt.split(/\s+/);
txt = tokens.filter(function(token){ return token.length > 2}).join(' ');
// 3
tokens = txt.split(' ');
txt = tokens.filter(function(token){ return token.length != 3}).join(' ');
2 & 3 were suggested to my question here: Javascript - regex - how to remove words with specified length
EDIT
Read also:
Why can't I use accented characters next to a word boundary?
Javascript + Unicode regexes

Older JavaScript engines have no lookbehind (it arrived in ES2018), and word boundaries only work with members of the \w character class, so the portable way is to use groups (capturing groups if you want to make a replacement):
(^|[^a-zA-ZΆΈ-ώἀ-ῼ\n])([a-zA-ZΆΈ-ώἀ-ῼ]{2})(?![a-zA-ZΆΈ-ώἀ-ῼ])
JavaScript does not accept a leading (?m) modifier, so pass the m flag on the regex literal instead. Example to remove 2-letter words (the replacement refers to group 1 as $1, not \1):
txt = txt.replace(/(^|[^a-zA-ZΆΈ-ώἀ-ῼ\n])([a-zA-ZΆΈ-ώἀ-ῼ]{2})(?![a-zA-ZΆΈ-ώἀ-ῼ])/gm, '$1');
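A minimal sketch of the capture-group approach above, wrapped in a function. The name removeTwoLetterWords, the LETTERS constant, and the sample sentence are illustrative assumptions, not part of the original answer. Note that in JavaScript the replacement string refers to the first group as $1.

```javascript
// Letters treated as part of a word: ASCII plus the Greek ranges from the question.
const LETTERS = "a-zA-ZΆΈ-ώἀ-ῼ";

// Hypothetical helper: removes every two-letter word and writes back the
// character (if any) that preceded it, which the pattern captures as group 1.
function removeTwoLetterWords(txt) {
  const pattern = new RegExp(
    "(^|[^" + LETTERS + "\\n])([" + LETTERS + "]{2})(?![" + LETTERS + "])",
    "gm"
  );
  return txt.replace(pattern, "$1");
}

const cleaned = removeTwoLetterWords("Το δια το του");
// Collapse the leftover whitespace before printing.
console.log(cleaned.replace(/\s+/g, " ").trim()); // δια του
```

The removed words leave their surrounding spaces behind, so a whitespace clean-up pass like the one in the last line is usually wanted afterwards.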

You can use \S
Rather than write a match for "word characters plus these characters" it may be appropriate to use a regex that matches not-whitespace:
\S
It's broader in scope, but simpler to write/use.
If that's too broad - use an exclusive list rather than an inclusive list:
[^\s\.]
That is - any character that is not whitespace and not a dot. In this way it's also easy to add to the exceptions.
Don't try to use \b
Word boundaries don't work with non-ASCII characters, which is easy to demonstrate:
> "yay".match(/\b.*\b/)
["yay"]
> "γaγ".match(/\b.*\b/)
["a"]
Therefore it's not possible to use \b to detect words with Greek characters: every Greek letter counts as a non-word character, so a boundary is reported only next to an ASCII word character.
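For engines that support ES2018 regex features, one alternative (not part of the original answer) is to define the boundary yourself with Unicode property escapes and lookarounds instead of \b. The sketch below assumes an environment where the u flag, \p{L}, and lookbehind are available; the helper name is hypothetical.

```javascript
// Hypothetical helper: a hand-made "Unicode-aware word boundary".
// With the u flag, \p{L} matches any letter, Greek included.
// (?<!\p{L}) = not preceded by a letter, (?!\p{L}) = not followed by a letter.
function removeTwoLetterWordsUnicode(txt) {
  return txt.replace(/(?<!\p{L})\p{L}{2}(?!\p{L})/gu, "");
}

console.log(removeTwoLetterWordsUnicode("Το δια το του")); // " δια  του"
```

This only sees precomposed letters as single letters. If the text may contain decomposed accents (a base letter plus combining marks), normalizing it first with txt.normalize("NFC") is probably a sensible precaution.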
Match 2 character words
The following pattern can be used to match two character words:
pattern = /(^|[\s\.,])(\S{2})(?=$|[\s\.,])/g;
(More accurately: to match sequences of exactly two non-whitespace characters.)
That is:
(^|[\s\.,]) - start of string or whitespace/punctuation (back reference 1)
(\S{2}) - two not-whitespace characters (back reference 2)
(?=$|[\s\.,]) - end of string or whitespace/punctuation (positive lookahead)
That pattern can be used like so to remove matching words (the replacement writes back the leading delimiter captured in group 1):
"input string".replace(pattern, '$1');
Here's a jsfiddle demonstrating the pattern's use on the texts in the question.
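As a quick check of the pattern above, the sketch below applies it to a short sample. String.prototype.replace takes the replacement as its second argument; using "$1" puts back the delimiter captured by the first group. The function name and the sample strings are assumptions for illustration.

```javascript
// Hypothetical helper: same pattern as above, replacing each match with the
// leading delimiter it captured.
function removeTwoCharWords(input) {
  const pattern = /(^|[\s\.,])(\S{2})(?=$|[\s\.,])/g;
  return input.replace(pattern, "$1");
}

console.log(removeTwoCharWords("Το δια το του")); // " δια  του"

// Caveat: \S also matches punctuation, so "a," counts as a two-character token.
console.log(removeTwoCharWords("x a, b")); // "x  b"
```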

Try something like this:
\s[a-zA-ZΆΈ-ώἀ-ῼ]{2}\s

Related

Javascript regex select word without special characters

I want to match word without special characters(dot, quotes, etc.) or whitespaces. The text I have
"üstlenmeyeceğimizin üst "ürünlerin daha sağlıklı ve zamanında ulaşabilmesi süstlenmeyeceğimizin şehirlerarası otobüs şirketleriyle çalıştığımızı fakat ısrarınız üstüne oluşabilecek gecikme veya sorunları üstlenmeyeceğimizin teyidini alarak kargoyla gönderim sağladık." üstlenmeyeceğimizin ğtest atest üstlenmeyeceğimizind. test test üst şüst a ğüst .üst üst.büst she sells seashells tüst atest ni ani grüst
asla ısrar etmedim ve ürünlerin sağlığı için i yi olduğuna dair bi r bilgilendirme yapılmadı.
I want to select üst from this text but there is some different situations like below.
I don't want match words listed below
ğüst
şüst
üstlenmeyeceğimizin
"üstlenmeyeceğimizin
I want select those listed words
.üst
üst (there is whitespace before word)
"üst
I wrote this regex: [^a-zçğşöü]üst(?![a-zçğşöü]) but it also selects the special character in front of the word. I don't want the special characters.
In short, I don't want to select the word if it has any leading letters or whitespace, but if a special character leads the word I want to select the word without that special character.
If you do really want only those 3 words:
I want select those listed words
.üst
üst (there is whitespace before word)
"üst
as you asked in your question, then this should be enough:
[" .]üst\b
demo: https://regex101.com/r/kfMZxr/1/
If now you want to include whole words in your matches use:
(?!şğ)[^\s]*üst(?!lenmeyeceğimizin)[^\s]*
https://regex101.com/r/kfMZxr/2
This will also match words such as tüst and grüst in your text, as well as büst and üstüne.
(based on: http://www.ecma-international.org/ecma-262/9.0/index.html)
There is no "üst in your text. Do you mean "üstlenmeyeceğimizin?
Anyway, try this: [.\s"]üst\b
I need to match words and replace them with html tags, for example:
üst -> <b>üst</b>
I solved my problem with this regex
([^a-zçğşöü])üst(?![a-zçğşöü])
This regex also matches whitespaces and special characters but I grouped them in regex then excluded from html tags when replacing.
Regex test: https://regexr.com/4amj3
Working example: https://codepen.io/asipek/pen/xBQGPK?editors=0011
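The grouping trick described above can be sketched as follows: the leading non-letter is captured and written back outside the tag with $1. The function name and sample string are illustrative assumptions; the regex is the one from the answer with the g flag added.

```javascript
// Hypothetical helper: wraps the standalone word "üst" in <b> tags.
// Group 1 holds the preceding non-letter character, which is written back
// in front of the tag via $1.
function highlightUst(text) {
  return text.replace(/([^a-zçğşöü])üst(?![a-zçğşöü])/g, "$1<b>üst</b>");
}

console.log(highlightUst('a üst "üst .üst ğüst şüst üstlenmeyeceğimizin'));
// a <b>üst</b> "<b>üst</b> .<b>üst</b> ğüst şüst üstlenmeyeceğimizin
```

Because the pattern requires a preceding character, a word at the very start of the string is not matched; replacing the first group with (^|[^a-zçğşöü]) would likely cover that case.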

Regex for the following for removing first set of brackets

I need to strip out the "( ( listen) LEE-mər)" in the following text in javascript, including outer brackets. The content within the outer brackets dynamically changes. I also don't want to strip the next set of brackets (ghosts or spirits)
Here is what I found on Wikipedia: Lemurs ( ( listen) LEE-mər) are a clade of strepsirrhine primates endemic to the island of Madagascar. The word lemur derives from the word lemures (ghosts or spirits) from Roman mythology and was first used to describe a slender loris due to its nocturnal habits and slow pace, but was later applied to the primates on Madagascar."
I got as far as
/\((.*?\()*/g
but it doesn't work. Is it possible to do in regex?
Try this regex :
/\(([^)]*)\)[^)]*\)/
To retrieve only what you want :
var text = "Here...";
var response = text.match(/\(([^)]*)\)[^)]*\)/)[0];
You could try this regular expression to match "( ( listen) LEE-mər)" and remove it from the text:
\(\s*(?=\(\s*listen).+?\).+?\)
This does not match (ghosts or spirits).
You can test it on this link.
Here is an example in Javascript:
const str = "Here is what I found on Wikipedia: Lemurs ( ( listen) LEE-mər) are a clade of strepsirrhine primates endemic to the island of Madagascar. The word lemur derives from the word lemures (ghosts or spirits) from Roman mythology and was first used to describe a slender loris due to its nocturnal habits and slow pace, but was later applied to the primates on Madagascar.";
const transformed = str.replace(/\(\s*(?=\(\s*listen).+?\).+?\)/g, "");
document.getElementById("str").innerHTML = str;
document.getElementById("transformed").innerHTML = transformed;
<div id="str"></div>
<br />
<div id="transformed"></div>
Regex explanation
The regex looks first for a left round bracket followed by zero or more spaces:
\(\s*
which is followed by "listen" in brackets (using positive lookahead):
(?=\(\s*listen)
This is then followed, two times, by a non-greedy run of characters ending in a closing bracket:
.+?\).+?\)
Regex optimisation
You could optimize the original regex with this one:
\(\s*(?=\(\s*listen)(.+?\)){2}
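If the bracketed content does not always contain "listen", a regex-free alternative is to count bracket depth and cut out the first balanced group. This is a different technique from the answers above, shown only as a sketch; the function name is hypothetical.

```javascript
// Hypothetical helper: removes the first balanced "( ... )" group,
// including any brackets nested inside it.
function removeFirstParenGroup(text) {
  const start = text.indexOf("(");
  if (start === -1) return text; // nothing to remove

  let depth = 0;
  for (let i = start; i < text.length; i++) {
    if (text[i] === "(") depth++;
    else if (text[i] === ")") depth--;

    if (depth === 0) {
      // i is the bracket that closes the group opened at start
      return text.slice(0, start) + text.slice(i + 1);
    }
  }
  return text; // unbalanced brackets: leave the text unchanged
}

const sample = "Lemurs ( ( listen) LEE-mər) are primates (ghosts or spirits).";
console.log(removeFirstParenGroup(sample));
// Lemurs  are primates (ghosts or spirits).
```

The space before the removed group is kept, so the result contains a double space that can be collapsed afterwards if needed.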

Need a javascript function that doesn't care about special characters in a string

I have this html string x:
Michelle Brook
<br></br>
The Content Mine
<br></br>
michelle#contentmine.org
It is taken from first lines of http://www.dlib.org/dlib/november14/brook/11brook.html
I would like to obtain x.substring(0,14)=Michelle Brook.
The problem is that before the M there are two special characters (Unicode code point 10, a line feed), which makes x.substring(0,14) return "Michelle Bro".
In fact, using x.split("") I can see {" "," ","M",.....}
I would rather not remove these characters.
I would like substring to do the right thing while taking those special characters into account. How could I do that? Is there a different JavaScript function for this?
From your webpage:
window.onload = function() {
var arrStr = document.getElementsByClassName('blue')[0].innerHTML.replace(/[^A-Za-z0-9 <>]/g, '').split('<br>');
alert(arrStr[0].trim());
}
<p class="blue">
Michelle Brook<br>
The Content Mine<br>
michelle#contentmine.org<br><br>
Peter Murray-Rust<br>
University of Cambridge<br>
pm286#cam.ac.uk<br><br>
Charles Oppenheim<br>
City, Northampton and Robert Gordon Universities<br>
c.oppenheim#btinternet.com
<br><br>doi:10.1045/november14-brook
</p>
With the replace function you can remove any character that is of no interest to you:
in your case I assumed you are looking for letters (uppercase and lowercase), numbers and spaces. You can add other characters to keep to the character class.
Why can't you pass this string to a function and trim the start and end of the string?
Use .trim to remove \n (code 10)
The trim() method removes whitespace from both ends of a string.
Whitespace in this context is all the whitespace characters (space,
tab, no-break space, etc.) and all the line terminator characters (LF,
CR, etc.).
x.trim().substring(0,14);
Or using a regex:
var match = x.match(/[\w ]{14}/);
console.log(match[0]);
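A small sketch of the trim approach, using a made-up string with two leading line feeds as described in the question. The last part is a hypothetical variant that leaves x untouched and only shifts the substring offsets; it assumes String.prototype.trimStart (ES2019) is available.

```javascript
// Two leading line feeds (code 10), as described in the question.
const x = "\n\nMichelle Brook\n<br></br>\nThe Content Mine";

console.log(JSON.stringify(x.substring(0, 14)));        // "\n\nMichelle Bro"
console.log(JSON.stringify(x.trim().substring(0, 14))); // "Michelle Brook"

// Variant: keep x as it is and skip the leading whitespace by offset.
const offset = x.length - x.trimStart().length; // number of leading whitespace characters
console.log(x.substring(offset, offset + 14));  // Michelle Brook
```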

split line via regex in javascript?

I have this structure of text :
1.6.1 Members................................................................ 12
1.6.2 Accessibility.......................................................... 13
1.6.3 Type parameters........................................................ 13
1.6.4 The T generic type aka <T>............................................. 13
I need to create JS objects :
{
num:"1.6.1",
txt:"Members"
},
{
num:"1.6.2",
txt:"Accessibility"
} ...
That's not a problem.
The problem is that I want to extract values via Regex split via positive lookahead :
Split via the first time you see that next character is a letter
What have i tried :
'1.6.1 Members........... 12'.split(/\s(?=(?:[\w\. ])+$)/i)
This is working fine :
["1.6.1", "Members...........", "12"] // I don't care about the 12.
But If I have 2 words or more :
'1.6.3 Type parameters................ 13'.split(/\s(?=(?:[\w\. ])+$)/i)
The result is :
["1.6.3", "Type", "parameters................", "13"] //again I don't care about 13.
Of course I can join them , but I want the words to be together.
Question :
How can I enhance my regex NOT to split words ?
Desired result :
["1.6.3", "Type parameters"]
or
["1.6.3", "Type parameters........"] // I will remove extras later
or
["1.6.3", "Type parameters........13"]// I will remove extras later
NB
I know I can do split via " " or by other simpler solution but I'm seeking ( for pure knowledge) for an enhancement for my solution which uses positive lookahead split.
nb2 :
The text can contain capital letter in the middle also.
You can use this regex:
/^(\d+(?:\.\d+)*) (\w+(?: \w+)*)/gm
And get your desired matches using matched group #1 and matched group #2.
Update: For String#split you can use this regex:
/ +(?=[A-Z\d])/g
Update 2: Since chapter names can also contain capital letters, the following more complex regex is needed:
var re = /(\D +(?=[a-z]))| +(?=[a-z\d])/gmi;
var str = '1.6.3 Type Foo Bar........................................................ 13';
var m = str.split( re );
console.log(m[0], ',', m.slice(1, -1).join(''), ',', m.pop() );
//=> 1.6.3 , Type Foo Bar........................................................ , 13
EDIT: Since you added 1.6.1 The .net 4.5 framework.... to the requirements, we can tweak the answer to this:
^([\d.]+) ((?:[^.]|\.(?!\.))+)
And if you want to allow sequences of up to three dots in the title, as in 1.6.1 She said... Boo!..........., it's an easy tweak from there ({3} quantifier):
^([\d.]+) ((?:[^.]|\.(?!\.{3}))+)
Original:
^([\d.]+) ([^.]+)
In the regex demo, see the Groups in the right pane.
To retrieve Groups 1 and 2, something like:
var myregex = /^([\d.]+) ((?:[^.]|\.(?!\.))+)/mg;
var theMatchObject = myregex.exec(yourString);
while (theMatchObject != null) {
// the numbers: theMatchObject[1]
// the title: theMatchObject[2]
theMatchObject = myregex.exec(yourString);
}
OUTPUT
Group 1 Group 2
1.6.1 Members
1.6.2 Accessibility
1.6.3 Type parameters
1.6.4 The T generic type aka <T>
1.6.1 The .net 4.5 framework
Explanation
^ asserts that we are at the beginning of the line
The parentheses in ([\d.]+) capture digits and dots to Group 1
The parentheses in ((?:[^.]|\.(?!\.))+) capture to Group 2...
[^.] one char that is not a dot, | OR...
\.(?!\.) a dot that is not followed by a dot...
+ one or more times
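The pieces explained above can be put together to build the { num, txt } objects the question asks for. This is a minimal sketch: the function name parseToc and the line-by-line loop are assumptions for illustration, while the regex is the one from this answer.

```javascript
// Hypothetical helper: turns table-of-contents lines into { num, txt } objects.
function parseToc(text) {
  // Group 1: section number. Group 2: title, ending before the first run of dots.
  const lineRe = /^([\d.]+) ((?:[^.]|\.(?!\.))+)/;
  const entries = [];
  for (const line of text.split("\n")) {
    const m = lineRe.exec(line);
    if (m) {
      entries.push({ num: m[1], txt: m[2] });
    }
  }
  return entries;
}

const toc = [
  "1.6.1 Members................................................................ 12",
  "1.6.3 Type parameters........................................................ 13",
  "1.6.4 The T generic type aka <T>............................................. 13",
].join("\n");

console.log(parseToc(toc));
// [ { num: '1.6.1', txt: 'Members' },
//   { num: '1.6.3', txt: 'Type parameters' },
//   { num: '1.6.4', txt: 'The T generic type aka <T>' } ]
```

Matching one line at a time avoids the case where [^.] would run across a line break on a line that has no dot leaders.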
You can use this pattern too:
var myStr = "1.6.1 Members................................................................ 12\n1.6.2 Accessibility.......................................................... 13\n1.6.3 Type parameters........................................................ 13\n1.6.4 The T generic type aka <T>............................................. 13";
console.log(myStr.split(/ (.+?)\.{2,} ?\d+$\n?/m));
About a way with a lookahead :
I don't think it is possible, because the only way to skip a character (here a space between two words) is to consume it as part of the match that started at the previous occurrence of a space (between the number and the first word). In other words, you rely on the fact that characters cannot be matched more than once.
But if everything in the pattern except the space where you want to split is enclosed in a lookahead, the substring checked by the lookahead isn't part of the match result (it's only a check, and the corresponding characters are not consumed by the regex engine). So you can't skip the following spaces, and the regex engine will simply carry on to the next space character and match there too.

Automatically generate tags from strings with javascript

I need to -automatically- generate tags for a text string. In this case, I'll use this string:
var text = 'This text talks about loyalty in the Royal Family with Príncipe Charles';
My current implementation, generates the tags for words that are 6+ characters long, and it works fine.
words = (text).replace(/[^a-zA-Z\s]/g,function(str){return '';});
words = words.match(/\w{6,}/g);
console.log(words);
This will return:
["loyalty","Family","Prince","Charles"]
The problem is that sometimes, a tag should be a specific set of words. I need the result to be:
["loyalty","Royal Family","Príncipe Charles"]
That means, that the replace/match code should test for:
words that are 6 characters long (or more); and/or
if a set of words starts with an uppercase letter, those words should be joined together in the same array element. It doesn't matter if some of the words are less than 6 characters long - but at least one of them has to be 6+, e.g.: "Stop at The UK Guardián in London" should return ["The UK Guardián", "London"]
I'm obviously having trouble in the second requirement. Any ideas? Thanks!
var text = 'This text talks about loyalty in the Royal Family with Prince Charles. Stop at The UK Guardian in London';
text.match(/(([A-Z]\w*\s*){2,})|(\w{6,})/g)
will return
["loyalty", "Royal Family ", "Prince Charles", "The UK Guardian ", "London"]
To fulfill the second requirement, it's better to run another regexp over the matches found:
var text = 'This is a Short Set Of Words about the Royal Family';
var matches = text.match(/(([A-Z]\w*\s*){2,})|(\w{6,})/g);
// keep only the matches that contain at least one word of 6+ characters
var tags = matches.filter(function(m) {
return /\w{6,}/.test(m);
});
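Both steps can be combined into one function, sketched below. The name generateTags is hypothetical, and the trim step is an addition to drop the trailing space that the first regex captures after a run of capitalized words.

```javascript
// Hypothetical helper combining the two steps above.
function generateTags(text) {
  // Step 1: runs of two or more capitalized words, or single words of 6+ characters.
  // match() returns null when nothing matches, hence the fallback to [].
  const candidates = text.match(/(([A-Z]\w*\s*){2,})|(\w{6,})/g) || [];
  // Step 2: keep candidates containing at least one 6+ character word,
  // then trim the trailing whitespace.
  return candidates
    .filter((m) => /\w{6,}/.test(m))
    .map((m) => m.trim());
}

console.log(generateTags(
  "This text talks about loyalty in the Royal Family with Prince Charles. Stop at The UK Guardian in London"
));
// [ 'loyalty', 'Royal Family', 'Prince Charles', 'The UK Guardian', 'London' ]
```

Note that \w only covers ASCII letters, digits and underscore, so accented words such as Príncipe would be split at the accented letter with this pattern.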
Okay, here's an idea. This is probably not the very best way to do this, but it might be a good start for you.
In order to match strings like Royal Family and Prince Charles, or perhaps even The United Kingdom, you could write a regex that looks for a succession of words that each start with a capital letter.
This might look like this: ([A-Z][a-z]{5,} )+
You could then use the replace function to generate a new string with the matches removed, and then use your original regex to match single words of a minimum length.
Update: In response to the comment about the other user's answer, I have added the {5,} quantifier to indicate a capital letter followed by five or more lowercase letters and a space, one or more times.
