Javascript regex select word without special characters

I want to match a word without the special characters (dot, quotes, etc.) or whitespace next to it. This is the text I have:
"üstlenmeyeceğimizin üst "ürünlerin daha sağlıklı ve zamanında ulaşabilmesi süstlenmeyeceğimizin şehirlerarası otobüs şirketleriyle çalıştığımızı fakat ısrarınız üstüne oluşabilecek gecikme veya sorunları üstlenmeyeceğimizin teyidini alarak kargoyla gönderim sağladık." üstlenmeyeceğimizin ğtest atest üstlenmeyeceğimizind. test test üst şüst a ğüst .üst üst.büst she sells seashells tüst atest ni ani grüst
asla ısrar etmedim ve ürünlerin sağlığı için i yi olduğuna dair bi r bilgilendirme yapılmadı.
I want to select üst from this text, but there are several different cases, listed below.
I don't want to match these words:
ğüst
şüst
üstlenmeyeceğimizin
"üstlenmeyeceğimizin
I want to select these:
.üst
üst (there is whitespace before the word)
"üst
I wrote this regex: [^a-zçğşöü]üst(?![a-zçğşöü]), but it also selects the leading special character, which I don't want.
In short, I don't want a match if the word is preceded by letters. If a special character or whitespace precedes the word, I want to select the word without that character.

If you really want only the three forms listed in your question:
.üst
üst (there is whitespace before the word)
"üst
then this should be enough:
[" .]üst\b
demo: https://regex101.com/r/kfMZxr/1/
If you now want to include whole words in your matches, use:
(?!şğ)[^\s]*üst(?!lenmeyeceğimizin)[^\s]*
https://regex101.com/r/kfMZxr/2
This will allow matches such as tüst and grüst in your text, as well as büst and üstüne.
(based on: http://www.ecma-international.org/ecma-262/9.0/index.html)

There is no "üst in your text; do you mean "üstlenmeyeceğimizin?
Anyway, try this: [.\s"]üst\b

I need to match words and wrap them in HTML tags, for example:
üst -> <b>üst</b>
I solved my problem with this regex:
([^a-zçğşöü])üst(?![a-zçğşöü])
This regex also matches the leading whitespace or special character, but I capture it in a group and keep it outside the HTML tags when replacing.
Regex test: https://regexr.com/4amj3
Working example: https://codepen.io/asipek/pen/xBQGPK?editors=0011
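
A minimal sketch of the replacement described above, assuming the pattern is used with the g flag and the captured leading character is written back in front of the tag. The helper name highlightUst and the sample string are made up for illustration. The character class is the one from the regex above, so uppercase letters and the dotless ı are not treated as letters here.

```javascript
// Hypothetical helper: wraps a standalone "üst" in <b> tags.
// Group 1 holds the non-letter character before the word, so it is
// written back outside the tag via $1.
function highlightUst(text) {
  return text.replace(/([^a-zçğşöü])üst(?![a-zçğşöü])/g, "$1<b>üst</b>");
}

// Made-up sample covering the cases listed in the question.
const sample = 'test .üst "üst şüst üstlenmeyeceğimizin a üst';
const highlighted = highlightUst(sample);
console.log(highlighted);
```

Because the pattern requires one character before the word, an üst at the very start of the string is not matched. Changing the first group to (^|[^a-zçğşöü]) would be one way to cover that case.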

Related

Trying to write a regex where a newline may appear anywhere in a group

I'm trying to make a regex divide text into two parts and ignore everything that comes after these two parts.
The (insufficient) regex I'm trying to use is:
/Artikelnummer(?:(&&&))(.*)(?:\s*.*)\W?(?:Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite&&&)((.*)&&&(.*)&&&(\d)+)*/
The text I'm matching is saved at these links:
https://regex101.com/r/VDnUoe/1
https://regex101.com/r/j62Mw0/2
Part 1) Everything after Artikelnummer and before Dokumentation... (easy to match)
Part 2) Everything after (?:Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite&&&) that follows the pattern:
text&&&text&&&digits
In one of the above links, the above pattern works except for a new line that is thrown in, which causes some text to be left out that should be included.
The first part is matched:
all&&&Vorwort&&&1&&&all&&&Sicherheit&&&2&&&all&&&Richtlinien und Normen&&&3&&&all&&&Produktbeschreibung&&&4&&&all&&&Installation&&&5&&&all&&&Wichtige Informationene zur Inbetriebnahme&&&6&&&all&&&Projektierung - Wichtige Infos&&&7&&&all&&&Anhang 1&&&8&&&all&&&Anhang 2&&&9&&&all&&&Anhang 3&&&10&&&all&&&Anhang 4&&&11&&&all&&&Anhang 5&&&12&&&all&&&Anhang 6&&&13&&&all&&&Anhang 7&&&14&&&all&&&Anhang 8&&&15&&&all&&&Anhang 9&&&16&&&all&&&Anhang 10&&&17&&&all&&&Anhang 11&&&18&&&all&&&Anhang 12&&&19&&&all&&&Anhang 13&&&20&&&all&&&Anhang 14&&&21&&&all&&&Anhang 15&&&22&&&all&&&Anhang 16&&&23&&&all&&&Anhang 17&&&24&&&all&&&Anhang 18&&&25&&&all&&&Anhang 19&&&26&&&all&&&Anhang 20&&&27&&&all&&&Anhang 21&&&28&&&all&&&Anhang 22&&&29&&&all&&&Anhang 23&&&30&&&all&&&Anhang 24&&&31&&&all&&&Anhang 25&&&32&&&all&&&Anhang 26&&&33
And then this isn't matched, because a newline is inserted:
all&&&Anhang 27&&&34&&&all&&&Anhang 28&&&35&&&all&&&Anhang 29&&&36&&&all&&&Anhang 30&&&37&&&all&&&Anhang 31&&&38&&&all&&&Anhang 32&&&39&&&all&&&Anhang 33&&&40&&&all&&&Anhang 34&&&41&&&all&&&Anhang 35&&&42&&&all&&&Anhang 36&&&43&&&all&&&Anhang 37&&&44&&&all&&&Anhang 38&&&45
My question is, how can this regex be rewritten so that a newline could theoretically be placed anywhere within the second part of the text and still match everything I want?

I'm not sure this is what you want, but this regex works with newlines too:
Artikelnummer(?:(&&&))(.*)(?:\s*.*)\W?(?:Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite&&&)((.*)&&&(.*)&&&(\d)+(\n?)*)*
\n matches a newline
? is the quantifier for zero or one (whether or not a newline is found)
* is added in case more than one newline is encountered

I would try a regex like this:
(Artikelnummer([\n|\r| |\S]*)(?=Dokumentation))(([\n|\r| |\S]*&&&){2}\d+)*
The character class matches \n, \r, spaces and all non-space characters. (Inside a character class, | is a literal pipe rather than alternation, so it can be dropped.)
Second, I wouldn't use ?: for every group. The positive lookahead ?= should give you what you need for the first group.
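
Another option, not taken from the answers above, is to stop matching the whole list with one repeated group and instead pull out one record at a time, accepting either &&& or a line break as the separator between records. The sketch below assumes the line break falls between records, as in the excerpt quoted in the question. The sample string and the names header, record, parseEntries, doc, title and page are made up for illustration.

```javascript
// Made-up sample in the same shape as the question's data,
// with a line break between the second and third record.
const header = "Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite&&&";
const sample =
  "Artikelnummer&&&12345\n" +
  header +
  "all&&&Vorwort&&&1&&&all&&&Sicherheit&&&2\n" +
  "all&&&Anhang 1&&&3";

// One record: text&&&text&&&digits, followed by &&&, line breaks, or end of input.
const record = /([^&\r\n]+)&&&([^&\r\n]+)&&&(\d+)(?:&&&|[\r\n]+|$)/g;

// Hypothetical helper: returns the records found after the header.
function parseEntries(text) {
  const start = text.indexOf(header);
  if (start === -1) return [];
  const body = text.slice(start + header.length);
  const entries = [];
  for (const m of body.matchAll(record)) {
    entries.push({ doc: m[1], title: m[2], page: Number(m[3]) });
  }
  return entries;
}

const entries = parseEntries(sample);
console.log(entries);
```

Matching record by record also avoids a limitation of the repeated group: a capturing group inside (...)* only keeps its last iteration. A line break in the middle of a field would still need separate handling.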

Need a javascript function that doesn't care about special characters in a string

I have this html string x:
Michelle Brook
<br></br>
The Content Mine
<br></br>
michelle#contentmine.org
It is taken from first lines of http://www.dlib.org/dlib/november14/brook/11brook.html
I would like to obtain x.substring(0,14) = Michelle Brook.
The problem is that before the M there are two special characters (Unicode code 10, line feed), so x.substring(0,14) returns Michelle Bro.
In fact, using x.split("") I can see {" "," ","M",.....}
I would rather not remove these characters.
I would like substring to do the right thing while taking those special characters into account. How can I do that? Is there a different JavaScript function for this?

From your webpage:
window.onload = function() {
    var arrStr = document.getElementsByClassName('blue')[0].innerHTML.replace(/[^A-Za-z0-9 <>]/g, '').split('<br>');
    alert(arrStr[0].trim());
};
<p class="blue">
Michelle Brook<br>
The Content Mine<br>
michelle#contentmine.org<br><br>
Peter Murray-Rust<br>
University of Cambridge<br>
pm286#cam.ac.uk<br><br>
Charles Oppenheim<br>
City, Northampton and Robert Gordon Universities<br>
c.oppenheim#btinternet.com
<br><br>doi:10.1045/november14-brook
</p>
With the replace function you can remove any character you are not interested in.
In your case I assumed you are looking for letters (uppercase and lowercase), digits and spaces. You can adjust the character class to change which characters are removed.

Why can't you pass this string to a function and trim the start and end of the string?

Use .trim to remove \n (code 10)
The trim() method removes whitespace from both ends of a string.
Whitespace in this context is all the whitespace characters (space,
tab, no-break space, etc.) and all the line terminator characters (LF,
CR, etc.).
x.trim().substring(0,14);
Or using a regex:
var match = x.match(/[\w ]{14}/);
console.log(match[0]);
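
A short sketch of the trim() approach, assuming a string shaped like the one in the question (two line feeds before the name). The sample value is made up. Strings are immutable in JavaScript, so x.trim() returns a new string and x keeps its leading characters.

```javascript
// Made-up sample: two line feeds (code 10) before the name.
const x = "\n\nMichelle Brook\n<br></br>\nThe Content Mine";

const raw = x.substring(0, 14);          // the two leading "\n" count towards the 14
const name = x.trim().substring(0, 14);  // trim() returns a copy; x is unchanged

console.log(JSON.stringify(raw));
console.log(JSON.stringify(name));
```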

Javascript - regex - word boundary (\b) issue

I am having difficulty using \b with Greek characters in a regex.
In this example, [a-zA-ZΆΈ-ώἀ-ῼ]* succeeds in marking all the words I want (both Greek and English). Now suppose I want to find words with 2 letters. For English I use something like this:
\b[a-zA-Z]{2}\b. Can you help me write a regex that marks Greek words with 2 letters? (Why? My final goal is to remove them.)
text used:
Greek MONOTONIC:
Το γάρ ούν και παρ' υμίν λεγόμενον, ώς ποτε Φαέθων Ηλίου παίς το του πατρός άρμα ζεύξας δια το μή δυνατός είναι κατά την του πατρός οδόν ελαύνειν τα τ' επί της γής ξυνέκαυσε και αυτός κεραυνωθείς διεφθάρη, τούτο μύθου μέν σχήμα έχον λέγεται, το δέ αληθές εστι των περί γήν και κατ' ουρανόν ιόντων παράλλαξις και διά μακρόν χρόνον γιγνομένη των επί γής πυρί πολλώ φθορά.
Greek POLYTONIC:
Τὸ γὰρ οὖν καὶ παρ' ὑμῖν λεγόμενον, ὥς ποτε Φαέθων Ἡλίου παῖς τὸ τοῦ πατρὸς ἅρμα ζεύξας διὰ τὸ μὴ δυνατὸς εἶναι κατὰ τὴν τοῦ πατρὸς ὁδὸν ἐλαύνειν τὰ τ' ἐπὶ τῆς γῆς ξυνέκαυσε καὶ αὐτὸς κεραυνωθεὶς διεφθάρη, τοῦτο μύθου μὲν σχῆμα ἔχον λέγεται, τὸ δὲ ἀληθές ἐστι τῶν περὶ γῆν καὶ κατ' οὐρανὸν ἰόντων παράλλαξις καὶ διὰ μακρὸν χρόνον γιγνομένη τῶν ἐπὶ τῆς γῆς πυρὶ πολλῷ φθορά.
ENGLISH:
For in truth the story that is told in your country as well as ours, how once upon a time Phaethon, son of Helios, yoked his father's chariot, and, because he was unable to drive it along the course taken by his father, burnt up all that was upon the earth and himself perished by a thunderbolt,—that story, as it is told, has the fashion of a legend, but the truth of it lies in the occurrence of a shifting of the bodies in the heavens which move round the earth, and a destruction of the things on the earth by fierce fire, which recurs at long intervals.
what I've tried so far:
// 1
txt = txt.replace(/\b[a-zA-ZΆΈ-ώἀ-ῼ]{2}\b/g, '');
// 2
tokens = txt.split(/\s+/);
txt = tokens.filter(function(token){ return token.length > 2}).join(' ');
// 3
tokens = txt.split(' ');
txt = tokens.filter(function(token){ return token.length != 3}).join(' ');
2 & 3 were suggested to my question here: Javascript - regex - how to remove words with specified length
EDIT
Read also:
Why can't I use accented characters next to a word boundary?
Javascript + Unicode regexes

Since JavaScript didn't have lookbehind when this was written (it was added in ES2018) and since word boundaries work only with members of the \w character class, the way to do it without lookbehind is to use groups (and capturing groups if you want to make a replacement). The pattern, to be used with the m flag:
(^|[^a-zA-ZΆΈ-ώἀ-ῼ\n])([a-zA-ZΆΈ-ώἀ-ῼ]{2})(?![a-zA-ZΆΈ-ώἀ-ῼ])
Example to remove 2-letter words (note that JavaScript uses $1, not \1, in the replacement string):
txt = txt.replace(/(^|[^a-zA-ZΆΈ-ώἀ-ῼ\n])([a-zA-ZΆΈ-ώἀ-ῼ]{2})(?![a-zA-ZΆΈ-ώἀ-ῼ])/gm, '$1');
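
In engines that support ES2018 lookbehind and Unicode property escapes, the same idea can be written without listing character ranges. This is a sketch under that assumption, not the method from the answer above. The names twoLetterWord and dropTwoLetterWords are made up, and the input is normalized to NFC on the assumption that accented letters should count as single characters.

```javascript
// \p{L} matches any Unicode letter; the u flag is required for it.
// The lookarounds play the role that \b plays for ASCII words.
const twoLetterWord = /(?<!\p{L})\p{L}{2}(?!\p{L})/gu;

// Hypothetical helper: removes two-letter words and tidies the spacing.
function dropTwoLetterWords(text) {
  return text
    .normalize("NFC")          // compose letter + accent into one code point
    .replace(twoLetterWord, "")
    .replace(/\s+/g, " ")      // collapse the gaps left behind
    .trim();
}

const cleaned = dropTwoLetterWords("Το γάρ ούν to be or not");
console.log(cleaned);
```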

You can use \S
Rather than write a match for "word characters plus these characters" it may be appropriate to use a regex that matches not-whitespace:
\S
It's broader in scope, but simpler to write/use.
If that's too broad - use an exclusive list rather than an inclusive list:
[^\s\.]
That is - any character that is not whitespace and not a dot. In this way it's also easy to add to the exceptions.
Don't try to use \b
Word boundaries don't work with non-ASCII characters, which is easy to demonstrate:
> "yay".match(/\b.*\b/)
["yay"]
> "γaγ".match(/\b.*\b/)
["a"]
Therefore it's not possible to use \b to detect words with Greek characters: every Greek letter counts as a non-word character, so boundaries appear only next to ASCII word characters.
Match 2 character words
The following pattern can be used to match two character words:
pattern = /(^|[\s\.,])(\S{2})(?=$|[\s\.,])/g;
(More accurately: to match sequences of two non-whitespace characters.)
That is:
(^|[\s\.,]) - start of string or whitespace/punctuation (back reference 1)
(\S{2}) - two not-whitespace characters (back reference 2)
(?=$|[\s\.,]) - end of string or whitespace/punctuation (positive lookahead)
That pattern can be used like so to remove matching words, keeping the leading separator captured in group 1:
"input string".replace(pattern, '$1');

Try something like this:
\s[a-zA-ZΆΈ-ώἀ-ῼ]{2}\s

Capture words not followed by symbol

I need to capture all (English) words except abbreviations, whose pattern is:
"_any-word-symbols-including-dash."
(so there is an underscore at the beginning, a dot at the end, and any letters and dashes in the middle)
I tried something like this:
/\b([A-Za-z-^]+)\b[^\.]/g
but it seems that I don't understand how to work with negative matches.
UPDATE:
I need not just to match but to wrap the words in some tags. From
"a some words _abbr-abbr. a here" I should get:
<w>a</w> <w>some</w> <w>words</w> _abbr-abbr. <w>a</w> <w>here</w>
So I need to use replace with the correct regex:
test.replace(/correct regex/, '<w>$1</w>')

Negative lookahead is (?!...).
So you can use:
/\b([^_\s]\w*(?!\.))\b/g
Unfortunately, there was no lookbehind in JavaScript when this was written (it was added in ES2018), so you couldn't do a similar trick for "not prefixed by _".
Example:
> a = "a some words _abbr. a here"
> a.replace(/\b([^_\s]\w*(?!\.))\b/g, "<w>$1</w>")
"<w>a</w> <w>some</w> <w>words</w> _abbr. <w>a</w> <w>here</w>"
Following your comment about -, the updated regex is:
/\b([^_\s\-][\w\-]*(?!\.))\b/g
> "abc _abc-abc. abc".replace(/\b([^_\s\-][\w\-]*(?!\.))\b/g, "<w>$1</w>")
"<w>abc</w> _abc-abc. <w>abc</w>"

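A different technique from the lookahead answer above, sketched here as an assumption about what would fit the question: match either a whole abbreviation or a plain word, and let a replacer callback decide whether to wrap it. The function name wrapWords is made up. Words are assumed to consist of ASCII letters only, so a hyphenated word would be wrapped piece by piece.

```javascript
// Alternation order matters: the abbreviation branch (_letters-and-dashes.)
// is tried first, so its letters are never seen by the plain-word branch.
function wrapWords(text) {
  return text.replace(/_[A-Za-z-]+\.|[A-Za-z]+/g, (match) =>
    match.startsWith("_") ? match : `<w>${match}</w>`
  );
}

const wrapped = wrapWords("a some words _abbr-abbr. a here.");
console.log(wrapped);
```

Unlike the lookahead version, this also wraps a word that is directly followed by a dot, such as the last word of a sentence.
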
Regular expression for wrapping every word in span tag on word boundaries in javascript

I want to wrap every word of a string in a <span> tag, without breaking any existing html tags and without including any punctuation marks.
For example the following string:
This... is, an. example! <em>string</em>?!
should be wrapped as:
<span>This</span>... <span>is</span>, <span>an</span>. <span>example</span>!
<span><em>string</em></span>?!
Ideally, I just need to wrap the words and nothing else.
Except for apostrophes, they should be wrapped, too.
it's => <span>it's</span>
give 'em => <span>give</span> <span>'em</span>
teachers' => <span>teachers'</span>
Right now I'm using a very simple regular expression:
str.replace(/([^\s<>]+)(?:(?=\s)|$)/g, '<span>$1</span>');
I found it somewhere here on Stack Overflow. But it only splits on whitespace, so it wraps punctuation marks too, which is undesirable in my case.
I know I should be ashamed for being so lousy at regular expressions.
Can someone please help me?
Many thanks!

Try this regex:
var str = "This string... it's, an. example! <em>string</em>?!";
str.replace(/([A-Za-z0-9'<>/]+)/g, '<span>$1</span>');
// "<span>This</span> <span>string</span>... <span>it's</span>, <span>an</span>. <span>example</span>! <span><em>string</em></span>?!"

I played around and got this to work (note that this is Java, not JavaScript):
String toMarkUp = "Each word needs a strong tag around it. I really want to wrap each and every word";
String markedUp = toMarkUp.replaceAll("\\b(\\w+)\\b","<span>$1</span>");
The regex captures every word of one or more characters (\w+) surrounded by word boundaries, and the replacement refers to it with $1, 1 being the first capture group in the regex.
Output:
<span>Each</span> <span>word</span> <span>needs</span> <span>a</span> <span>strong</span> <span>tag</span> <span>around</span> <span>it</span>. <span>I</span> <span>really</span> <span>want</span> <span>to</span> <span>wrap</span> <span>each</span> <span>and</span> <span>every</span> <span>word</span>
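
A JavaScript sketch for the original question, assuming words consist of ASCII letters, digits and apostrophes, and that tags never contain a > inside an attribute value. It matches either a whole tag or a word and only wraps the words; the name wrapInSpans is made up. Note that the span ends up inside the em element rather than around it, which differs from the output requested in the question.

```javascript
// A tag (<...>) is matched first and returned unchanged, so its name is
// never mistaken for a word; everything else that matches is a word.
function wrapInSpans(html) {
  return html.replace(/<[^>]+>|[A-Za-z0-9']+/g, (match) =>
    match.startsWith("<") ? match : `<span>${match}</span>`
  );
}

const spanned = wrapInSpans("This... is, an. example! <em>string</em>?!");
console.log(spanned);
console.log(wrapInSpans("give 'em"));
```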
