Regex that matches a specific pattern but does not match another pattern - JavaScript

I need to put italic tags around words that are wrapped in underscores, like the formatting options here on Stack Overflow.
I can easily do this with the regular expression /_(.*?)_/gi. But I don't want to put those tags inside email addresses, URLs, etc. So I need a regex that matches the italic pattern but does not match inside a URL or email pattern.
let urlExp = /(\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|!:,.;]*[-A-Z0-9+&@#\/%=~_|])/gi;
let boldExp = /\*(.*?)\*/gi;
let italicExp = /\_(.*?)\_/gi;
let bulletedExp = /^\-\s(.*)/gm;
let emailExp = /([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6})/gi;
let modifiedText = text
.replace(urlExp, "<a href='$1' target='_blank'>$1</a>")
.replace(emailExp, '$1')
.replace(boldExp, "<strong>$1</strong>")
.replace(italicExp, "<i>$1</i>")
.replace(bulletedExp, "• $1");
return modifiedText;
Here is the code that I am working on. The issue is that the bold, italic and bullet patterns are also applied inside URLs and emails; I need to skip those two.

You can use word boundaries (\b) to require that the underscores sit at the outer edges of a word or group of words, not in the middle of one.
See regex101 for the example, using this regex:
\b_(.*?)_\b
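As a rough check of the word-boundary idea, here is a small sketch. The sample text, URL and email address are made up for illustration. Note that this only protects underscores that sit between word characters; an underscore directly after a slash or other punctuation inside a URL would still be treated as an opening marker.

```javascript
// Assumed sample input: one italic word plus a URL and an email containing underscores.
const sample = "Read _this_ at http://example.com/some_page_name or mail first_last_x@example.com";

// \b before the opening "_" and after the closing "_" requires a non-word
// character (or a string edge) on the outer side of each underscore.
const italicWithBoundaries = /\b_(.*?)_\b/g;

const italicized = sample.replace(italicWithBoundaries, "<i>$1</i>");
console.log(italicized);
// Read <i>this</i> at http://example.com/some_page_name or mail first_last_x@example.com
```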

Just check if there are spaces or beginning or end of string:
let italicExp = /(^|\s)_(.*?)_(\s|$)/gi;
let modifiedText = text.replace(italicExp, "$1<i>$2</i>$3")
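One thing to watch with the whitespace version: the trailing (\s|$) consumes the space, so a second italic word that follows immediately has no leading whitespace left to match. A possible variation (my own tweak, not part of the answer above) is to turn the trailing group into a lookahead. A minimal sketch with a made-up input:

```javascript
// Two italic words separated by a single space (illustrative input).
const adjacent = "_a_ _b_";

// Version from the answer: the trailing whitespace is consumed by the match.
const consuming = /(^|\s)_(.*?)_(\s|$)/g;
// Variation: the trailing whitespace is only checked, not consumed.
const withLookahead = /(^|\s)_(.*?)_(?=\s|$)/g;

const consumedResult = adjacent.replace(consuming, "$1<i>$2</i>$3");
const lookaheadResult = adjacent.replace(withLookahead, "$1<i>$2</i>");

console.log(consumedResult);  // <i>a</i> _b_
console.log(lookaheadResult); // <i>a</i> <i>b</i>
```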

Related

Matching Words With or Without Hyphens

So I'm basically trying to match words in a string that may or may not contain hyphens.
For instance, in the strings below:
let firstStr = "filter-table";
let secondStr = "filter-second-table";
"filter-" is a required keyword, and so, I'd want to match the words containing "filter-" followed by any character/word (hyphenated or not).
Using the following:
secondStr.match(/filter-\w+/);
"firstStr" matches correctly but not "secondStr". "secondStr" only matches "filter-second" and not the hyphenated word after - "filter-second-table".
I'd want to be able to match any potential hyphenated word as in "second-table".
Make a group that can be matched multiple times like this: /filter(-\w+)+/
myTest("filter-table");
myTest("filter-second-table");
myTest("filter-awesome-second-table");
function myTest(string){
let matches = string.match(/filter(-\w+)+/)
console.log(string, matches ? "matches!" : "doesn't match")
}
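If you need the matched text itself rather than a yes/no answer, a sketch along these lines should work. The helper name matchFilterWord is hypothetical, and the group is made non-capturing since only the whole match is used:

```javascript
// Hypothetical helper: returns the full "filter-..." word, or null if absent.
function matchFilterWord(input) {
  // (?:-\w+)+ repeats "hyphen followed by word characters" one or more times.
  const found = input.match(/filter(?:-\w+)+/);
  return found ? found[0] : null;
}

console.log(matchFilterWord("filter-second-table"));                // filter-second-table
console.log(matchFilterWord("a filter-awesome-second-table here")); // filter-awesome-second-table
console.log(matchFilterWord("filter"));                             // null
```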

JS split string on positive lookahead, avoid overlapping cases

I have a set of data that includes dated notes, all concatenated, as in the example below. Assume the date always comes at the beginning of its note. I'd like to split these into individual notes. I've used a positive lookahead so I can keep the delimiter (the date).
Here's what I'm doing:
const notes = "[3/28- A note. 3/25- Another note. 3/24- More text. 10/19- further notes. [10/18- Some more text.]"
const pattern = /(?=\d{1,2}\/\d{1,2}[- ]+)/g
console.log(notes.split(pattern))
and the result is
[ '[',
'3/28- A note. ',
'3/25- Another note. ',
'3/24- More text. ',
'1',
'0/19- further notes. [',
'1',
'0/18- Some more text.]' ]
The pattern \d{1,2} matches both 10/19 and 0/19 so it splits before both of those.
Instead I'd like to have
[ '[',
'3/28- A note. ',
'3/25- Another note. ',
'3/24- More text. ',
'10/19- further notes. [',
'10/18- Some more text.]' ]
(I can handle the extraneous brackets later.)
How can I accomplish this split with regex or any other technique?
To get your wanted output, you can prepend a word boundary in the lookahead, and you can omit the plus sign at the end of the pattern.
(?=\b\d{1,2}\/\d{1,2}[- ])
Regex demo
const notes = "[3/28- A note. 3/25- Another note. 3/24- More text. 10/19- further notes. [10/18- Some more text.]"
const pattern = /(?=\b\d{1,2}\/\d{1,2}[- ])/g
console.log(notes.split(pattern))
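Building on the word-boundary split, the leftover brackets and the empty leading piece could be cleaned up in the same chain. This is only a sketch: it assumes that square brackets never appear inside the note text itself, since it strips all of them.

```javascript
const rawNotes = "[3/28- A note. 3/25- Another note. 3/24- More text. 10/19- further notes. [10/18- Some more text.]";

const cleanedNotes = rawNotes
  // Split before each date; \b stops the split from landing inside "10/19".
  .split(/(?=\b\d{1,2}\/\d{1,2}[- ])/g)
  // Assumption: brackets are never part of a note, so drop them all.
  .map(part => part.replace(/[\[\]]/g, "").trim())
  // The piece before the first date is just "[", which is now empty.
  .filter(part => part.length > 0);

console.log(cleanedNotes);
```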
I would avoid split() here and instead use match():
var notes = "[3/28- A note. 3/25- Another note. 3/24- More text. 10/19- further notes. [10/18- Some more text.]";
var matches = notes.match(/\[?\d+\/\d+\s*-\s*.*?\.\]?/g);
console.log(matches);
You may do a further cleanup of leading/trailing brackets using regex, e.g.
var input = "[10/18- Some more text.]";
var output = input.replace(/^\[|\]$/g, "");
Try .replaceAll() and this regex:
/(\[?\d{1,2}\/\d{1,2}\-.+?)/
// Replacement
"\n$1"
Figure I - Regex
Segment      Description
(\[?         Begin capture group; match a literal "[" zero or one time
\d{1,2}\/    Match a digit one or two times, then a literal "/"
\d{1,2}\-    Match a digit one or two times, then a literal "-"
.+?)         Match any character one or more times, lazily; end capture group
Figure II - Replacement
Segment      Description
\n           New line
$1           Everything matched in the capture group (...)
const notes = "[3/28- A note. 3/25- Another note 3/24- More text. 10/19- further notes [10/18- Some more text.]";
const rgx = new RegExp(/(\[?\d{1,2}\/\d{1,2}\-.+?)/, 'g');
let result = notes.replaceAll(rgx, "\n$1");
console.log(result);

Extract text containing match between new line characters

I am trying to extract paragraphs from OCR'd contracts, using JS, when a paragraph contains key search terms. A user might search for something such as "ship ahead" to find clauses relating to whether a certain customer's orders can be shipped early.
I've been banging my head up against a regex wall for quite some time and am clearly just not grasping something.
If I have text like this and I'm searching for the word "match":
let text = "\n\nThis is an example of a paragraph that has the word I'm looking for The word is Match. \n\nThis paragraph does not have the word I want."
I would want to extract all the text between the double \n characters and not return the second sentence in that string.
I've been trying some form of:
let string = `[^\n\n]*match[^.]*\n\n`;
let re = new RegExp(string, "gi");
let body = text.match(re);
However that returns null. Oddly if I remove the periods from the string it works (sorta):
[
"This is an example of a paragraph that has the word I'm looking for The word is Match \n" +
'\n'
]
Any help would be awesome.
Extracting some text between identical delimiters containing some specific text is not quite possible without any hacks related to context matching.
Thus, you may simply split the text into paragraphs and get those containing your match:
const results = text.split(/\n{2,}/).filter(x=>/\bmatch\b/i.test(x))
You may remove word boundaries if you do not need a whole word match.
See the JavaScript demo:
let text = "\n\nThis is an example of a paragraph that has the word I'm looking for The word is Match. \n\nThis paragraph does not have the word I want.";
console.log(text.split(/\n{2,}/).filter(x=>/\bmatch\b/i.test(x)));
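For user-supplied search terms such as "ship ahead", the split-and-filter approach could be wrapped up roughly like this. The function names are hypothetical. The sketch assumes the term starts and ends with a word character (so \b makes sense), and it lets any run of whitespace in the term match any whitespace in the text, which may help with OCR line breaks.

```javascript
// Escape regex metacharacters so the user's term is matched literally.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: returns the paragraphs that contain the search term.
function findParagraphs(text, term) {
  // Whitespace inside the term matches any run of whitespace, including a single newline.
  const termPattern = escapeRegExp(term.trim()).replace(/\s+/g, "\\s+");
  const termRe = new RegExp("\\b" + termPattern + "\\b", "i");
  return text
    .split(/\n{2,}/)            // paragraphs are separated by blank lines
    .map(p => p.trim())
    .filter(p => p.length > 0 && termRe.test(p));
}

const contract = "\n\nOrders may ship\nahead of schedule.\n\nNo early delivery.";
console.log(findParagraphs(contract, "ship ahead"));
```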
That's pretty easy if you use the fact that a . matches all characters except newline by default. Use regex /.*match.*/ with a greedy .* on both sides:
const text = 'aaaa\n\nbbb match ccc\n\nddd';
const regex = /.*match.*/;
console.log(text.match(regex).toString());
Output:
bbb match ccc
Here are two ways to do it. I am not sure why you need to use a regular expression. Split seems much easier, doesn't it?
const text = "\n\nThis is an example of a paragraph that has the word I'm looking for The word is Match. \n\nThis paragraph does not have the word I want."
// regular expression one
function getTextBetweenLinesUsingRegex(text) {
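// [^\n]+ matches a run of non-newline characters (a character class cannot express "not \n\n")
const regex = /\n\n([^\n]+)\n\n/;
const arr = regex.exec(text);
// exec() returns null when nothing matches, so check before reading length
if (arr && arr.length > 1) {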
return arr[1];
}
return null;
}
console.log(`getTextBetweenLinesUsingRegex: ${ getTextBetweenLinesUsingRegex(text)}`);
console.log(`simple: ${text.split('\n\n')[1]}`);

Match string in between two strings [duplicate]

This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 3 years ago.
If I have a string like this:
var str = "play the Ukulele in Lebanon. play the Guitar in Lebanon.";
I want to get the strings between each of the substrings "play" and "in", so basically an array with "the Ukulele" and "the Guitar".
Right now I'm doing:
var test = str.match("play(.*)in");
But that's returning the string between the first "play" and the last "in", so I get "the Ukulele in Lebanon. play the Guitar" instead of 2 separate strings. Does anyone know how to globally search a string for all occurrences of a substring between a starting and ending string?
You can use the regex
play\s*(.*?)\s*in
Use the / as delimiters for regex literal syntax
Use the lazy group to match minimal possible
Demo:
var str = "play the Ukulele in Lebanon. play the Guitar in Lebanon.";
var regex = /play\s*(.*?)\s*in/g;
var matches = [];
var m;
while ((m = regex.exec(str)) !== null) {
matches.push(m[1]);
}
document.body.innerHTML = '<pre>' + JSON.stringify(matches, 0, 4) + '</pre>';
You are so close to the right answer. There are a few things you may be overlooking:
You need your match to be non-greedy; this can be accomplished by adding ? after the quantifier
Do not use the String.match() method with a global regex: it returns only the full matches and drops the capturing groups, which is not what you would expect here. An alternative is to use RegExp.exec() or String.replace(), but using replace would require a little more work, so stick to building your own array with exec
var str = "display the Ukulele in Lebanon. play the Guitar in Lebanon.";
var re = /\bplay (.+?) in\b/g;
var matches = [];
var match;
while ( match = re.exec(str) ){
matches[ matches.length ] = match[1];
}
document.getElementById('demo').innerHTML = JSON.stringify( matches );
<pre id="demo"></pre>
/\bplay\s+(.+?)\s+in\b/ig might be more specific and might work better for you.
I believe there may be some issues with the regexes offered previously. For instance, /play\s*(.*?)\s*in/g will find a match within "displaying photographs in sequence". Of course this is not what you want. One of the problems is that there is nothing specifying that "play" should be a discrete word. It needs a word boundary before it and at least one instance of white space after it (it can't be optional). Similarly, the white space after the capture group should not be optional.
The other expression offered at the time I added this, /play (.+?) in/g, lacks the word boundary token before "play" and after "in", so it will contain a match in "display blue ink". This is not what you want.
As to your expression, it was missing the word boundary and white space tokens as well. But as another mentioned, it also needed the wildcard to be lazy. Otherwise, given your example string, your match would start with the first instance of "play" and end with the 2nd instance of "in".
If issues with my offered expression are found, would appreciate feedback.
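On newer engines (ES2020 and later), String.prototype.matchAll() gives the capture groups for every match without a manual exec loop. A short sketch using the stricter expression discussed above; the sample sentence is made up to include a "display" that should be skipped and a capitalised "Play":

```javascript
const sentence = "display the Ukulele in Lebanon. play the Guitar in Lebanon. Play the Piano in Paris.";

// matchAll requires the g flag; each result exposes the capture groups by index.
const between = /\bplay\s+(.+?)\s+in\b/gi;
const captured = Array.from(sentence.matchAll(between), match => match[1]);

console.log(captured); // [ 'the Guitar', 'the Piano' ]
```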
A victim of greedy matching.
.* finds the longest possible match,
while .*? finds the shortest possible match.
For the example given str will be an array of 3 strings containing:
the Ukulele
the Guitar
Lebanon

regex lookbehind in javascript

I'm trying to match some words in text.
Working example (what I want), on regex101:
regex = /(?<![a-z])word/g
text = word 1word !word aword
Only the first three words are matched, which is what I want to achieve.
But lookbehind does not work in JavaScript :(
So now I'm trying this (regex101):
regex = /(\b|\B)word/g
text = word 1word !word aword
But now all four words match, and the word must not be preceded by another letter, only by a digit or a special character.
If I use only "\b", the 1word won't match, and if I use only "\B", the !word won't match.
Edit
The output should be ["word","word","word"]
The 1 and ! must not be included in the match, nor in another group, because I want to use it with JavaScript's .replace(regex,function(match){}), which should not process the 1 and !.
The code i use it for
for(var i = 0; i < elements.length; i++){
text = elements[i].innerHTML;
textnew = text.replace(regexp,function(match){
matched = getCrosslink(match)[0];
return "<a href='"+matched.url+"'>"+match+"</a>";
});
elements[i].innerHTML = textnew;
}
Capturing the leading character
It's difficult to know exactly what you want without seeing more output examples, but what about looking for either starts with boundary or starts with a non-letter. Like this for example:
(\bword|[^a-zA-Z]word)
Output: ['word', '1word', '!word']
Here is a working example
Capturing only the "word"
If you only want the "word" part to be captured you can use the following and fetch the 2nd capture group:
(\b|[^a-zA-Z])(word)
Output: ['word', 'word', 'word']
Here is a working example
With replace()
You can use specific capture groups when defining the replace value, so this will work for you (where "new" is the word you want to use):
var regex = /(\b|[^a-zA-Z])(word)/g;
var text = "word 1word !word aword";
text = text.replace(regex, "$1" + "new");
output: "new 1new !new aword"
Here is a working example
If you are using a dedicated function in replace, try this:
textnew = text.replace(regexp,function (allMatch, match1, match2){
matched = getCrosslink(match2)[0];
// Put the leading character (match1) back so it is not lost from the text
return match1+"<a href='"+matched.url+"'>"+match2+"</a>";
});
Here is a working example
You can use the following regex
([^a-zA-Z]|\b)(word)
Then simply use replace like this:
var str = "word 1word !word aword";
str.replace(/([^a-zA-Z]|\b)(word)/g,"$1"+"<a>$2</a>");
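For what it's worth, lookbehind assertions were added to JavaScript in ES2018, so on current engines the asker's original expression may simply work. The sketch below compares it with the capture-group workaround from the answers; the plain <a> wrapper stands in for the real link markup and is only illustrative.

```javascript
const words = "word 1word !word aword";

// Lookbehind (ES2018+): the callback only ever sees "word" itself.
const viaLookbehind = words.replace(/(?<![a-z])word/g, match => "<a>" + match + "</a>");

// Workaround for older engines: capture the leading character and put it back.
const viaGroups = words.replace(
  /(\b|[^a-zA-Z])(word)/g,
  (whole, lead, word) => lead + "<a>" + word + "</a>"
);

console.log(viaLookbehind); // <a>word</a> 1<a>word</a> !<a>word</a> aword
console.log(viaGroups === viaLookbehind); // true
```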