Replacing "[aA09.b]." with "[aA09.b]\n" in JavaScript - javascript

I have this string:
var inputString = "Some text [ some text here . some more text] . some sentence."
The . should be replaced by \n provided it isn't in between [ ].
Expected result:
"Some text [ some text here . some more text] \n some sentence\n"
I think a quick regular expression could help, but I'm not sure where to start. Any ideas?

Assuming brackets do not contain more brackets, you can use replace with a callback:
var s = inputString.replace(/(\[[^\]]*\])|(\.)/g,
function(g0,brackets,dot){ return brackets || '\n';}
);
The regex captures a bracketed chunk with (\[[^\]]*\]) so that, when that group matched, it can be returned unchanged as the replacement; when \. matched instead, the callback returns \n.
Essentially, this is "skipping" over dots inside brackets.
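As a quick check, the replacement above can be run against the string from the question. This is only a sketch: the wrapper name replaceDotsOutsideFlatBrackets is made up for illustration and is not part of the original answer, and it carries the same assumption that brackets are never nested.

```javascript
// Hypothetical wrapper around the answer's regex; assumes no nested brackets.
function replaceDotsOutsideFlatBrackets(input) {
  return input.replace(/(\[[^\]]*\])|(\.)/g, function (match, brackets) {
    // If the bracketed group matched, put it back unchanged;
    // otherwise the match was a lone dot, so emit a newline.
    return brackets || "\n";
  });
}

const inputString = "Some text [ some text here . some more text] . some sentence.";
console.log(JSON.stringify(replaceDotsOutsideFlatBrackets(inputString)));
// "Some text [ some text here . some more text] \n some sentence\n"
```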

Kobi's single regex replace solution is short, simple, fast, accurate and elegant. It correctly handles strings with non-nested bracketed structures.
JavaScript Solution [for [Nested] Brackets]
To correctly handle nested brackets using JavaScript, a more complex iterative solution is required. Since JavaScript regex syntax does not provide recursive expressions, it is impossible to match the outermost pair of matching brackets when the brackets are nested. However, it is quite easy to write a regex which correctly matches an innermost pair of matching brackets:
/\[([^[\]]*)\]/g
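To illustrate what "innermost" means here, the small sketch below runs that regex over a made-up sample string. Only bracket pairs that contain no further brackets are matched.

```javascript
// Matches a [ ... ] pair whose contents contain neither [ nor ].
const reInnerBrackets = /\[([^[\]]*)\]/g;

// Made-up sample: one nested structure and one flat pair.
const sample = "a [b [c.] d.] e [f]";
const innermost = sample.match(reInnerBrackets);

console.log(innermost); // ["[c.]", "[f]"]
```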
The tested JavaScript function below handles nested structures by iteratively matching the innermost brackets from the inside out, "hiding" the bracket and dot characters as it goes. (The square bracket and dot characters are temporarily replaced with their equivalent HTML entities.) Once all the dots within the (possibly nested) bracketed text have been "hidden", all the remaining dots in the string, (which fall outside the brackets), are replaced with line feeds. Once this is complete, all the temporarily hidden characters are restored. Since HTML entities are being used internally by this function as temporary placeholders, any pre-existing HTML entities that were in the original string are preserved at the start and then restored at the end.
function replaceDotsNotInBrackets(text) {
    // Regex to match innermost brackets capturing contents in $1.
    var re_inner_brackets = /\[([^[\]]*)\]/g;
    // Firstly, hide/protect any/all existing html entities.
    text = text.replace(/&/g, "&amp;");
    // Iteratively "Hide" dots within brackets from inside out.
    // Hide dots and brackets by converting to decimal entities:
    // Change [ to &#91;
    // Change ] to &#93;
    // Change . to &#46;
    while (text.search(re_inner_brackets) !== -1) {
        text = text.replace(re_inner_brackets,
            function(m0, m1){
                return "&#91;"+ m1.replace(/\./g, "&#46;") +"&#93;";
            });
    } // All matching brackets and contained dots are now "hidden".
    // Replace all dots outside of brackets with a linefeed.
    text = text.replace(/\./g, "\n");
    // Unhide all previously hidden brackets and dots.
    text = text.replace(/&#(?:91|46|93);/g,
        function(m0){
            return {"&#91;": "[", "&#46;": ".", "&#93;": "]"}[m0];
        });
    // Lastly, restore previously existing html entities.
    return text.replace(/&amp;/g, "&");
}
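For comparison, here is a different technique from the one in this answer: a single character scan that tracks bracket depth instead of hiding characters behind entities. It is a minimal sketch, the function name replaceDotsAtDepthZero is hypothetical, and it behaves differently on unbalanced input (an unclosed [ protects every dot after it, and a stray ] is ignored).

```javascript
// Alternative sketch (not the answer's method): count bracket depth while scanning.
function replaceDotsAtDepthZero(text) {
  let depth = 0; // how many unclosed "[" we are currently inside
  let out = "";
  for (const ch of text) {
    if (ch === "[") {
      depth++;
    } else if (ch === "]" && depth > 0) {
      depth--;
    }
    // Only dots outside every bracket pair become newlines.
    out += ch === "." && depth === 0 ? "\n" : ch;
  }
  return out;
}

console.log(JSON.stringify(replaceDotsAtDepthZero("a. [b. [c.] d.] e.")));
// "a\n [b. [c.] d.] e\n"
```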

Related

Regex excluding matches wrapped in specific bbcode tags

I'm trying to replace double quotes with curly quotes, except when the text is wrapped in certain tags, like [quote] and [code].
Sample input
[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]
<p>"Why no goodbye?" replied [b]Bob[/b]. "It's always Hello!"</p>
Expected output
[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]
<p>“Why no goodbye?” replied [b]Bob[/b]. “It's always Hello!”</p>
I figured how to elegantly achieve what I want in PHP by using (*SKIP)(*F), however my code will be run in javascript, and the javascript solution is less than ideal.
Right now I'm splitting the string at those tags, running the replace, then putting the string together:
var o = 3;
a = a
.split(/(\[(?<first>(?:icode|quote|code))[^\]]*?\](?:[\s]*?.)*?[\s]*?\[\/(?:\k<first>)\])/i)
.map(function(x,i) {
if (i == o-1 && x) {
x = '';
}
else if (i == o && x)
{
x = x.replace(/(?![^<]*>|[^\[]*\])"([^"]*?)"/gi, '“$1”')
o = o+3;
}
return x;
}).join('');
Javascript Regex Breakdown
Inside split():
(\[(?<first>icode|quote|code)[^\]]*?\](?:.)*?\[\/(\k<first>)\]) - captures the pattern inside parentheses:
\[(?<first>quote|code|icode)[^\]]*?\] - a [quote], [code], or [icode] opening tag, with or without parameters like =html, eg [code=html]
(?:[\s]*?.)*? - any 0+ (as few as possible) occurrences of any char (.), preceded or not by whitespace, so it doesn't break if the opening tag is followed by a line break
[\s]*? - 0+ whitespaces
\[\/(\k<first>)\] - [/quote], [/code], or [/icode] closing tags. Matches the text captured in the (?<first>) group. Eg: if it's a quote opening tag, it'll be a quote closing tag
Inside replace():
(?![^<]*>|[^\[]*\])"([^"]*?)" - captures text inside double quotes:
(?![^<]*>|[^\[]*\]) - negative lookahead, looks for characters (that aren't < or [) followed by either > or ] and discards them, so it won't match anything inside bbcode and html tags. Eg: [spoiler="Name"] or <span style="color: #24c4f9">. Note that matches wrapped in tags are left untouched.
" - literal opening double quotes character.
([^"]*?) - any 0+ character, except double quotes.
" - literal closing double quotes character.
SPLIT() REGEX DEMO: https://regex101.com/r/Ugy3GG/1
That's awful, because the replace is executed multiple times.
Meanwhile, the same result can be achieved with a single PHP regex. The regex I wrote was based on Match regex pattern that isn't within a bbcode tag.
(\[(?<first>quote|code|icode)[^\]]*?\](?:[\s]*?.)*?[\s]*?\[\/(\k<first>)\])(*SKIP)(*F)|(?![^<]*>|[^\[]*\])"([^"]*?)"
PHP Regex Breakdown
(\[(?<first>quote|code|icode)[^\]]*?\](?:[\s]*?.)*?[\s]*?\[\/(\k<first>)\])(*SKIP)(*F) - matches the pattern inside capturing parentheses just like javascript split() above, then (*SKIP)(*F) make the regex engine omit the matched text.
| - or
(?![^<]*>|[^\[]*\])"([^"]*?)" - captures text inside double quotes in the same way javascript replace() does
PHP DEMO: https://regex101.com/r/fB0lyI/1
The beauty of this regex is that it only needs to be run once. No splitting and joining of strings. Is there a way to implement it in javascript?
Because JS lacks backtracking verbs you will need to consume those bracketed chunks but later replace them as is. By obtaining the second side of the alternation from your own regex the final regex would be:
\[(quote|i?code)[^\]]*\][\s\S]*?\[\/\1\]|(?![^<]*>|[^\[]*\])"([^"]*)"
But the tricky part is using a callback function with replace() method:
str.replace(regex, function($0, $1, $2) {
return $1 ? $0 : '“' + $2 + '”';
})
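The ternary operator above returns $0 (the whole match) if the first capturing group matched; otherwise it wraps the second capturing group's value in curly quotes and returns that.
Note: this may fail in some cases, for example when a protected tag is nested inside another tag of the same name, because the lazy [\s\S]*? stops at the first matching closing tag.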
See live demo here
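Put together, the regex and callback from this answer can be sketched as a complete function. The g and i flags and the names protectedOrQuoted and curlyQuotes are assumptions added here for illustration; the answer itself does not specify flags.

```javascript
// Assumed flags: g to replace every match, i to accept upper-case tag names.
const protectedOrQuoted =
  /\[(quote|i?code)[^\]]*\][\s\S]*?\[\/\1\]|(?![^<]*>|[^\[]*\])"([^"]*)"/gi;

function curlyQuotes(str) {
  return str.replace(protectedOrQuoted, function (whole, tagName, quoted) {
    // tagName is set only when the first alternative (a protected
    // [quote]/[code]/[icode] block) matched: return it untouched.
    // Otherwise wrap the quoted text in curly quotes.
    return tagName ? whole : "\u201C" + quoted + "\u201D";
  });
}

const sampleInput = `[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]
<p>"Why no goodbye?" replied [b]Bob[/b]. "It's always Hello!"</p>`;
console.log(curlyQuotes(sampleInput));
```

The protected block is consumed by the first alternative so that the quote-matching alternative never gets a chance to start inside it, which is the job (*SKIP)(*F) does in the PHP version.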
Nested markup is hard to parse with regular expressions, and with JS's RegExp in particular. Complex regular expressions are also hard to read, maintain, and debug. If your needs are simple (a tag content replacement with some banned tags excluded), consider a simple code-based alternative to run-on RegExps:
function curly(str) {
var excludes = {
quote: 1,
code: 1,
icode: 1
},
xpath = [];
return str.split(/(\[[^\]]+\])/) // breakup by tag markup
.map(x => { // for each tag and content:
if (x[0] === "[") { // tag markup:
if (x[1] === "/") { // close tag
xpath.pop(); // remove from current path
} else { // open tag
xpath.push(x.slice(1).split(/\W/)[0]); // add to current path
} //end if open/close tag
} else { // tag content
if (xpath.every(tag =>!excludes[tag])) x = x.replace(/"/g, function repr() {
return (repr.z = !repr.z) ? "“" : "”"; // flip flop return value (naive)
});
} //end if markup or content?
return x;
}) // end term map
.join("");
} /* end curly() */
var input = `[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]
<p>"Why no goodbye?" replied [b]Bob[/b]. "It's always Hello!"</p>`;
var wants = `[quote="Name"][b]Alice[/b] said, "Hello world!"[/quote]
<p>“Why no goodbye?” replied [b]Bob[/b]. “It's always Hello!”</p>`;
curly(input) == wants; // true
To my eyes, even though it is a bit longer, code allows documentation, indentation, and explicit naming that make these sorts of semi-complicated logical operations easier to understand.
If your needs are more complex, use a true BBCode parser for JavaScript and map/filter/reduce its model as needed.

How to write regexp for finding :smile: in javascript?

I want to write a regular expression, in JavaScript, for finding the string starting and ending with :.
For example "hello :smile: :sleeping:" from this string I need to find the strings which are starting and ending with the : characters. I tried the expression below, but it didn't work:
^:.*\:$
My guess is that you not only want to find the string, but also replace it. For that you should look at using a capture in the regexp combined with a replacement function.
const emojiPattern = /:(\w+):/g
function replaceEmojiTags(text) {
return text.replace(emojiPattern, function (tag, emotion) {
// The emotion will be the captured word between your tags,
// so either "smile" or "sleeping" in your example
//
// In this function you would take that emotion and return
// whatever you want based on the input parameter and the
// whole tag would be replaced
//
// As an example, let's say you had a bunch of GIF images
// for the different emotions:
return '<img src="/img/emoji/' + emotion + '.gif" />';
});
}
With that code you could then run your function on any input string and replace the tags to get the HTML for the actual images in them. As in your example:
replaceEmojiTags('hello :smile: :sleeping:')
// 'hello <img src="/img/emoji/smile.gif" /> <img src="/img/emoji/sleeping.gif" />'
EDIT: To support hyphens within the emotion, as in "big-smile", the pattern needs to be changed since it is only looking for word characters. For this there is probably also a restriction such that the hyphen must join two words so that it shouldn't accept "-big-smile" or "big-smile-". For that you need to change the pattern to:
const emojiPattern = /:(\w+(-\w+)*):/g
That pattern is looking for any word that is then followed by zero or more instances of a hyphen followed by a word. It would match any of the following: "smile", "big-smile", "big-smile-bigger".
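A small sketch of that hyphen-aware pattern in use is shown below. The helper name listEmojiNames and the sample string are made up for illustration. Note that \w also accepts digits and underscores.

```javascript
const emojiPattern = /:(\w+(-\w+)*):/g;

// Hypothetical helper: collect the captured names of every tag in the text.
function listEmojiNames(text) {
  // matchAll works on a copy of the regex, so the shared pattern's
  // lastIndex is not disturbed between calls.
  return Array.from(text.matchAll(emojiPattern), (m) => m[1]);
}

console.log(listEmojiNames("hello :smile: :big-smile: :-big-smile: :big-smile-:"));
// ["smile", "big-smile"]
```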
The ^ and $ are anchors (start and end respectively). These cause your regex to explicitly match an entire string which starts with : has anything between it and ends with :.
If you want to match characters within a string you can remove the anchors.
Your * indicates zero or more so you'll be matching :: as well. It'll be better to change this to + which means one or more. In fact if you're just looking for text you may want to use a range [a-z0-9] with a case insensitive modifier.
If we put it all together we'll have a regex like this: /:([a-z0-9]+):/gmi
It matches a : followed by one or more alphanumeric characters and a closing :, with the modifiers g (global), m (multi-line) and i (case-insensitive, for things like :FacePalm:). The m flag has no effect here, since the pattern no longer contains ^ or $.
Using it in JavaScript we can end up with:
var mytext = 'Hello :smile: and jolly :wave:';
var matches = mytext.match(/:([a-z0-9]+):/gmi);
// matches = [':smile:', ':wave:'];
You'll have an array with each match found.
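The effect of the anchors, and of * versus +, can be checked with a few one-liners. This is just an illustrative sketch using the strings from the question; the variable names are made up.

```javascript
const anchoredPattern = /^:.*\:$/;          // the asker's original attempt
const unanchoredPattern = /:([a-z0-9]+):/gi;
const sentence = "hello :smile: :sleeping:";

// Anchored: only succeeds if the WHOLE string starts and ends with ":".
const wholeSentence = anchoredPattern.test(sentence);              // false
const onlyTags = anchoredPattern.test(":smile: :sleeping:");       // true, but as ONE match

// Unanchored: finds each tag inside the string.
const tags = sentence.match(unanchoredPattern);                    // [":smile:", ":sleeping:"]

// "*" allows an empty name, "+" does not.
const starMatchesEmpty = /:.*:/.test("::");                        // true
const plusMatchesEmpty = /:.+:/.test("::");                        // false

console.log(wholeSentence, onlyTags, tags, starMatchesEmpty, plusMatchesEmpty);
```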

Regex example to match pseudo element's content property

I am trying to parse the pseudo selector content in javascript.
Html content can be
content: counter(item)" " attr(data) "" counter(item1,decimal) url('test.jpeg') "hi" attr(xyz);
To parse this content I am using the regex below (the parenthesis-matching logic is copied from the internet):
counter\((?:[^)(]+|\((?:[^)(]+|\([^)(]*\))*\))*\)
This selects every counter together with its parentheses, although counter cannot contain nested parentheses (as far as I know; correct me if I am wrong). I am using the same regex to select the other content as well.
Attr : attr\((?:[^)(]+|\((?:[^)(]+|\([^)(]*\))*\))*\)
Quotes: openQuote\((?:[^)(]+|\((?:[^)(]+|\([^)(]*\))*\))*\)
String: anything inside double/single quotes: (current regex is not working ".*")
I have below questions here
1. Regex to match single parenthesis (no nested parenthesis is possible in pseudo selector content property)
2.Single regex that will match the counter, attribute , url and string content in the given order (order is important because i want to replace them later with evaluated values)
Please let me know if any more information is required from my side.
Thanks
Your first regex does indeed match nested parentheses (but not escaped parentheses). Is that desirable?
Without nesting or escaping, these become much simpler.
Here's a variant of your first regex that ignores nesting possibilities:
counter\([^)]*\)
It matches a literal counter( and then zero or more non-close-parentheses, then finally a close parenthesis. (Full explanations of your first regex and my simpler version at regex101.)
I believe that answers your first question, though if you're literally looking for a "regex to match [a] single parenthesis," that's just [()], which will match either an open or a close parenthesis character. You could alternatively explicitly match \( or \) if you know which one you want to match.
Matching quotes (without regard to nesting or escaped quotes) is similarly easy:
"[^"]*"
This matches a literal double quote character ("), then zero or more non-doublequote characters, then another literal double quote character.
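Applied to the content value from the question, those two simple patterns behave as in the sketch below. The g flag and the variable names are assumptions added so that every occurrence is listed.

```javascript
const contentValue =
  `counter(item)" " attr(data) "" counter(item1,decimal) url('test.jpeg') "hi" attr(xyz)`;

// No nesting assumed: "counter(" up to the first ")".
const counters = contentValue.match(/counter\([^)]*\)/g);
// Double-quoted strings, ignoring escaped quotes.
const doubleQuoted = contentValue.match(/"[^"]*"/g);

console.log(counters);     // ["counter(item)", "counter(item1,decimal)"]
console.log(doubleQuoted); // ['" "', '""', '"hi"']
```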
Your second request was for a "single regex that will match the counter, attribute , url and string content in the given order (order is important because i want to replace them later with evaluated values)."
I'm not sure how you intend to get the CSS content property's value, given that it typically lives on an ::after or ::before pseudo-element, which is not a node in the DOM, but here's some dummy code populating it so we can manipulate it:
var css = `content: counter(item)" " attr(data) "" counter(item1,decimal) url('test.jpeg') "hi" attr(xyz); color:red;`;
// harvest last `content` property (this is tricked by `content: "content: blah"`)
var content = css.match(/.*\bcontent:\s*([^;"']*(?:"[^"]*"[^;"']*|'[^']*'[^;"']*)*)/);
if (content) {
var part, part_re = /(?:"([^"]*)"|'([^']*)'|(?:counter|attr|url)\(([^)]*)\))/g;
while ( (part = part_re.exec(content[1])) ) { // parse on just the value
if (part[0].match(/^"/)) { /* do stuff to part[1] */ }
else if (part[0].match(/^'/)) { /* do stuff to part[2] */ }
else if (part[0].match(/^counter/)) { /* do stuff to part[3] */ }
else if (part[0].match(/^attr/)) { /* do stuff to part[3] */ }
else if (part[0].match(/^url/)) { /* do stuff to part[3] */ }
// silently skips other values, like `open-quote` or `counters(name, string)`
}
}
The first regex (the one whose result is assigned to content) extracts the last content property from the CSS (last because it'll override previous instances, though note the fact that this'll stupidly extract content: blah from content: "content: blah"). After finding the last instance of a word break and then content:, it absorbs any whitespace and then matches the rest of the line until a semicolon, double quote, or single quote. A non-capture group allows for any content between double quotes or single quotes, much in the same way we matched quotes near the top of this answer. (Full explanation of this CSS content regex at regex101.)
The second regex (assigned to part_re) is in a while loop so we can work on each individual value in the content property in order. It matches double-quoted strings or single-quoted strings or certain named values (counter or attr or url). See the conditionals and comments for where the values' data are stored. Full explanation of this value parsing regex at regex101 (see "Match Information" in the middle of the right column to see how I'm storing the values' data).
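As a variation on the loop above, the parts can be collected into an ordered list of tokens, which makes the later "replace with evaluated values" step easier. This is a sketch under assumptions: it adds one extra capture group for the function name (the answer's regex does not have it), and the names partRe, tokenizeContent and the token shape are hypothetical.

```javascript
// Variant of part_re with the function name captured in group 3 (an assumption).
const partRe = /"([^"]*)"|'([^']*)'|(counter|attr|url)\(([^)]*)\)/g;

function tokenizeContent(value) {
  const tokens = [];
  for (const m of value.matchAll(partRe)) {
    if (m[1] !== undefined) {
      tokens.push({ type: "string", value: m[1] }); // double-quoted
    } else if (m[2] !== undefined) {
      tokens.push({ type: "string", value: m[2] }); // single-quoted
    } else {
      tokens.push({ type: m[3], value: m[4] });     // counter / attr / url
    }
  }
  return tokens; // in source order
}

const parsed = tokenizeContent(
  `counter(item)" " attr(data) "" counter(item1,decimal) url('test.jpeg') "hi" attr(xyz)`
);
console.log(parsed.map((t) => t.type).join(" "));
// counter string attr string counter url string attr
```

Comparing against undefined rather than testing truthiness matters here, because an empty string such as "" is a valid captured value.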

Validate a search expression with quotes and *

I'm trying to write a regular expression to validate some user entered text. I'd like to use the regular expression with an ngPattern directive so should avoid using the g flag.
Essentially, there are a number of "simple" rules.
There must be one or more words.
Single quotes (') are not allowed.
Double quotes (") are allowed but must be paired, i.e. open and closing.
Paired double quotes must wrap one or more words.
No white space is allowed between a double quote and the word it adjacently wraps.
An asterisk (*) is not allowed unless it immediately precedes a closing double quote and follows a word, without whitespace.
Here are some examples.
example match
'' false
' ' false
' foo' true
'foo' true
'foo bar' true
'foo bar*' false
'"foo' false
'"foo"' true
'" foo"' false
'"foo "' false
'"foo bar"' true
'"foo *"' false
'"foo*"' true
'foo*"' false
'"foo*" "bar*"' true
'foo "bar*"' true
'"foo* bar"' false
'"foo*" bar' true
I've created unit tests here
I'm struggling to get anywhere close,
I've got an expression like this
/(")(?:(?=(\\?))\2.)*?\1/
that will match text between paired double quotes. Something like this,
/^.*\*"$/
will match text that ends with '*"',
as you can see, I've got a long way to go, please help.
Is it possible that a regular expression is the wrong way to do this?
^(?=.*\b)(?=[^"]*("[^"]*"[^"]*)*$)(?![^"]*("[^"]*"[^"]*)*" *")(?!.*\*[^"])(?!.*[ "]\*)(?![^"]*("[^"]*"[^"]*)*[^"]*\*")(?![^"]*("[^"]*"[^"]*)*" \w)(?![^"]*("[^"]*"[^"]*)*"[^"]*\w ")'[^']*'$
See it in action
Good luck using this in your production codebase
OK, so here is how it works...
An important idea that we are going to reuse is how to reach a position, before which you know there were an even number of "s. Namely:
[^"]*("[^"]*"[^"]*)*
Unfortunately, we can't reuse patterns in javascript regexes, so we will have to repeat it wherever we need it. Namely:
Double quotes (") are allowed but must be paired, i.e. open and closing.
^(?=__even_quotes_pattern__$)
Basically, we say that from the start (^), when we iterate til the end ($) we match the said pattern, aka even number of ".
No white space is allowed between a double quote and the word it adjacently wraps.
We will split this in two parts - doesn't happen on the left, doesn't happen on the right:
^(?!__even_quotes_pattern__" \w)
^(?!__even_quotes_pattern__\w ")
Paired double quotes must wrap one or more words.
^(?!__even_quotes_pattern__" *")
(there are no paired quotes that wrap only spaces)
The rest of them are easier:
There must be one or more words.
^(?=.*\b)
(at some point there is a word boundary (\b))
Single quotes (') are not allowed.
(or from the interpretation in the comments, not allowed except for the ones that wrap the string)
^'[^']*'$
An asterisk (*) is not allowed unless it immediately precedes a closing double quote and follows a word, without whitespace.
We will split this into three parts:
(1) Must precede a ":
(?!.*\*[^"])
(2) Must not follow a space or a ":
(?!.*[ "]\*)
(3) The " it precedes must not be an opening (non-closing) one:
(?!__even_quotes_pattern__[^"]*\*")
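Although a regex literal cannot reuse a sub-pattern, the repetition can be avoided by assembling the source string in code. The sketch below rebuilds the full expression from the pieces explained above; the names EVEN and searchPattern are hypothetical. As in this answer's interpretation, the tested strings include the wrapping single quotes.

```javascript
// The "even number of double quotes so far" sub-pattern, written once.
const EVEN = String.raw`[^"]*("[^"]*"[^"]*)*`;

const searchPattern = new RegExp([
  "^",
  String.raw`(?=.*\b)`,                  // at least one word
  `(?=${EVEN}$)`,                        // quotes are paired
  `(?!${EVEN}" *")`,                     // no pair wrapping only spaces
  String.raw`(?!.*\*[^"])`,              // * must precede a "
  String.raw`(?!.*[ "]\*)`,              // * must not follow a space or "
  String.raw`(?!${EVEN}[^"]*\*")`,       // the " after * must not be an opening one
  String.raw`(?!${EVEN}" \w)`,           // no space after an opening "
  String.raw`(?!${EVEN}"[^"]*\w ")`,     // no space before a closing "
  "'[^']*'$",                            // wrapped in single quotes
].join(""));

const samples = ["'foo'", "'\"foo*\"'", "'foo \"bar*\"'", "'\"foo'", "'foo bar*'", "'\" foo\"'", "''"];
for (const s of samples) {
  console.log(s, searchPattern.test(s));
}
```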
clean and simple function:
function myParser(string) {
var string = string.trim(),
wordsArray = string.split(' '),
regExp = /((?=")^["][a-zA-Z]+[*]{0,1}["]$|^[a-zA-Z]+$)/,
len = wordsArray.length,
i = 0;
// '"foo bar"' situation
if (string.match(/^["][a-zA-Z]+[\s]?[a-zA-Z]+[*]{0,1}["]$/)) {
return true;
}
for (i; i < len; i++) {
var result = wordsArray[i].match(regExp);
if (result === null) {
return false;
}
}
return true;
}
https://jsfiddle.net/cy9ozmdm/ to check results.
If you need an explanation, ask in a comment and I will describe the logic in detail.
(An idea for you: run both variants, the pure regExp and the function, against 10,000 test cases and see which one is faster, and which one doesn't fail :))

What does this JS do?

var passwordArray = pwd.replace(/\s+/g, '').split(/\s*/);
I found the above line of code in a rather poorly documented JavaScript file, and I don't know exactly what it does. I think it splits a string into an array of characters, similar to PHP's str_split. Am I correct, and if so, is there a better way of doing this?
it replaces any spaces from the password and then it splits the password into an array of characters.
It is a bit redundant to convert a string into an array of characters, because you can already access the characters of a string through bracket notation (not in older IE :( ) or through the string method "charAt":
var a = "abcdefg";
alert(a[3]);//"d"
alert(a.charAt(1));//"b"
It does the same as: pwd.split(/\s*/).
pwd.replace(/\s+/g, '').split(/\s*/) removes all whitespace (tab, space, CR, LF etc.) and splits the remainder (the string that is returned from the replace operation) into an array of characters. The split(/\s*/) portion is strange and redundant, because there shouldn't be any whitespace (\s) left in pwd.
Hence pwd.split(/\s*/) should be sufficient, as long as pwd has no leading or trailing whitespace (which would otherwise leave an empty string at that end of the array). So:
'hello cruel\nworld\t how are you?'.split(/\s*/)
// prints in alert: h,e,l,l,o,c,r,u,e,l,w,o,r,l,d,h,o,w,a,r,e,y,o,u,?
as will
'hello cruel\nworld\t how are you?'.replace(/\s+/g, '').split(/\s*/)
The replace portion removes all white space from the password. The \s+ atom matches a run of one or more white space characters. The 'g' flag makes it match every such run, and each one is replaced with an empty string.
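On the "is there a better way" part of the question, the sketch below compares the original line with two simpler ways to split the cleaned string into characters, and shows the edge case where dropping the replace step changes the result. The variable names are made up for illustration.

```javascript
const pwd = "hello cruel\nworld\t how are you?";
const cleaned = pwd.replace(/\s+/g, "");

const viaOriginal = cleaned.split(/\s*/); // the line from the question
const viaSplitEmpty = cleaned.split("");  // one entry per UTF-16 code unit
const viaSpread = [...cleaned];           // one entry per code point

console.log(viaOriginal.join(","));
// h,e,l,l,o,c,r,u,e,l,w,o,r,l,d,h,o,w,a,r,e,y,o,u,?

// Without the replace step, leading/trailing whitespace leaves empty strings:
const padded = " a b ".split(/\s*/);
console.log(padded); // ["", "a", "b", ""]
```

For plain ASCII passwords all three approaches give the same array; split("") and the spread syntax only differ for characters outside the Basic Multilingual Plane.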
