Javascript regex: How to find length of a match?

I got a task to write a code highlighter for C#. Everything works well, but I want to optimize indentation. I have a regexp, /(\t|[ ]{4})/g, that I use to replace each tab or group of 4 spaces with <span style="margin-left: 2em;" />. It looks good, but it creates a lot of unnecessary spans. I want to use something like /^[ ]{x}/g and replace with <span style='margin-left: "+(0.5*x)+"em;' /> so that there is only one span per line with the appropriate margin. str.match() won't work because it searches the whole document, not each line.

If your regular expression has the g flag, you can execute it over and over again, getting all matches from the string, including the length of the match:
var re = /^(\t|[ ]{4})/gm; // m: ^ matches at the start of every line, not only at the start of the text
var match;
while ((match = re.exec(text)) !== null) {
    // use match.index and match[0].length
}
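To get one span per line, as the question asks, the same loop can be pointed at whole indentation runs. The sketch below is only an illustration: it assumes the m flag (so ^ matches at the start of every line) and a + quantifier (so one match covers the full run). indentRuns and indentToSpans are hypothetical helper names, and the 2em-per-level figure is taken from the question.

```javascript
// Hypothetical helper: list every leading indentation run with its position and length.
function indentRuns(text) {
  // "g" lets exec() resume after the previous match; "m" makes ^ match at each line start.
  var re = /^(?:\t|[ ]{4})+/gm;
  var runs = [];
  var match;
  while ((match = re.exec(text)) !== null) {
    runs.push({ index: match.index, length: match[0].length });
  }
  return runs;
}

// Hypothetical helper: replace each run with a single span (2em per level, as in the question).
function indentToSpans(text) {
  return text.replace(/^(?:\t|[ ]{4})+/gm, function (indent) {
    // Normalise every group of 4 spaces to one tab, then count the levels.
    var levels = indent.replace(/[ ]{4}/g, "\t").length;
    return '<span style="margin-left: ' + (2 * levels) + 'em;" />';
  });
}

console.log(indentRuns("a\n\tb\n        c"));
console.log(indentToSpans("a\n\tb\n        c"));
```

Because the pattern requires at least one tab or group of 4 spaces, it never produces a zero-length match, so the exec() loop cannot stall on the same position.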

Related

Repeat a javascript replace until no change is made?

I have a short but complex regular expression to trim spaces regardless of html tags present in the string.
var text = "<span><span>ex ample </span> </span>";
// trim from start; not relevant in this example
text = text.replace(/^((<[^>]*>)*)\s+/g, "$1");
// trim from end
text = text.replace(/\s+((<[^>]*>)*)$/g, "$1");
console.log(text);
<span><span>ex ample </span> </span> - example input
<span><span>ex ample</span></span> - expected output
<span><span>ex ample </span></span> - observed output
How do I achieve my expected output?
I've tried adding the /g flag because it should supposedly match more than once and that should fix it (running the replace twice does work for the example) but it doesn't seem to repeat anything at all.
Alternative ways to trim strings regardless of tags are also appreciated because that is my primary objective. The secondary objective is learning why this didn't work.

You need to add some meaning to your tags, some need their spaces, some don't.
Try this:
text.replace(/\s*(<\/?(span|div)>)\s*/g, "$1")
.trim()
.replace(/\s+/g, ' ');
It:
1. removes spaces around the tags "surrounding" the content
2. trims spaces around the whole string
3. collapses redundant spaces
The list of "surrounding" tags can be changed to include things like tr...
Steps 2 and 3 might come first to speed things up.
Tried it with:
var text = "<div> <i>ano</i> <b>ther</b> <span> <b>my</b> <i>ex</i> <u> ample </u> </span> </div>";
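A minimal sketch of the three steps wrapped in a function, so the list of "surrounding" tags can be varied. trimAroundTags and its tagNames parameter are hypothetical names, and the sketch assumes plain tag names without attributes, just like the pattern in this answer.

```javascript
// Hypothetical helper: tagNames lists the "surrounding" tags whose neighbouring spaces are dropped.
// Assumes the names are plain words (no attributes, no regex metacharacters).
function trimAroundTags(text, tagNames) {
  var tagPattern = new RegExp("\\s*(</?(?:" + tagNames.join("|") + ")>)\\s*", "g");
  return text
    .replace(tagPattern, "$1") // 1. drop spaces around the listed tags
    .trim()                    // 2. trim the whole string
    .replace(/\s+/g, " ");     // 3. collapse remaining runs of whitespace
}

console.log(trimAroundTags("<span><span>ex ample </span> </span>", ["span", "div"]));
// <span><span>ex ample</span></span>
```

Tags that are not in the list (i, b, u in the test string above) keep the spaces around them, which is the point of giving "meaning" to the tags.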
First answer, prior to comments.
The idea is to remove all spaces between:
a non-space character and an opening tag
a closing tag and a non-space character
text.replace(/([^\s])\s*(<)/g, "$1$2")
.replace(/([>])\s*([^\s])/g, "$1$2")
.trim();

Preamble: don't just copy this, read to the end.
Thinking from the other way around - by replacing until no match is found instead of until no change is made, this seems to work very simply.
var text = "<span><span>ex ample </span> </span>";
var trim_start = /^((<[^>]*>)*)\s+/;
while(text.match(trim_start)) {
text = text.replace(trim_start, "$1");
}
var trim_end = /\s+((<[^>]*>)*)$/;
while (text.match(trim_end)) {
text = text.replace(trim_end, "$1");
}
console.log(text);
The output is as expected: the only remaining space is the one between "ex" and "ample".
But this has a big problem if a replace might not change anything. Simply changing \s+ to \s* turns it into an infinite loop. So, all in all, it works for my case but is not robust: to use it, you must be completely sure that every single replace changes something whenever the regex matches.
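One way around the infinite-loop risk is to loop until the string stops changing, as the question title suggests, rather than until the regex stops matching. The following is a sketch under that assumption: replaceUntilStable is a hypothetical helper name and maxPasses is an extra safety cap that is not part of the original code.

```javascript
// Hypothetical helper: repeat a replace until the text stops changing.
function replaceUntilStable(text, pattern, replacement, maxPasses) {
  var limit = maxPasses || 100; // safety cap in case the replacement keeps growing the text
  for (var i = 0; i < limit; i++) {
    var next = text.replace(pattern, replacement);
    if (next === text) {
      return next; // nothing changed, so another pass cannot help
    }
    text = next;
  }
  return text;
}

var input = "<span><span>ex ample </span> </span>";
console.log(replaceUntilStable(input, /\s+((<[^>]*>)*)$/, "$1"));
// <span><span>ex ample</span></span>

// The \s* variant still matches after the last space is gone,
// but the replace no longer changes anything, so the loop stops.
console.log(replaceUntilStable(input, /\s*((<[^>]*>)*)$/, "$1"));
// <span><span>ex ample</span></span>
```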

Remove everything after constant using regex

I've got XML that has additional information, BLAH, in each tag. When creating the tags, I've separated the extra info from the tag name with a constant (XMLSPLIT as constant XML_SPLITTER)... I needed to do this because I'm generating my XML from a JSON object and I can't have multiple keys that are the same thing... but in the XML output, can't have that superfluous stuff.
For example:
....
<SetXMLSPLITBLAH>
<Value>9</Value>
<SetType>
<Name>Foo</Name>
</SetType>
</SetXMLSPLITBLAH>
...
So, after generating the XML, I go through and clean it. I'm trying to do it with a regex. I figure, I want to remove anything on a line after the splitter and replace it with just the >.
let reg = new RegExp("<Set"+XML_SPLITTER+"(.*)\/g");
cleanXML = dirtyXML.replace(reg, "<Set>")
This fails to work.
I will note that I tried reg = /<Set(.*)/g; and that worked just fine... but it also captures "SetType" and any other use of a tag that starts with "

It's because ^ is a Regex special character that indicates "beginning of line". You'd need to escape it like \^ for this to work. Something like /<Set\^\^[^>]*>/g should do the trick.
Small note: The above regex assumes that the "BLAH" string in your example will never contain the > character... but if it does, then your XML is super malformed anyway.
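When the splitter lives in a variable, the escaping can be done programmatically before the pattern is built. This is a sketch, not the asker's code: escapeRegExp and stripSplitterSuffix are hypothetical helper names. Note also that flags belong in the second argument of the RegExp constructor, not inside the pattern string.

```javascript
// Hypothetical helper: escape regex metacharacters so a string is matched literally.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: drop the splitter and everything after it, up to the closing ">".
function stripSplitterSuffix(xml, splitter) {
  // Flags ("g") are passed as the second argument.
  var re = new RegExp("(</?Set)" + escapeRegExp(splitter) + "[^>]*>", "g");
  return xml.replace(re, "$1>");
}

var dirty = "<Set^^BLAH><Value>9</Value><SetType></SetType></Set^^BLAH>";
console.log(stripSplitterSuffix(dirty, "^^"));
// <Set><Value>9</Value><SetType></SetType></Set>
```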

Using .* will match > and if - for some reason - your XML file is not broken up into multiple lines (i.e. minified), you'll match more than you should. To avoid this, you can use [^>]* to match everything up to the >.
Since you've gracefully included a splitter, it'll make matching much easier and much more predictable (as you mentioned, you match SetType without a splitter).
Without a splitter, you'd have to use a regex pattern that resembles <Set(?!Type>)[^>]* or <Set(?!(?:Type|SomethingElse)>)[^>]* if you had more than just one suffix to Set that should remain. These methods use a negative lookahead to assert what follows does not match.
var str = `<SetXMLSPLITBLAH>
<Value>9</Value>
<SetType>
<Name>Foo</Name>
</SetType>
</SetXMLSPLITBLAH>`
var XML_SPLITTER = 'XMLSPLIT'
var p = `(</?)Set${XML_SPLITTER}[^>]*`
var r = new RegExp(p,'g')
var x = str.replace(r,'$1Set')
console.log(x)
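For comparison, here is a sketch of the splitter-free variant with the negative lookahead described above. The sample string is made up for illustration, and the pattern is extended with (<\/?Set) as a capture so that closing tags are handled too.

```javascript
// Made-up sample without a splitter: "BLAH" directly follows "Set".
var noSplitter = "<SetBLAH><SetType><Name>Foo</Name></SetType></SetBLAH>";

// (?!Type>) refuses to match when "Set" is followed by "Type>", so <SetType> survives.
var cleaned = noSplitter.replace(/(<\/?Set)(?!Type>)[^>]*/g, "$1");

console.log(cleaned);
// <Set><SetType><Name>Foo</Name></SetType></Set>
```

Every additional tag name that starts with Set has to be added to the lookahead by hand, which is why the splitter version is the more predictable of the two.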

How to write regexp for finding :smile: in javascript?

I want to write a regular expression, in JavaScript, for finding the string starting and ending with :.
For example "hello :smile: :sleeping:" from this string I need to find the strings which are starting and ending with the : characters. I tried the expression below, but it didn't work:
^:.*\:$

My guess is that you not only want to find the string, but also replace it. For that you should look at using a capture in the regexp combined with a replacement function.
const emojiPattern = /:(\w+):/g
function replaceEmojiTags(text) {
return text.replace(emojiPattern, function (tag, emotion) {
// The emotion will be the captured word between your tags,
// so either "smile" or "sleeping" in your example
//
// In this function you would take that emotion and return
// whatever you want based on the input parameter and the
// whole tag would be replaced
//
// As an example, let's say you had a bunch of GIF images
// for the different emotions:
return '<img src="/img/emoji/' + emotion + '.gif" />';
});
}
With that code you could then run your function on any input string and replace the tags to get the HTML for the actual images in them. As in your example:
replaceEmojiTags('hello :smile: :sleeping:')
// 'hello <img src="/img/emoji/smile.gif" /> <img src="/img/emoji/sleeping.gif" />'
EDIT: To support hyphens within the emotion, as in "big-smile", the pattern needs to be changed since it is only looking for word characters. For this there is probably also a restriction such that the hyphen must join two words so that it shouldn't accept "-big-smile" or "big-smile-". For that you need to change the pattern to:
const emojiPattern = /:(\w+(-\w+)*):/g
That pattern is looking for any word that is then followed by zero or more instances of a hyphen followed by a word. It would match any of the following: "smile", "big-smile", "big-smile-bigger".
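A short sketch to check which tags the hyphen-aware pattern accepts. findEmojiNames is a hypothetical helper name, and the inner group is written as non-capturing, (?:-\w+)*, which is an assumption made here so that only the full name is captured.

```javascript
var hyphenPattern = /:(\w+(?:-\w+)*):/g;

// Hypothetical helper: collect the captured names of all emoji tags in the text.
function findEmojiNames(text) {
  var names = [];
  var match;
  hyphenPattern.lastIndex = 0; // the regex is global, so reset its position before reuse
  while ((match = hyphenPattern.exec(text)) !== null) {
    names.push(match[1]);
  }
  return names;
}

console.log(findEmojiNames("hello :smile: :big-smile: :-big: :smile-:"));
// [ 'smile', 'big-smile' ]
```

The tags with a leading or trailing hyphen are skipped because each hyphen must sit between two runs of word characters.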

The ^ and $ are anchors (start and end respectively). These cause your regex to explicitly match an entire string which starts with : has anything between it and ends with :.
If you want to match characters within a string you can remove the anchors.
Your * indicates zero or more so you'll be matching :: as well. It'll be better to change this to + which means one or more. In fact if you're just looking for text you may want to use a range [a-z0-9] with a case insensitive modifier.
If we put it all together we'll have regex like this /:([a-z0-9]+):/gmi
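This matches a string that begins with :, contains one or more alphanumeric characters, and ends with :. The modifiers are g (global), m (multi-line) and i (case insensitive, for things like :FacePalm:). Since the pattern no longer contains ^ or $, the m modifier has no effect here and can be dropped.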
Using it in JavaScript we can end up with:
var mytext = 'Hello :smile: and jolly :wave:';
var matches = mytext.match(/:([a-z0-9]+):/gmi);
// matches = [':smile:', ':wave:'];
You'll have an array with each match found.
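The effect of the anchors and of * versus + can be checked directly. This is only a small illustration using the sample string from the question.

```javascript
var sample = "hello :smile: :sleeping:";

// Anchored: the whole string would have to start and end with ":".
console.log(/^:.*:$/.test(sample));    // false
console.log(/^:.*:$/.test(":smile:")); // true

// Unanchored with "+": every tag inside the string is found.
var tags = sample.match(/:[a-z0-9]+:/gi);
console.log(tags); // [ ':smile:', ':sleeping:' ]

// "*" instead of "+" would also accept the empty tag "::".
console.log(/:[a-z0-9]*:/i.test("a::b")); // true
console.log(/:[a-z0-9]+:/i.test("a::b")); // false
```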

regex encapsulation

I've got a question concerning regex.
I was wondering how one could replace an encapsulated text, something like {key:23}, with something like <span class="highlightable">23</span>, so that the entity still remains encapsulated, but with something else.
I will do this in JS, but the regex is what is important, I have been searching for a while, probably searching for the wrong terms, I should probably learn more about regex, generally.
In any case, is there someone who knows how to perform this operation with simplicity?
Thanks!

It's important that you find {key:23} in your text first and then replace it with the syntax you want. This way you avoid replacing {key:'sometext'} with that syntax, which is unwanted.
var str = "some random text {key:23} some random text {key:name}";
var n = str.replace(/\{key:[\d]+\}/gi, function myFunction(x) {
    return x.replace(/\{key:/, '<span>').replace(/\}/, '</span>');
});
This way only {key:AnyNumber} gets replaced, and {key:AnyThingOtherThanNumbers} is left untouched.

It seems you are new to regex. You need to learn more about character classes and capturing groups and backreferences.
The regex is somewhat basic in your case if you do not need any nested encapsulated text support.
Let's start:
The beginning is {key: - it will match the substring literally. Note that { can be a special character (denoting the start of a limiting quantifier), thus it is a good idea to escape it: \{key:.
([^}]+) - This is a bit more interesting: the round brackets around are a capturing group that let us later back-reference the matched text. The [^}]+ means 1 or more characters (due to +) other than } (as [^}] is a negated character class where ^ means not)
} matches a } literally.
In the replacement string, we'll get the captured text using a backreference $1.
So, the entire regex will look like:
\{key:([^}]+)}
See demo on regex101.com
Code snippet:
var re = /\{key:([^}]+)}/g;
var str = '{key:23}';
var subst = '<span class="highlightable">$1</span>';
document.getElementById("res").innerHTML = str.replace(re, subst);
.highlightable
{
color: red;
}
<div id="res"/>
If you want to use a different behavior based on the value of key, then you'll need to adjust the regex to either match digits only (with \d+) or letters only (say, with [a-zA-Z] for English), or other shorthand classes, ranges (= character classes), or their combinations.
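One way to vary the behavior by value is a replacement function instead of a replacement string. The sketch below is an assumption about what such a variation could look like: highlightKeys is a hypothetical helper, and the extra "number" class is invented purely for illustration.

```javascript
// Hypothetical helper: wrap every {key:...} value, marking numeric values with an extra class.
function highlightKeys(text) {
  return text.replace(/\{key:([^}]+)\}/g, function (whole, value) {
    // "number" is an invented class name used only for this example.
    var cls = /^\d+$/.test(value) ? "highlightable number" : "highlightable";
    return '<span class="' + cls + '">' + value + "</span>";
  });
}

console.log(highlightKeys("a {key:23} b {key:name}"));
// a <span class="highlightable number">23</span> b <span class="highlightable">name</span>
```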

If your string is in var a, then:
var test = a.replace( /\{key:(\d+)\}/g, "<span class='highlightable'>$1</span>");

Javascript RegExp Matching weirdness

I have a RegExp:
/.?(NCAA|Division|I|Basketball|Champions,|1939-2011).?/gi
and some text "Champion"
somehow, this is coming back as a match, am I crazy?
0: "pio"
1: "i"
index: 4
input: "Champion"
length: 2
the loop is here:
// construct the pattern, dynamically
var someText = "Champion";
var phrase = ".?(NCAA|Division|I|Basketball|Champions,|1939-2011).?";
var pat = new RegExp(phrase, "gi"); // <- ends up being
var result;
while( result = pat.exec(someText) ) {
// do stuff!
}
There has to be something wrong with my RegExp, right?
EDIT:
The .? thing was just a quick and dirty attempt to say that I'd like to match one of those words AND/OR one of those words with a single char on either side. ex:
\sNCAA\s
NCAA
NCAA\s
\sNCAA
GOAL:
I'm trying to do some simple hit highlighting based on some search words. I've got a function that gets all of the text nodes on a page, and I'd like to go through them all and highlight any matches to any of the terms in my phrase variable.
I think that I just need to rework how I am building my RegExp.

Well, first of all, you're specifying case-insensitivity, and secondly, the letter I is one of your matchable strings.
Champion would match pio and i, because they both match /.?I.?/gi
It however doesn't match /.?Champions,.?/gi because of the trailing comma.

Add start (^) and end ($) anchors to the regexp.
/^.?(NCAA|Division|I|Basketball|Champions,|1939-2011).?$/gi
Without the anchors, the regexp's match can start and end anywhere in the string, which is why
/.?(NCAA|Division|I|Basketball|Champions,|1939-2011).?/gi.exec('Champion')
can match pio and i: because it's actually matching around the (case-insensitive) I. If you leave the anchors off, but remove the ...|I|..., the regex won't match 'Champion':
> /.?(NCAA|Division|Basketball|Champions,|1939-2011).?/gi.exec('Champion')
null

Champion matches /.?I.?/i.
Your own output notes that it's matching the substring "pio".
Perhaps you meant to bound the expression to the start and end of the input, with ^ and $ respectively:
/^.?(NCAA|Division|I|Basketball|Champions,|1939-2011).?$/gi
I know you said to ignore the .?, but I can't: it's most likely wrong, and it's most likely going to continue to cause you problems. Explain why they're there and we can tell you how to do it properly. :)
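For the stated goal of highlighting whole search terms, word boundaries (\b) are one possible replacement for the .? on either side. This is a different technique from the anchors suggested above, shown only as a sketch: buildTermPattern is a hypothetical helper, and the trailing comma of "Champions," is dropped here because \b cannot match after a comma that is followed by a space or the end of the text.

```javascript
// What the original pattern does: the case-insensitive "I" alternative matches inside "Champion".
var loose = /.?(NCAA|Division|I|Basketball|Champions,|1939-2011).?/gi;
var hit = loose.exec("Champion");
console.log(hit[0], hit[1], hit.index); // pio i 4

// Hypothetical helper: build a pattern that matches the terms as whole words only.
function buildTermPattern(terms) {
  var escaped = terms.map(function (term) {
    return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"); // match each term literally
  });
  return new RegExp("\\b(?:" + escaped.join("|") + ")\\b", "gi");
}

var strict = buildTermPattern(["NCAA", "Division", "I", "Basketball", "Champions", "1939-2011"]);
console.log("Champion".match(strict));              // null
console.log("Division I Basketball".match(strict)); // [ 'Division', 'I', 'Basketball' ]
```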
