Because of the way that jQuery deals with script tags, I've found it necessary to do some HTML manipulation using regular expressions (yes, I know... not the ideal tool for the job). Unfortunately, it seems like my understanding of how captured groups work in JavaScript is flawed, because when I try this:
var scriptTagFormat = /<script .*?(src="(.*?)")?.*?>(.*?)<\/script>/ig;
html = html.replace(
scriptTagFormat,
'<span class="script-placeholder" style="display:none;" title="$2">$3</span>');
The script tags get replaced with the spans, but the resulting title attribute is blank. Shouldn't $2 match the content of the src attribute of a script tag?
Nesting of groups is irrelevant; their numbering is determined strictly by the positions of their opening parentheses within the regex. In your case, that means it's group #1 that captures the whole src="value" sequence, and group #2 that captures just the value part.
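To make the numbering rule concrete, here is a minimal sketch. The sample tag and the variable names are made up for illustration; only the nested-group shape comes from the question.

```javascript
// Groups are numbered by the position of their opening parenthesis, left to right.
const tag = '<script src="app.js"></script>';

// Group 1 opens first and wraps the whole attribute; group 2 opens second and wraps the value.
const m = tag.match(/(src="(.*?)")/);

const outerGroup = m[1]; // 'src="app.js"'
const innerGroup = m[2]; // 'app.js'
console.log(outerGroup, innerGroup);
```

So the numbering in the question is right, and the blank title has to come from somewhere else in the pattern.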
Try this:
/<script (?:(?!src).)*(?:src="(.*?)")?.*?>(.*?)<\/script>/ig
See here: rubular
As stema wrote, the .*? matches too much. With the negative lookahead (?:(?!src).)* you will match only until a src attribute.
But actually in this case you could also just move the .*? into the optional part:
/<script (?:.*?src="(.*?)")?.*?>(.*?)<\/script>/ig
See here: rubular
The .*? matches too much because the following group is optional: the engine is free to skip the group entirely, so your src attribute ends up being consumed by one of the surrounding .*? instead. If you remove the ? after your first group, it works.
Update: As @morja pointed out, the solution is to move the first .*? into the optional src part.
Just for completeness: /<script (?:.*?(src="(.*?)"))?.*?>(.*?)<\/script>/ig
You can see it here on rubular (corrected my link also)
If you don't want to use the content of the first capturing group, then make it a non capturing group using (?:)
/<script (?:.*?(?:src="(.*?)"))?.*?>(.*?)<\/script>/ig
Then the results you want are in $1 and $2.
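A small sketch comparing the pattern from the question with the revised one. The sample html string is an assumption, chosen so that src is not the first attribute; that is the case where the original pattern loses it.

```javascript
// Hypothetical input: src is preceded by another attribute.
const html = '<script type="text/javascript" src="a.js">x()</script>';

// Pattern from the question: the first lazy .*? matches nothing, the optional
// group is skipped, and the second .*? swallows all the attributes.
const original = /<script .*?(src="(.*?)")?.*?>(.*?)<\/script>/i;

// Revised pattern: the lazy .*? sits inside the optional group, so the engine
// tries to reach src="..." before giving up on the group.
const revised = /<script (?:.*?src="(.*?)")?.*?>(.*?)<\/script>/i;

const before = html.match(original);
const after = html.match(revised);

console.log(before[2]); // undefined, which replace() turns into an empty string
console.log(after[1]);  // 'a.js'
console.log(after[2]);  // 'x()'
```

When src happens to be the first attribute, the original pattern does capture it, which may be why a simple test case appears to work.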
Could you post the html you are retrieving? Your code works fine in a simple example: jsfiddle (warning: alert box)
My first guess is that one of your script tags does not have a src meaning you are left with a single capture group (the script contents).
I'm thinking that regular expressions by themselves can't do exactly what I'm looking for, so here's my modification to work around the problem:
var scriptTagFormat = /<script\s+((.*?)="(.*?)")*\s*>(.*?)<\/script>/ig;
html = html.replace(
scriptTagFormat,
'<span class="script-placeholder" style="display:none;" $1>$4</span>');
Before, I wanted to avoid setting non-standard attributes on the replacement span. This code blindly copies all attributes instead. Luckily, the non-standard attributes aren't stripped out of the DOM when I insert the HTML, so it will work for my purposes.
Related
I'm trying to make a regular expression that finds the tagnames and attributes of elements. For example, if I have this:
<div id="anId" class="aClass">
I want to be able to get an array that looks like this:
["(full match)", "div", "id", "anId", "class", "aClass"]
Currently, I have the regex /<(\S*?)(?: ?(.*?)="(.*?)")*>/, but for some reason it skips over every attribute except for the last one.
var str = '<div id="anId" class="aClass">'
console.log(str.match(/<(\S*)(?: ?(.*?)="(.*?)")*>/));
Regex101: https://regex101.com/r/G0ncwF/2
Another odd thing: if I remove the * after the non-capture group, the capture group in quotes seems to somehow "forget" that it's lazy. (Regex101: https://regex101.com/r/C0UwI8/2)
Why does this happen, and how can I avoid it? I couldn't find any questions/answers that helped me (Python re.finditer match.groups() does not contain all groups from match looked promising, but didn't seem to help me at all)
(note: I know there are better ways to get the attributes, I'm just experimenting with regex)
UPDATE:
I've figured out at least why the quantifiers seem to "forget" that they're lazy. It's actually just that the regex is trying to match all the way to the angle brackets. I suppose I must have been thinking that the non-capturing group was "insulating" everything and preventing that from happening, and I didn't see it was still lazy because there was only one angle bracket for it to find.
var str = '"foo" "bar"> "baz>"'
console.log("/\".*?\"/ produces ", str.match(/".*?"/), ", finds first quote, finds text, lazily stops at second quote");
console.log("/\".*?\">/ produces ", str.match(/".*?">/), ", finds first quote, finds text, sees second quote but doesn't see angle bracket, keeps going until it sees \">, lazily stops");
So at least that's solved. But I still don't understand why it skips over every attribute but the last one.
And note: Other regexes using different tricks to find the attributes are nice and all, but I'm mostly looking to learn why my regex skips over the attributes, so I can maybe understand regex a bit better.
Playing along with your experimentation you could do this: Instead of scanning for what you want, you can scan for what you don't want, and then filter it out:
const html = '<div id="anId" class="aClass">';
const regex = /[<> ="]/;
let result = html.split(regex).filter(Boolean);
console.log('result: '+JSON.stringify(result));
Output:
result: ["div","id","anId","class","aClass"]
Explanation:
regex /[<> ="]/ lists all chars you don't want
.split(regex) splits your text along the unwanted chars
.filter(Boolean) gets rid of the empty strings that the split leaves behind
Mind you, this has flaws. For example, it will split incorrectly for html such as <div id="anId" class="aClass anotherClass">, i.e. when an attribute value contains a space. To support that you could preprocess the html with another regex to escape spaces in quotes, then postprocess with another regex to restore the spaces...
Yes, an HTML parser is more reliable for this kind of task.
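On the "why only the last attribute" part: in JavaScript a capture group inside a repeated group is overwritten on every repetition, so after the match only the final iteration is left. A minimal sketch of that behaviour, plus one possible two-step workaround using matchAll. The attribute pattern is an assumption: it only handles double-quoted values without embedded quotes.

```javascript
const str = '<div id="anId" class="aClass">';

// The repeated group runs twice; groups 2 and 3 keep only the second pass.
const single = str.match(/<(\S*)(?: ?(.*?)="(.*?)")*>/);
console.log(single[2], single[3]); // class aClass

// Workaround sketch: grab the tag first, then iterate over its attributes.
const tagMatch = str.match(/<(\w+)([^>]*)>/);
const attrs = [...tagMatch[2].matchAll(/(\w+)="([^"]*)"/g)];

// Flatten into the shape asked for in the question.
const flat = [tagMatch[0], tagMatch[1], ...attrs.flatMap((a) => [a[1], a[2]])];
console.log(flat);
// ['<div id="anId" class="aClass">', 'div', 'id', 'anId', 'class', 'aClass']
```

Each matchAll result is a separate match with its own groups, so nothing gets overwritten.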
I want to append a word after the <body> tag. It should not modify or replace anything else, just append a word. I have done something like this. Is it valid to use empty parentheses for the second capture group, and will it match everything?
html.replace(/(<body[^>]*>)()/, `$1${my_variable}$2`)
The second capture group, designed to capture nothing, will match "nothing" - it will form a match immediately after your closed body tag. There's nothing wrong with doing this for the regex, though you might want to be wary of using [^>]* - this negated character class will gladly match across lines and grab as much input as it can. Handy for matching multi-line tags, but often very dangerous.
Also, if you're on linux and for some reason have > symbols in filenames (which is valid!) your regex will break horribly, as shown here.
That being said, valid regex or not, it's usually a bad idea to use regex with html, since HTML isn't a regular language. Also, you could accidentally summon Cthulhu.
let page = "<html><body>Some info</body></html>";
// replace() returns a new string; it does not modify page in place
page = page.replace("<body>", `<body>${my_variable}`);
or
page = page.replace(/<body>|<BODY>/, `<body>${my_variable}`);
If you're in the browser, you can also use document.querySelector("body").innerHTML
Also depending on which framework you're using there are better ways to accomplish this.
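If you do stay with a regex, here is a sketch of two ways to append after the opening tag without any capture groups. The page string and insertedText are hypothetical; insertedText deliberately contains a dollar sequence to show one pitfall of replacement strings.

```javascript
// Hypothetical input with attributes on the body tag.
const page = '<html><body class="main">Some info</body></html>';
// Hypothetical text to insert; "$1" here is meant literally.
const insertedText = "Costs $1 ";

// $& in a replacement string stands for the whole match.
const viaPattern = page.replace(/<body[^>]*>/i, "$&Hello ");

// A replacer function's return value is inserted as-is, so "$1" is not
// interpreted as a group reference.
const viaFunction = page.replace(/<body[^>]*>/i, (match) => match + insertedText);

console.log(viaPattern);  // <html><body class="main">Hello Some info</body></html>
console.log(viaFunction); // <html><body class="main">Costs $1 Some info</body></html>
```

The function form is the safer choice when the inserted text comes from a variable you don't control.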
Because of some poor forward thinking when building my search database, I'm left with some links in the format of: (Google Homepage)[http://google.com]
I've been trying to use regex in JavaScript to convert the format above into a regular HTML link in the format of <a href="http://google.com">Google Homepage</a>.
I've been able to pick out the parentheses and brackets via regex, but am having trouble getting regex to replace them with HTML as appropriate. Thanks!
This is going to be pretty straightforward. Basically, just make two capture groups. One capture group will have the text inside the parentheses and the other will have the URL inside the square brackets.
\((.*?)\)\[(.*?)\]
#1 #2
Then, you can simply stick each captured part into your tag, like this:
<a href="\2">\1</a>
Here is a demo
This works for me
var str1="(Google Homepage)[http://google.com]";
var pattern=/\((.*)\)\[(.*)\]/;
var str2=str1.replace(pattern,"<a href='$2'>$1</a>");
console.log(str2);
Considering that ( ) and [ ] define the boundaries, you can try:
\(([^\)]+)\).*\[([^\]]+)\]
\1 will be the text and \2 will be the link.
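Putting the pieces together, a minimal sketch of the whole conversion in JavaScript. The input string with two links is an assumption, used to show why negated character classes and the g flag are worth having.

```javascript
// Hypothetical input containing two links in the (text)[url] format.
const input = "See (Google Homepage)[http://google.com] and (MDN)[https://developer.mozilla.org]";

// [^)]+ and [^\]]+ cannot run past the closing delimiter, so each link
// is matched separately; g replaces every occurrence, not just the first.
const linkPattern = /\(([^)]+)\)\[([^\]]+)\]/g;

// In JavaScript replacement strings the groups are written $1 and $2.
const output = input.replace(linkPattern, '<a href="$2">$1</a>');

console.log(output);
// See <a href="http://google.com">Google Homepage</a> and <a href="https://developer.mozilla.org">MDN</a>
```

A greedy (.*) in place of the negated classes would stretch from the first ( to the last ) on the line and merge both links into one.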
This question is similar to "Allowing new line characters in javascript regex",
but the /m solution from there does not work with str.replace. You can test the code below at this page:
<p id="demo"><i>I need to TRIM the italics here,
despite this line.</i>
</p>
<button onclick="myFunction()">Try it</button>
<script>
function myFunction()
{
var str=document.getElementById("demo").innerHTML;
var n=str.replace(/^(\s*)<i>(.+)<\/i>(\s*)$/m,"$1$2$3"); //tested also /s
alert(str)
document.getElementById("demo").innerHTML=n;
}
</script>
This answer is mostly to give you some insight into why your current approach does not work, and how you generally solve it.
The reason m doesn't help is that the other answer is wrong. This is not what m does. m simply makes the anchors match line beginnings and endings in addition to the string beginnings and endings. Some regex flavors have s for what you want to accomplish; ECMAScript only gained it with ES2018 (the s, or dotAll, flag), so it is not available in older engines. The simplest thing (and general solution) is to replace . (which matches everything except line breaks) with [\s\S] (which matches whitespace and non-whitespace, i.e. everything).
However, Casimir's approach is better in your case, as it avoids some other problems like greediness. Of course, as Casimir said, if there are tags in between the opening and closing <i> tags, then the approach will not work. In that case, something like <i>([\s\S]+?)</i> might be an option, but that's still not the full solution, in case you have nested i-tags or attributes in the opening tag, or capitalized I-tags and whatnot.
All in all, using regex to parse HTML is wrong! You should really use DOM manipulation. Especially, since you are using Javascript - THE language for DOM manipulation. What you should really do is traverse the DOM for all i tags in your demo element, and replace them with their inner HTML.
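For the regex side of this, a short sketch of the difference between ., [\s\S] and the s flag. The test string is an assumption modelled on the demo paragraph, and the s flag needs an ES2018 or newer engine.

```javascript
// Hypothetical input: an <i> element whose text spans two lines.
const str = "<i>I need to TRIM the italics here,\ndespite this line.</i>";

// m only changes ^ and $; the dot still refuses to cross the line break.
const withDot = /<i>(.+)<\/i>/m.test(str);

// [\s\S] matches any character, including line breaks.
const withClass = str.replace(/<i>([\s\S]+?)<\/i>/, "$1");

// The s (dotAll) flag makes the dot itself match line breaks.
const withDotAll = str.replace(/<i>(.+?)<\/i>/s, "$1");

console.log(withDot);                  // false
console.log(withClass === withDotAll); // true
```

Both working variants still share the limits described above: nested tags, attributes on the opening tag and uppercase tags are not handled.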
A way to avoid problems with newlines is to not use the dot, example:
var n=str.replace(/<i>([^<]+)<\/i>/,"$1");
I have replaced the dot with [^<] (anything that is not a <, which includes newlines).
The m modifier is not needed here, and you don't need to capture the whitespace either.
Note that my solution assumes that you don't have any < between <i> and </i>.
Otherwise, for example when you have nested tags, you can use this trick to avoid a lazy quantifier:
var n=str.replace(/<i>((?:[^<]+|<+(?!\/i>))+)<\/i>/,"$1");
I'm trying to build a text formatter that will add p and br tags to text based on line breaks. I currently have this:
s.replace(/\n\n/g, "\n</p><p>\n");
Which works wonderfully for creating paragraph ends and beginnings. However, trying to find single line breaks (for the br tags) isn't working so well. Attempting to do a matched group replacement isn't working, as it ignores the parentheses and replaces the entire regex match:
s.replace(/\w(\n)\w/g, "<br />\n");
I've tried removing the g option (still replaced entire match, but only on first match). Is there another way to do this?
Thanks!
You can capture the parts you don't want to replace and include them in the replacement string with $ followed by the group number:
s.replace(/(\w)\n(\w)/g, "$1<br />\n$2");
See this section in the MDN docs for more info on referring to parts of the input string in your replacement string.
Catch the surrounding characters also:
s.replace(/(\w)(\n\w)/g, "$1<br />$2");
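One thing to watch with the capturing versions: the character after the line break is consumed by the match, so it cannot also serve as the character before the next line break. A sketch of that effect, with a lookahead variant as one possible alternative. The lookahead is my substitution, not something from the answers above, and the sample text is made up.

```javascript
// Hypothetical input: three one-character lines, then a paragraph break.
const s = "a\nb\nc\n\nnext paragraph";

// Capturing both neighbours: "b" is consumed by the first match,
// so the break between "b" and "c" is skipped.
const consuming = s.replace(/(\w)\n(\w)/g, "$1<br />\n$2");

// Lookahead variant: (?=\w) checks the next character without consuming it.
const lookahead = s.replace(/(\w)\n(?=\w)/g, "$1<br />\n");

console.log(JSON.stringify(consuming)); // "a<br />\nb\nc\n\nnext paragraph"
console.log(JSON.stringify(lookahead)); // "a<br />\nb<br />\nc\n\nnext paragraph"
```

With lines longer than one character both versions behave the same; the double line break is left alone in either case because \n is not a word character.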