javascript regexp backreferences in character class possible?

javascript regexp backreferences in character class possible? - javascript

Does javascript regular expressions support backreferences inside character class?
I want to do something like this:
htmlString.replace(/attribute_name=(['"])[^\1]*\1/,'')
But that does not work. This does:
htmlString.replace(/attribute_name=(['"])[^'"]*\1/,'')
Unfortunatelly my attribute_name can contain apostrophes or quotes, so I need to exlude the actual quoting character from the inside of the attribute, but leave the other one.
I can't be sure which one is used. I can safely assume that quotes are in form of entity, but still there can be apostrophes inside:
<div attribute_name="John's car" class="someClass"></div>
<div attribute_name='some "quoted text"' class="someClass"></div>
I am not able to predict which of " or ' will be used around the attribute.
How to get rid of the attribute and leave the class attribute alone (not cut too much)?
context:
I am getting the html by $('templateContainer').innerHTML . I have to modify that html before inserting it into the page again. I have to cut some non-standard attibutes and all the ID attributes.

I agree with the other answers in that I don't think that the attributes are the place to do this type of thing, but I'm also wary of recommending the DOM either. I just feel dirty when I do that, I don't know why.
I usually will try to use a javascript object to store my data in and then reference it using well-formed keys, etc. shrug It's more work, but it's cleaner IMHO. But, it definitely isn't the only way to accomplish the task.
As to your question, you could also use the non-greedy matching in JavaScript and it would look like this:
htmlString.replace(/ ?attribute_name=(['"]).*?\1/, '')
Regular Expressions in JavaScript | evolt.org

You'd be a LOT better off using DOM or some other actual model designed for hierarchical content. That said, if you must use regex, the simplest way would probably be to just use a | (OR) instead.
htmlString.replace(/attribute_name=('[^']*'|"[^"]*")/,'')

Related

How to append string after matching field with regex

I want to append a word after <body> tag, it should not modify/replace anything other than just append a word. I have done something like this, is it valid do empty parenthesis fir second capture group will match everything?
/(<body[^>]*>)()/, `$1${my_variable}$2`)

The second capture group, designed to capture nothing, will match "nothing" - it will form a match immediately after your closed body tag. There's nothing wrong with doing this for the regex, though you might want to be wary of using [^>]* - this negated character class will gladly match across lines and grab as much input as it can. Handy for matching multi-line tags, but often very dangerous.
Also, if you're on linux and for some reason have > symbols in filenames (which is valid!) your regex will break horribly, as shown here.
That being said, valid regex or not, it's usually a bad idea to use regex with html, since HTML isn't a regular language. Also, you could accidentally summon Cthulhu.

let page = "<html><body>Some info</body></html>";
page.replace("<body>", `<body>${my_variable}`);
or
page.replace(/<body>|<BODY>/, `<body>${my_variable}`);
If in the broweser you can also use document.querySelector("body").innerHTML
Also depending on which framework you're using there are better ways to accomplish this.

Preserving attributes without value when manipulating with JQuery

The crux of my problem comes down to this issue:
$('<video allowfullscreen></video>').prop('outerHTML') === '<video allowfullscreen></video>' //Is False
$('<video allowfullscreen></video>').prop('outerHTML') === '<video allowfullscreen=""></video>' //Is True
The input I'm giving to jQuery gets partially mangled and transformed in an unwanted way.
My goal is that I have (trusted) html coming in that I want to modify by adding some attributes and wrapping it in other elements before converting it back to a String and passing it to the user as text they can copy.
So an expected output might be something like:
<div><video class="myClass" allowfullscreen></video></div>
Since the input html is coming from elsewhere I'd like to make as little assumptions about it as possible. So ideally I don't want to take the string and parse over it to fix specific attributes or remove instances of ="" (in case there's a reason at some point to specifically set a property to "").
Even if I don't care about having a value set on these properties the correct value would be allowfullscreen="allowfullscreen" anyways. I don't have control over the html coming in so I need to take it as-is. So I can't simply 'fix' the html to pass along something like allowfullscreen="allowfullscreen".
Are there any options or ways to preserve valueless properties when I go from string->jQuery->string?
I'm even open to other technology suggestions that would be better suited to this sort of DOM manipulation, but jQuery would otherwise be ideal because of how concise its syntax is. Vanilla Javascript can do it properly, but the syntax makes the code more brittle which I would like to avoid.

See HTML5 - 8.1.2.3 Attributes
8.1.2.3 Attributes
Attributes for an element are expressed inside the element's start
tag.
Attributes have a name and a value. Attribute names must consist of
one or more characters other than the space characters, U+0000 NULL,
U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), ">" (U+003E), "/"
(U+002F), and "=" (U+003D) characters, the control characters, and any
characters that are not defined by Unicode. In the HTML syntax,
attribute names, even those for foreign elements, may be written with
any mix of lower- and uppercase letters that are an ASCII
case-insensitive match for the attribute's name.
Attribute values are a mixture of text and character references,
except with the additional restriction that the text cannot contain an
ambiguous ampersand.
Attributes can be specified in four different ways:
Empty attribute syntax
Just the attribute name. The value is implicitly the empty string.

match text between two html custom tags but not other custom tags

I have something like the following;-
<--customMarker>Test1<--/customMarker>
<--customMarker key='myKEY'>Test2<--/customMarker>
<--customMarker>Test3 <--customInnerMarker>Test4<--/customInnerMarker> <--/customMarker>
I need to be able to replace text between the customMarker tags, I tried the following;-
str.replace(/<--customMarker>(.*?)<--\/customMarker>/g, 'item Replaced')
which works ok. I would like to also ignore custom inner tags and not match or replace them with text.
Also I need a separate expression to extract the value of the attribute key='myKEY' from the tag with Text2.
Many thanks
EDIT
actually I am trying to find things between comment tags but the comment tags were not displaying correctly so I had to remove the '!'. There's a unique situation that required comment tags... in anycase if anyone knows enough regex to help, it would be great. thank u.

In the end, I did something like the following (incase anyone else needs this. enjoy!!! But note: Word about town is that using regex with html tags is not ideal, so do your own research and make up your mind. For me, it had to be done this way, mostly bcos i wanted to, but also bcos it simplified the job in this instance);-
var retVal = str.replace(/<--customMarker>(.*?)<--\/customMarker>/g, function(token, match){
//question 1: I would like to also ignore custom inner tags and not match or replace them with text.
//answer:
var replacePattern = /<--customInnerMarker*?(.*?)<--\/customInnerMarker-->/g;
//remove inner tags from match
match = $.trim(match.replace(replacePattern, ''));
//replace and return what is left with a required value
return token.replace(match, objParams[match]);
//question 2: Also I need a separate expression to extract the value of the attribute key='myKEY' from the tag with Text2.
//answer
var attrPattern = /\w+\s*=\s*".*?"/g;
attrMatches = token.match(attrPattern);//returns a list of attributes as name/value pairs in an array
})

Can't you use <customMarker> instead? Then you can just use getElementsByTagName('customMarker') and get the inner text and child elements from it.

A regex merely matches an item. Once you have said match, it is up to you what you do with it. This is part of the problem most people have with using regular expressions, they try and combine the three different steps. The regex match is just the first step.
What you are asking for will not be possible with a single regex. You're going to need a mini state machine if you want to use regular expressions. That is, a logic wrapper around the matches such that it moves through each logical portion.
I would advise you look in the standard api for a prebuilt engine to parse html, rather than rolling your own. If you do need to do so, read the flex manual to get a basic understanding of how regular expressions work, and the state machines you build with them. The best example would be the section on matching multiline c comments.

how to escape JavaScript code in HTML

I have some addHtml JavaScript function in my JS code. I wonder how to escape HTML/JS code properly. Basically, what I am trying right now is:
addHtml("<a onclick=\"alert(\\\"Hello from JS\\\")\">click me</a>")
However, that doesn't work. It adds the a element but it doesn't do anything when I click it.
I don't want to replace all " by ' as a workaround. (If I do, it works.)

I wonder how to escape HTML/JS code properly.
To insert string content into an HTML event handler attribute:
(1) Encode it as a JavaScript string literal:
alert("Hello \"world\"");
(2) Encode the complete JavaScript statement as HTML:
<a onclick="alert("Hello \"world\""">foo</a>
And since you seem to be including that HTML inside a JavaScript string literal again, you have to JS-encode it again:
html= "<a onclick=\"alert("Hello \\"world\\""\">foo<\/a>";
Notice the double-backslashes and also the <\/, which is necessary to avoid a </ sequence in a <script> block, which would otherwise be invalid and might break.
You can make this less bad for yourself by mixing single and double quotes to cut down on the amount of double-escaping going on, but you can't solve it for the general case; there are many other characters that will cause problems.
All this escaping horror is another good reason to avoid inline event handler attributes. Slinging strings full of HTML around sucks. Use DOM-style methods, assigning event handlers directly from JavaScript instead:
var a= document.createElement('a');
a.onclick= function() {
alert('Hello from normal JS with no extra escaping!');
};

My solution would be
addHtml('<a onclick="alert(\'Hello from JS\')">click me</a>')
I typically use single quotes in Javascript strings, and double quotes in HTML attributes. I think it's a good rule to follow.

How about this?
addHtml("<a onclick=\"alert("Hello from JS")\">click me</a>");
It worked when I tested in Firefox, at any rate.

addHtml("<a onclick='alert(\"Hello from JS\")'>click me</a>")

The problem is probably this...
As your code is now, it will add this to the HTML
<a onclick="alert("Hello from Javascript")"></a>
This is assuming the escape slashes will all be removed properly.
The problem is that the alert can't handle the " inside it... you'll have to change those quotes to single quotes.
addHtml("<a onclick=\"alert(\\\'Hello from JS\\\')\">click me</a>")
That should work for you.

What does the final HTML rendered in the browser look like ? I think the three slashes might be causing an issue .

Processing Javascript RegEx submatches

I am trying to write some JavaScript RegEx to replace user inputed tags with real html tags, so [b] will become <b> and so forth. the RegEx I am using looks like so
var exptags = /\[(b|u|i|s|center|code){1}]((.){1,}?)\[\/(\1){1}]/ig;
with the following JavaScript
s.replace(exptags,"<$1>$2</$1>");
this works fine for single nested tags, for example:
[b]hello[/b] [u]world[/u]
but if the tags are nested inside each other it will only match the outer tags, for example
[b]foo [u]to the[/u] bar[/b]
this will only match the b tags. how can I fix this? should i just loop until the starting string is the same as the outcome? I have a feeling that the ((.){1,}?) patten is wrong also?
Thanks

The easiest solution would be to to replace all the tags, whether they are closed or not and let .innerHTML work out if they are matched or not it will much more resilient that way..
var tagreg = /\[(\/?)(b|u|i|s|center|code)]/ig
div.innerHTML="[b][i]helloworld[/b]".replace(tagreg, "<$1$2>") //no closing i
//div.inerHTML=="<b><i>helloworld</i></b>"

AFAIK you can't express recursion with regular expressions.
You can however do that with .NET's System.Text.RegularExpressions using balanced matching. See more here: http://blogs.msdn.com/bclteam/archive/2005/03/15/396452.aspx
If you're using .NET you can probably implement what you need with a callback.
If not, you may have to roll your own little javascript parser.
Then again, if you can afford to hit the server you can use the full parser. :)
What do you need this for, anyway? If it is for anything other than a preview I highly recommend doing the processing server-side.

You could just repeatedly apply the regexp until it no longer matches. That would do odd things like "[b][b]foo[/b][/b]" => "<b>[b]foo</b>[/b]" => "<b><b>foo</b></b>", but as far as I can see the end result will still be a sensible string with matching (though not necessarily properly nested) tags.
Or if you want to do it 'right', just write a simple recursive descent parser. Though people might expect "[b]foo[u]bar[/b]baz[/u]" to work, which is tricky to recognise with a parser.

The reason the nested block doesn't get replaced is because the match, for [b], places the position after [/b]. Thus, everything that ((.){1,}?) matches is then ignored.
It is possible to write a recursive parser in server-side -- Perl uses qr// and Ruby probably has something similar.
Though, you don't necessarily need true recursive. You can use a relatively simple loop to handle the string equivalently:
var s = '[b]hello[/b] [u]world[/u] [b]foo [u]to the[/u] bar[/b]';
var exptags = /\[(b|u|i|s|center|code){1}]((.){1,}?)\[\/(\1){1}]/ig;
while (s.match(exptags)) {
s = s.replace(exptags, "<$1>$2</$1>");
}
document.writeln('<div>' + s + '</div>'); // after
In this case, it'll make 2 passes:
0: [b]hello[/b] [u]world[/u] [b]foo [u]to the[/u] bar[/b]
1: <b>hello</b> <u>world</u> <b>foo [u]to the[/u] bar</b>
2: <b>hello</b> <u>world</u> <b>foo <u>to the</u> bar</b>
Also, a few suggestions for cleaning up the RegEx:
var exptags = /\[(b|u|i|s|center|code)\](.+?)\[\/(\1)\]/ig;
{1} is assumed when no other count specifiers exist
{1,} can be shortened to +

Agree with Richard Szalay, but his regex didn't get quoted right:
var exptags = /\[(b|u|i|s|center|code)](.*)\[\/\1]/ig;
is cleaner. Note that I also change .+? to .*. There are two problems with .+?:
you won't match [u][/u], since there isn't at least one character between them (+)
a non-greedy match won't deal as nicely with the same tag nested inside itself (?)

Yes, you will have to loop. Alternatively since your tags looks so much like HTML ones you could replace [b] for <b> and [/b] for </b> separately. (.){1,}? is the same as (.*?) - that is, any symbols, least possible sequence length.
Updated: Thanks to MrP, (.){1,}? is (.)+?, my bad.

How about:
tagreg=/\[(.?)?(b|u|i|s|center|code)\]/gi;
"[b][i]helloworld[/i][/b]".replace(tagreg, "<$1$2>");
"[b]helloworld[/b]".replace(tagreg, "<$1$2>");
For me the above produces:
<b><i>helloworld</i></b>
<b>helloworld</b>
This appears to do what you want, and has the advantage of needing only a single pass.
Disclaimer: I don't code often in JS, so if I made any mistakes please feel free to point them out :-)

You are right about the inner pattern being troublesome.
((.){1,}?)
That is doing a captured match at least once and then the whole thing is captured. Every character inside your tag will be captured as a group.
You are also capturing your closing element name when you don't need it and are using {1} when that is implied. Below is a cleanup up version:
/\[(b|u|i|s|center|code)](.+?)\[\/\1]/ig
Not sure about the other problem.

We Keep Coding

JavaScript is the programming language of the Web.

javascript regexp backreferences in character class possible? - javascript

You'd be a LOT better off using DOM or some other actual model designed for hierarchical content. That said, if you must use regex, the simplest way would probably be to just use a | (OR) instead. htmlString.replace(/attribute_name=('[^']'|"[^"]")/,'')

Related

How to append string after matching field with regex

Preserving attributes without value when manipulating with JQuery

match text between two html custom tags but not other custom tags

how to escape JavaScript code in HTML

Processing Javascript RegEx submatches

Categories

Resources

We Keep Coding

JavaScript is the programming language of the Web.

javascript regexp backreferences in character class possible? - javascript

You'd be a LOT better off using DOM or some other actual model designed for hierarchical content. That said, if you must use regex, the simplest way would probably be to just use a | (OR) instead. htmlString.replace(/attribute_name=('[^']*'|"[^"]*")/,'')

Related

How to append string after matching field with regex

Preserving attributes without value when manipulating with JQuery

match text between two html custom tags but not other custom tags

how to escape JavaScript code in HTML

Processing Javascript RegEx submatches

Categories

Resources

You'd be a LOT better off using DOM or some other actual model designed for hierarchical content. That said, if you must use regex, the simplest way would probably be to just use a | (OR) instead. htmlString.replace(/attribute_name=('[^']'|"[^"]")/,'')