javascript regex for matching attributes in HTML string - javascript

Can anyone look at my JavaScript regex and suggest a correct one?
I'm trying to select attribute (name/value) pairs in an HTML/XML string like the following:
<unknowncustom:tag attrib1="XX' XX'" attrib2='YY" YY"' attrib3=ZZ""'>/unknowncustom:tag>
SOME TEXT that is not part of any tag and should not be selected, name='XX', y='ee';
<custom:tag attrib1="XX' XX'" attrib2='YY" YY"' attrib3=ZZ""'>/custom:tag>
I found many solutions, but none seem foolproof (including this one: Regular expression for extracting tag attributes).
My current regex selects the first attribute pair, but I can't figure out how to make it select all matching attributes. Here is the regex:
/<\w*:?\w*\s+(?:((\w*)\s*=\s*((?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))))[^>]*>/gim
Thanks

Let's have a go:
/(\w+)\s*=\s*((["'])(.*?)\3|([^>\s]*)(?=\s|\/>))(?=[^<]*>)/g
Regex is not ideal for this. If your attributes contain unescaped angle brackets < > it probably will not work.
Proof: http://regex101.com/r/dD4uT4
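A minimal sketch of how the regex above might be used to collect every name/value pair. The helper name extractAttributes and the sample string are assumptions made for illustration, not part of the original answer. Group 4 holds a quoted value and group 5 an unquoted one, so the sketch picks whichever is defined.

```javascript
// Regex from the answer above. Groups: 1 = name, 4 = quoted value, 5 = unquoted value.
const ATTR_PATTERN = /(\w+)\s*=\s*((["'])(.*?)\3|([^>\s]*)(?=\s|\/>))(?=[^<]*>)/g;

// Hypothetical helper: returns an array of [name, value] pairs.
function extractAttributes(html) {
  const pairs = [];
  for (const m of html.matchAll(ATTR_PATTERN)) {
    // Exactly one of the two value groups participates in a match.
    const value = m[4] !== undefined ? m[4] : m[5];
    pairs.push([m[1], value]);
  }
  return pairs;
}

// Hypothetical sample: two tags with free text between them.
const sample = `<custom:tag a="1" b='two' c=3 /> text x='no' <other d="4">`;
console.log(extractAttributes(sample));
```

The x='no' pair in the free text is skipped because the final lookahead requires a > before the next <. As the answer warns, this breaks down once attribute values contain unescaped angle brackets.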

Related

How to append string after matching field with regex

I want to append a word after the <body> tag; it should not modify or replace anything else, just append the word. I have done something like this. Is it valid? Will the empty parentheses for the second capture group match everything?
/(<body[^>]*>)()/, `$1${my_variable}$2`)
The second capture group, designed to capture nothing, will match "nothing" - it will form a match immediately after your closed body tag. There's nothing wrong with doing this for the regex, though you might want to be wary of using [^>]* - this negated character class will gladly match across lines and grab as much input as it can. Handy for matching multi-line tags, but often very dangerous.
Also, if you're on Linux and for some reason have > symbols in filenames (which is valid!), your regex will break horribly, as shown here.
That being said, valid regex or not, it's usually a bad idea to use regex with html, since HTML isn't a regular language. Also, you could accidentally summon Cthulhu.
let page = "<html><body>Some info</body></html>";
// Strings are immutable, so keep the value that replace() returns.
page = page.replace("<body>", `<body>${my_variable}`);
or
page = page.replace(/<body>|<BODY>/, `<body>${my_variable}`);
If in the browser, you can also use document.querySelector("body").innerHTML
Also depending on which framework you're using there are better ways to accomplish this.
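The point about the empty capture group can be checked directly. This is a small sketch under assumptions: the page string and myVariable are made-up values, and the replacer function is one possible way to append text, chosen because it sidesteps the special meaning of $ sequences in a replacement string.

```javascript
// Hypothetical inputs for illustration.
const page = '<html><body class="main">Some info</body></html>';
const myVariable = "Hello! ";

// An empty group () always succeeds and captures the empty string.
const emptyGroup = page.match(/(<body[^>]*>)()/)[2];
console.log(emptyGroup === ""); // true

// A replacer function returns the matched opening tag followed by the new text,
// so nothing else in the page is modified.
const updated = page.replace(/<body[^>]*>/i, (openTag) => openTag + myVariable);
console.log(updated);
```

Since the second group never captures anything, it can simply be dropped; the opening tag itself is all the replacement needs.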

Javascript regex- How to match xpaths

I am creating a regex for matching XPaths generated by Firebug. Can someone help me with that? An example XPath is:
.//*[@id='tab-HOME']/li[2]/span/span[1]/span/span[2]/span[2]/span/span
.//*[@id='any_possible_id']/span/span[2]/span/span
Keeping in mind the names allowed for ids in JavaScript, what could the regex be? I want to match
.//*[@id='any_possible_id']/li
Here is what I tried:
alert(/^\.\/\/\*\[[@id=]*\]/.test(xpath));
It is certainly incomplete.
Use [^\]]+ after the id= to match everything up to the closing ] of [@id=...]:
alert(/^\.\/\/\*\[@id=[^\]]+\]\/li/.test(xpath));
If you want to match paths that do not continue with /li, remove the \/li from the regex.
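A quick check of the suggested pattern, assuming the XPaths use the standard @id attribute syntax. The constant name XPATH_LI_PATTERN and the sample paths are chosen here for illustration.

```javascript
// Pattern from the answer: ".//*[@id=...]" followed by "/li".
const XPATH_LI_PATTERN = /^\.\/\/\*\[@id=[^\]]+\]\/li/;

// Hypothetical sample XPaths in the style Firebug generates.
const withLi = ".//*[@id='tab-HOME']/li[2]/span/span[1]";
const withoutLi = ".//*[@id='any_possible_id']/span/span[2]";

console.log(XPATH_LI_PATTERN.test(withLi));    // true
console.log(XPATH_LI_PATTERN.test(withoutLi)); // false
```

Because [^\]]+ cannot cross a closing bracket, the id value is matched as a unit and the /li check applies only to the step right after it.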

return substring match

I am trying to get just a part of a string with a regex.
This is the string I am testing:
class1 container _box _box_CEC493
The string is a series of classes applied to an element.
What I would like to get is just CEC493, which changes, since the regex will be applied to a bunch of different elements (and therefore to different strings like the one above).
The regex I am using now is
/\s_box_([0-9a-zA-Z]+)/
which returns
_box_CEC493, CEC493
How can I modify it in order to get just the second value (CEC493)?
Thank you
You could probably just split the string:
var str = "class1 container _box _box_CEC493";
var match = str.split('_').pop();
alert(match);
DEMO
The standard way regexes come back is like this:
[0]: Whole result
[1]: First parentheses capture group
etc
So the standard way that people access these is with result[1]. Does that cause any issues in your case?
[updated]
Instead of selecting all characters, select until an unwanted character. Since you are selecting from a number of classes, the _box_... class may appear on its own without a space before it, so don't put a space at the beginning of your regex.
str.match(/_box_([^\s]*)/)[1]
jsfiddle
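A small sketch that combines the two answers above: index [1] of the match result is the first capture group, and a null check covers strings with no _box_ class. The function name extractBoxId is hypothetical.

```javascript
// Hypothetical helper: returns the suffix of the first "_box_..." class, or null.
function extractBoxId(classNames) {
  // result[0] is the whole match, result[1] is the first capture group.
  const result = classNames.match(/_box_([^\s]*)/);
  return result ? result[1] : null;
}

console.log(extractBoxId("class1 container _box _box_CEC493")); // "CEC493"
console.log(extractBoxId("class1 container"));                   // null
```

Unlike the split('_').pop() approach, this does not depend on the _box_ class being the last one in the string.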

match text between two html custom tags but not other custom tags

I have something like the following;-
<--customMarker>Test1<--/customMarker>
<--customMarker key='myKEY'>Test2<--/customMarker>
<--customMarker>Test3 <--customInnerMarker>Test4<--/customInnerMarker> <--/customMarker>
I need to be able to replace text between the customMarker tags. I tried the following:
str.replace(/<--customMarker>(.*?)<--\/customMarker>/g, 'item Replaced')
which works OK. I would also like to ignore custom inner tags, and not match them or replace them with text.
I also need a separate expression to extract the value of the attribute key='myKEY' from the tag with Test2.
Many thanks
EDIT
Actually, I am trying to find things between comment tags, but the comment tags were not displaying correctly, so I had to remove the '!'. There's a unique situation that required comment tags. In any case, if anyone knows enough regex to help, it would be great. Thank you.
In the end, I did something like the following, in case anyone else needs it. Note: word about town is that using regex with HTML tags is not ideal, so do your own research and make up your own mind. For me, it had to be done this way, mostly because I wanted to, but also because it simplified the job in this instance:
var retVal = str.replace(/<--customMarker[^>]*>(.*?)<--\/customMarker>/g, function (token, match) {
  // Question 1: ignore custom inner tags, so they are neither matched nor replaced.
  var innerPattern = /<--customInnerMarker[^>]*>.*?<--\/customInnerMarker>/g;
  // Remove inner tags (and their content) from the captured text.
  match = $.trim(match.replace(innerPattern, ''));
  // Question 2: extract attributes such as key='myKEY' from the opening tag.
  // Accepts single- or double-quoted values; the result is null when there are none.
  var attrPattern = /\w+\s*=\s*(?:"[^"]*"|'[^']*')/g;
  var attrMatches = token.match(attrPattern); // array of name=value strings
  // Replace what is left with the required value from objParams.
  return token.replace(match, objParams[match]);
});
Can't you use <customMarker> instead? Then you can just use getElementsByTagName('customMarker') and get the inner text and child elements from it.
A regex merely matches an item. Once you have said match, it is up to you what you do with it. This is part of the problem most people have with using regular expressions, they try and combine the three different steps. The regex match is just the first step.
What you are asking for will not be possible with a single regex. You're going to need a mini state machine if you want to use regular expressions. That is, a logic wrapper around the matches such that it moves through each logical portion.
I would advise you to look in the standard API for a prebuilt engine to parse HTML, rather than rolling your own. If you do need to roll your own, read the flex manual to get a basic understanding of how regular expressions work and of the state machines you build with them. The best example would be the section on matching multiline C comments.
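The "logic wrapper around the matches" idea can be sketched as a three-state scanner. This is only an illustration under assumptions: the function name outerMarkerTexts, the state names, and the choice to drop inner-marker content entirely are all made up here, and the markers are taken in the simplified form shown in the question.

```javascript
// Hypothetical helper: collects the text directly inside each customMarker,
// skipping anything inside a customInnerMarker.
function outerMarkerTexts(input) {
  // One regex finds every opening or closing marker; the state logic lives outside it.
  const tagPattern = /<--(\/?)(customMarker|customInnerMarker)\b[^>]*>/g;
  const results = [];
  let state = "outside"; // "outside" | "inMarker" | "inInner"
  let buffer = "";
  let lastIndex = 0;
  let m;
  while ((m = tagPattern.exec(input)) !== null) {
    // Text between the previous tag and this one counts only inside an outer marker.
    if (state === "inMarker") buffer += input.slice(lastIndex, m.index);
    const closing = m[1] === "/";
    const name = m[2];
    if (state === "outside" && !closing && name === "customMarker") {
      state = "inMarker";
      buffer = "";
    } else if (state === "inMarker" && !closing && name === "customInnerMarker") {
      state = "inInner";
    } else if (state === "inInner" && closing && name === "customInnerMarker") {
      state = "inMarker";
    } else if (state === "inMarker" && closing && name === "customMarker") {
      results.push(buffer.trim());
      state = "outside";
    }
    lastIndex = tagPattern.lastIndex;
  }
  return results;
}

const markup = [
  "<--customMarker>Test1<--/customMarker>",
  "<--customMarker key='myKEY'>Test2<--/customMarker>",
  "<--customMarker>Test3 <--customInnerMarker>Test4<--/customInnerMarker> <--/customMarker>",
].join("\n");
console.log(outerMarkerTexts(markup));
```

The regex only locates tags; deciding what the text between them means is left to the surrounding state, which is the separation the answer describes. A real parser would also need to handle nesting deeper than one level and malformed input.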

Javascript regular expression to strip out content between double quotes

I'm looking for a JavaScript regex that will remove all content wrapped in quotes (and the quotes too) from a string in the Outlook format for listing email addresses. Take a look at the sample below. I am a regex novice and really need some help with this one; any help or resources would be appreciated!
"Bill'sRestauraunt"BillsRestauraunt@comcast.net,"Rob&Julie"robjules@ntelos.net,"Foo&Bar"foobar@cstone.net
Assuming no nested quotes:
mystring.replace(/"[^"]*"/g, '')
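A short sketch of that replacement applied to an Outlook-style list. The sample addresses and variable names are invented for illustration, and it assumes, as the answer does, that display names contain no nested double quotes.

```javascript
// Hypothetical Outlook-style list: each address is preceded by a quoted display name.
const mystring = '"Bill\'s Restaurant"bills@example.com,"Rob & Julie"robjules@example.com';

// Remove every double-quoted run, quotes included.
const stripped = mystring.replace(/"[^"]*"/g, '');
console.log(stripped); // bills@example.com,robjules@example.com

// The remaining comma-separated addresses can then be split into an array.
const addresses = stripped.split(',');
console.log(addresses.length); // 2
```

The apostrophe in the first display name is left alone because the pattern only treats double quotes as delimiters.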
Try this regular expression:
/(?:"(?:[^"\\]+|\\(?:\\\\)*.)*"|'(?:[^'\\]+|\\(?:\\\\)*.)*')/g
Here's a regex I use to find and decompose the quoted strings within a paragraph. It also isolates several attendant tokens, especially adjacent whitespace. You can string together whichever parts you want.
var re = /([^\s\(]?)"(\s*)([^\\]*?(\\.[^\\]*)*)(\s*)("|\n\n)([^\s\)\.\,;]?)/g;
