Performing a non-greedy regular expression match in JavaScript - javascript

My input string looks something like:
var someString = 'This is a nice little string with <a target="_" href="/carSale/12/..">link1</a>. But there is more that we want to do with this. Lets insert another <a target="_" href="/carSale/13/..">link2</a> ';
My end goal is to match every anchor element that has "carSale" within its href attribute and replace it with the text inside the anchor.
For example:
Replace <a target="_" href="/carSale/12/..">link1</a> with string link1
but it should not replace
<a target="_" href="/bikeSale/12/..">link3</a>
since the above href does not contain the string "carSale"
I have created a regular expression object for this. But it seems to be performing a greedy match.
var regEx = /(<a.*carSale.*>)(.*)(<\/a>)/;
var someArr = someString.match(regEx);
console.log(someArr[0]);
console.log(someArr[1]);
console.log(someArr[2]);
console.log(someArr[3]);
Appending the modifier 'g' at the end of the regular expression gives bizarre results.
Fiddle here :
http://jsfiddle.net/jameshans/54X5b/

Rather than using a regular expression, use a parser. This won't break as easily and uses the native (native as in the browser's) parser so is less susceptible to bugs:
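var div = document.createElement("div");
div.innerHTML = someString;
// Get links
var links = div.querySelectorAll("a");
for (var i = 0; i < links.length; ++i) {
    var a = links[i];
    // getAttribute returns the href as written; a.href would resolve it to an absolute URL
    if ((a.getAttribute("href") || "").indexOf("carSale") >= 0) {
        // Replace the element with its text; going through parentNode also works for nested links
        a.parentNode.replaceChild(document.createTextNode(a.textContent), a);
    }
}
// div.innerHTML now holds the string with the matching links replaced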
See http://jsfiddle.net/prankol57/d72Vr/
However, if you are confident that your html will always follow the pattern specified by your regex, then you can use it. I will drop a link to
RegEx match open tags except XHTML self-contained tags

Online Demo
I am not sure what your matching groups are, but how about this expression:
/^<a.*href="((?:.*)carSale(?:.*))".*>(.*)<\/a>$/
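Note that in this expression I am requiring the href itself to contain carSale, which I think is where you want the expression to match. Because of the ^ and $ anchors, it only matches when the whole string is a single anchor element.
And since, as I understand it, you want to replace the whole expression, all you need to do is: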
var result = '<a target="_" href="\/carSale/12\/..">link1<\/a>'.replace(/(^<a.*href="((?:.*)carSale(?:.*))".*>(.*)<\/a>$)/,"temp text");

Or this one:
/(<a.*?carSale.*?>)(.*?)(<\/a>)/
The ? makes your repeater non-greedy, so it eats as little as possible, versus the default behavior of * which is to eat as much as possible. So with the ? added, the (.*?) will stop at the first </a> rather than the last one
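To make the difference concrete, here is a minimal sketch that runs the greedy and the lazy pattern on the question's input. The `mixed` string and the `strict` pattern are assumptions added for illustration, not part of the answers above. They show that a lazy `.*?` can still run across an earlier, unrelated anchor, and that negated character classes are one way to avoid this.

```javascript
const someString = 'This is a nice little string with <a target="_" href="/carSale/12/..">link1</a>. But there is more that we want to do with this. Lets insert another <a target="_" href="/carSale/13/..">link2</a> ';

// Greedy: a single match that runs from the first <a to the last </a>.
const greedy = someString.match(/(<a.*carSale.*>)(.*)(<\/a>)/);

// Lazy + g flag: each anchor is matched on its own and replaced by its text ($2).
const lazyResult = someString.replace(/(<a.*?carSale.*?>)(.*?)(<\/a>)/g, '$2');

// Caveat: the lazy .*? still crosses an earlier anchor that lacks "carSale".
const mixed = '<a href="/bikeSale/1">b</a> <a href="/carSale/2">c</a>';
const lazyMixed = mixed.replace(/(<a.*?carSale.*?>)(.*?)(<\/a>)/g, '$2');

// Assumed tightening: negated classes keep each piece inside one tag / one attribute value.
const strict = /<a\b[^>]*\bhref="[^"]*carSale[^"]*"[^>]*>(.*?)<\/a>/g;
const strictMixed = mixed.replace(strict, '$1');

console.log(greedy[2]);   // text captured by the greedy pattern
console.log(lazyResult);  // both carSale links replaced by their text
console.log(lazyMixed);   // the bikeSale link has been swallowed too
console.log(strictMixed); // the bikeSale link is left untouched
```

The strict pattern still assumes double-quoted href values and no nested markup inside the link, so it is only suitable for a limited, known set of HTML.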

(<a[^>]*(href=\"([^>]*(?=carSale)[^>]*)\")[^>]*>)([^<]*)(<\/a>)*
Groups 3 and 4 (the href value and the link text) are what you are interested in.

Related

Extract inner text from anchor tag string using a regular expression in JavaScript

I am new to AngularJS. I have a regex which gets all the anchor tags. My regex is
/<a[^>]*>([^<]+)<\/a>/g
And I am using the match function on a string like this:
var str = 'abc.jagadale#gmail.com'
So now I am using code like
var value = str.match(/<a[^>]*>([^<]+)<\/a>/g);
Here I am expecting the output to be abc.jagadale#gmail.com, but I am getting exactly the same string as the input string. Can anyone please help me with this? Thanks in advance.
Why are you trying to reinvent the wheel?
You are trying to parse an HTML string with a regex, which is a very complicated task. Just use the DOM or jQuery to get the links' contents; they are made for this.
Put the HTML string in as the HTML of a jQuery/DOM element.
Then search this created DOM element for all the a elements
inside it and return their contents in an array.
This is how your code should look:
var str = 'abc.jagadale#gmail.com';
var results = [];
$("<div></div>").html(str).find("a").each(function(l) {
results.push($(this).text());
});
You need to capture the group inside the anchor tags. The regular expression already matches the inner group ([^<]+). But when matching, there are different ways to extract that inner text.
When using the match function, it returns an array of matched elements: the first one is the match for the whole regular expression, and the following elements are the matches for the groups in the regular expression.
Try this:
var reg = /<a[^>]*>([^<]+)<\/a>/g
reg.exec(str)[1]
Also, the match function returns the captured groups only if the g flag is not present; with the g flag it returns an array of the whole matches, without the groups.
Check https://javascript.info/regexp-groups for further documentation.
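As a rough illustration of the point about the g flag, the sketch below compares `match` and an `exec` loop. The input string is hypothetical (the markup of the question's original string is not shown), so treat the anchors here as an assumption.

```javascript
// Hypothetical input; the question's real markup is not available.
const str = '<a href="#one">first</a> and <a href="#two">second</a>';
const re = /<a[^>]*>([^<]+)<\/a>/g;

// With the g flag, match() returns the whole matches only (no groups).
const whole = str.match(re);

// exec() in a loop exposes capture group 1 for every match.
const inner = [];
let m;
while ((m = re.exec(str)) !== null) {
  inner.push(m[1]);
}

console.log(whole);
console.log(inner);
```

In engines that support it, `str.matchAll(re)` gives the same per-match groups without managing the loop by hand.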
Brief
Don't use regex for this. Regex is a great tool, don't get me wrong, but it's not what you're looking for. Regex cannot properly parse HTML and should only be used to do so if it's a limited, known set of HTML.
Try, for example, adding content:">" to your style attribute. You'll see your pattern now fails or gives you an incorrect result. I don't like to use this quote all the time, but I think it's necessary to use it in this case:
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.
Use builtin functions. jQuery makes this super easy to accomplish. See my Code section for a demonstration. It's way more legible than any regex variant.
Code
DOM from page
The following snippet gets all anchors on the actual page.
$("a").each(function() {
console.log($(this).text())
})
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
abc.jagadale#gmail.com
abc2.jagadale#gmail.com
DOM in string
The following snippet gets all anchors in the string (converted to DOM element)
var s = `email3#domain.com
email4#domain.com`
$("<div></div>").html(s).find("a").each(function() {
console.log($(this).text())
})
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
email1#domain.com
email2#domain.com
Given the use case of parsing a string, instead of having an actual DOM to work with, it does seem like regex is the way to go, unless you want to load the HTML into a document fragment and parse that.
One way to get all of your matches is to make use of split:
var htmlstr = "<p><a href='url'>asdf#bsdf.com</a></p>"
var matches = htmlstr.split(/<a.+?>([A-Za-z.#]+)<\/a>/).filter((t, i) => i % 2)
Using a regex with split returns all of the matches along with the text around them, then filtering by index % 2 will pare it down to just the regex matches.
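A small sketch of why the `i % 2` filter works: when the separator regex has one capture group, `split` interleaves the captured text with the surrounding pieces, so the captures land on the odd indices. The two-anchor input string is an assumption for illustration.

```javascript
// Hypothetical input with two anchors.
const htmlstr = "<p><a href='u1'>a#b.com</a> and <a href='u2'>c#d.com</a></p>";

// Even indices: text around the anchors. Odd indices: the captured group.
const parts = htmlstr.split(/<a.+?>([A-Za-z.#]+)<\/a>/);

// Keep only the odd indices, i.e. the captured link texts.
const matches = parts.filter((t, i) => i % 2);

console.log(parts);
console.log(matches);
```

This relies on the pattern having exactly one capture group; with more groups, the stride of the filter would have to change.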

Javascript: replace() all but only outside html tags

I have an autocomplete form and when showing the results matching the user's search string, I want to highlight the search string itself. I plan to do this by wrapping any occurrence of the search string within a tag such as , or a with a given class. Now, the problem is that when using regEx I have problems if the pattern occurs within a html tag.
For instance
var searchPattern = 'pa';
var originalString = 'The pattern to <span class="something">be replaced is pa but only outside the html tag</span>';
var regEx = new RegExp(searchPattern, "gi")
var output = originalString.replace(regEx, "<strong>" + searchPattern + "</strong>");
alert(output);
(Demo: http://jsfiddle.net/cumufLm3/7/ )
This is going to replace also the occurrence of "pa" within the tag
<span class="something">
breaking the code. I'm not sure how to deal with this. I've been checking various similar questions, and I've understood that in general I shouldn't use regular expressions to parse html. But I'm not sure if there is any quick way to parse smoothly the html string, alter the text of each node, and "rebuild" the string with the text altered?
Of course I suppose I could use $.parseHTML(), iterate over each node, and somehow rewrite the string, but this seems to me to be too complex and prone to errors.
Is there a smart way to parse the html string somehow to tell "do this only outside of html tags"?
Please notice that the content of the tag itself must be handled. So, in my example above, the replace() should act also on the part "be replaced is pa but only outside the html tag".
Any idea of either a regular expression solid enough to deal with this, or (better, I suppose) to elegantly handle the text parts within the html string?
Your code should look like this:
var searchWord = 'pa';
var originalString = 'The pattern to <span class="something">be replaced is pa but only outside the html tag</span>';
var regEx = new RegExp("(" + searchWord + ")(?!([^<]+)?>)", "gi");
var output = originalString.replace(regEx, "<strong>$1</strong>");
alert(output);
Source: http://pureform.wordpress.com/2008/01/04/matching-a-word-characters-outside-of-html-tags/
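Since the search string comes from user input, it may contain characters that are special in a regex. The sketch below wraps the lookahead approach in a function and escapes the term first. The names `escapeRegExp` and `highlightOutsideTags` are hypothetical helpers, not part of the linked source.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Wrap every occurrence of `term` in <strong>, except occurrences that are
// followed by a ">" with no "<" in between (i.e. occurrences inside a tag).
function highlightOutsideTags(html, term) {
  const regEx = new RegExp("(" + escapeRegExp(term) + ")(?!([^<]+)?>)", "gi");
  return html.replace(regEx, "<strong>$1</strong>");
}

const originalString = 'The pattern to <span class="something">be replaced is pa but only outside the html tag</span>';
console.log(highlightOutsideTags(originalString, "pa"));
```

Using `$1` in the replacement keeps the original casing of each match. Like the pattern it is based on, this will misbehave if the text contains an unescaped `>` outside of a tag.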
Parse the HTML and find all text nodes in it, doing the replace in all of them. If you are using jQuery you can do this by just passing the snippet to $() which parses it in a Document Fragment, which you can then query or step over all elements and find all the .text() to replace.

Javascript to search string and return number of matches

I have a string with a variety of html tags in it. For example:
var str = "<div>My text is <a> here</a> and it is <a> very wonderful</a>. For an example of how very great <a> my text is</a>. Please have a look</div>"
I would like to use javascript string.replace() to replace only the words that are not inside of a tag (anchor in this case). It will also depend on user input, so I need to include a variable in the RegExp. I am testing first with string.match(); to verify that I only get one match.
I borrowed the code from javascript regex replace some words with links, but not within existing links and just switched .replace to .match. so I use the following code:
var word = "very";
var regex = new RegExp("/" + word + "(?![^<]*?<\/a>)/g");
console.log(str.match(regex).length);
It returns to me
TypeError: titleText.match(...) is null
which suggests to me that no matches are found. If I build it like this, however:
console.log(str.match("/very(?![^<]*?<\/a>)/g").length);
I get the expected number of results; in this case, 1;
Any suggestions?
Don't include the slashes when building up the regex as a string, and pass the flags as the second argument:
var regex = new RegExp(word + "(?![^<]*?<\/a>)","g");
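A minimal sketch of the corrected version, wrapped in a hypothetical helper `countOutsideAnchors`. It also guards against `match` returning `null` when nothing is found, which is what caused the TypeError in the question, and it escapes the word on the assumption that it comes from user input.

```javascript
const str = "<div>My text is <a> here</a> and it is <a> very wonderful</a>. For an example of how very great <a> my text is</a>. Please have a look</div>";

// Count occurrences of `word` that are not followed by a closing </a>
// before the next "<", i.e. occurrences outside of anchor text.
function countOutsideAnchors(text, word) {
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const regex = new RegExp(escaped + "(?![^<]*?<\\/a>)", "g");
  const matches = text.match(regex); // null when there is no match
  return matches === null ? 0 : matches.length;
}

console.log(countOutsideAnchors(str, "very"));
console.log(countOutsideAnchors(str, "missing"));
```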

Regular expression in javascript to match outside of XML tags

I want to find all matches of "a" in <span class="get">habbitant morbi</span> triastbbitique, except "a" inside tags (see below, where each "a" I want is marked with asterisks).
<span class="get">h*a*bbit*a*nt morbi</span> tri*a*stbbitique.
If I find them, I want to replace them while keeping the original tags.
This expression doesn't work:
var variable = "a";
var reg = new RegExp("[^<]."+variable+".[^>]$",'gi');
I would recommend not using a regular expression to parse HTML; it's not a regular grammar, and you will experience pain for all but simple cases.
Your question is still a bit unclear, but let me try rephrasing to see if I have it right:
You'd like to get all matches of a given string in an HTML document, except for matches in <tag> bodies?
Assuming you're using jQuery or similar:
// Let the browser parse it for you:
var container = document.createElement("div")
container.innerHTML = '<span class="get">habbitant morbi</span> triastbbitique'
var doc_text = $(container).text()
// And then you can just regex away normally:
doc_text.match(/a/gi)
(Even better would be to use DOMParser, but that doesn't have wide browser support yet)
If you're in Node, then you want to look for some libraries that help you parse HTML nodes (like jsdom), and then just splat out all the text nodes.
Note that this question isn't about parsing. This is lexing. Something that regex are regularly and properly used for.
If you want to go with regex there are a couple of ways you could do this.
A simple hack lookahead like:
a(?![^<>]*>)
Note that this won't properly handle < and > quoted inside tags or unescaped outside of tags.
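As a quick sketch, this is the simple lookahead applied to the string from the question. The replacement `'*$&*'` (wrapping each match in asterisks) is just an assumed way to make the matches visible.

```javascript
const input = '<span class="get">habbitant morbi</span> triastbbitique';

// Match "a" unless it is followed by ">" with no "<" or ">" in between,
// which is the case for any "a" inside a tag.
const marked = input.replace(/a(?![^<>]*>)/gi, '*$&*');

console.log(marked);
```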
A full blown tokenizer of the form:
(expression for tag|comments|etc)|(stuff outside that that i'm interested in)
Replaced with a function that does different things depending on which part was matched. If $1 matched, it is replaced by itself; if $2 matched, it is replaced with *$2*.
The full tokenizer way is of course not a trivial task, the spec isn't small.
But if simplifying to only match the basic tags, ignore CDATA, comments, script/style tags, etc, you could use the following:
var str = '<span class="a <lal> a" attr>habbitant 2 > morbi. 2a < 3a</span> triastbbitique';
var re = /(<[a-z\/](?:"[^"]*"|'[^']*'|[^'">]+)*>)|(a)/gi;
var res = str.replace(re, function(m, tag, a){
return tag ? tag : "*" + a + "*";
});
Result:
<span class="a <lal> a" attr>h*a*bbit*a*nt 2 > morbi. 2*a* < 3*a*</span> tri*a*stbbitique
This handles messy tags, quotes and unescaped </> in the HTML.
A couple of examples of tokenizing HTML tags with regex (which should translate fine to JS regex):
Remove on* JS event attributes from HTML tags
Regex to allow only set of HTML Tags and Attributes

regular expression (javascript) How to match anything between two tags any number of times

I'm trying to find all occurrences of items in HTML page that are in between <nobr> and </nobr> tags.
EDIT:(nobr is an example. I need to find content between random strings, not always tags)
I tried this
var match = /<nobr>(.*?)<\/nobr>/img.exec(document.documentElement.innerHTML);
alert (match);
But it gives only one occurrence. Also, it appears twice: once with the <nobr></nobr> tags and once without them. I need only the version without the tags.
You need to do it in a loop:
var match, re = /<nobr>(.*?)<\/nobr>/img;
while((match = re.exec(document.documentElement.innerHTML)) !== null){
alert(match[1]);
}
use the DOM
var nobrs = document.getElementsByTagName("nobr")
and you can then loop through all nobrs and extract the innerHTML or apply any other action on them.
(Since I can't comment on Rafael's correct answer...)
exec is doing what it is supposed to do - finding the first match, returning the result in the match object, and setting you up for the next exec call. The match object contains (at index 0) the whole of the string matched by the whole of the regex. In subsequent slots are the bits of the string matched by the parenthesized subgroups. So match[1] contains the bit of the string matched by "(.*?)" in your example.
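The question's edit mentions that the delimiters are not always tags. Under that assumption, a small sketch of the same `exec` loop with the delimiters passed in as plain strings could look like this. `escapeRegExp` and `findAllBetween` are hypothetical names, and both delimiters are assumed to be non-empty.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Collect the text between every open/close pair, without the delimiters.
function findAllBetween(text, open, close) {
  // [\s\S]*? is a lazy match that, unlike ".", also crosses line breaks.
  const re = new RegExp(escapeRegExp(open) + "([\\s\\S]*?)" + escapeRegExp(close), "gi");
  const found = [];
  let match;
  // The g flag makes each exec() call resume where the previous match ended.
  while ((match = re.exec(text)) !== null) {
    found.push(match[1]); // index 0 is the whole match, index 1 the group
  }
  return found;
}

console.log(findAllBetween("foo <nobr> hello </nobr> bar <nobr> world </nobr> foobar", "<nobr>", "</nobr>"));
console.log(findAllBetween("This is out. Bzz. This is in. unBzz.", "Bzz.", "unBzz."));
```

The regex object is created once per call and reused across iterations, which is what lets `lastIndex` advance from one match to the next.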
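You can use
// Keep the regex in a variable: a literal inside the loop condition is
// re-created on every pass, so its lastIndex would never advance.
var match, re = /<nobr>(.*?)<\/nobr>/img;
while (match = re.exec("foo <nobr> hello </nobr> bar <nobr> world </nobr> foobar"))
alert (match[1]);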
If the strings you're using aren't xml elements, and you're sticking with regexes the return value you're getting can be explained by the bracketing. .exec returns the whole matching string followed by the contents of the bracketed expressions.
If your doc contains:
This is out.
Bzz. This is in. unBzz.
then
/Bzz.(.*?)unBzz./img.exec(document.documentElement.innerHTML)
Will give you 'Bzz. This is in. unBzz.' in element 0 of the returned array and 'This is in.' in element 1. Trying to display the whole array gives both as a comma separated list because that's what JavaScript does to try to display it.
So
alert(match[1]);
is what you're after.
It takes two steps, but you could do it like this:
match = document.documentElement.innerHTML.match(/<nobr>(.*?)<\/nobr>/img)
alert(match)//includes '<nobr>'
match_length = match.length;
for (var i = 0; i < match_length; i++)
{
var match2 = match[i].match(/<nobr>(.*?)<\/nobr>/im);//same regex without the g option
alert(match2[1]);
}
