Regex to match everything outside of a regex pattern - javascript

So I'd like to use javascript to replace all the words outside of HTML tags in a body of text. Check the explanation below.
I'd like to convert this:
<tag with-attr="something"></tag><tag>Text to match</tag><tag>Text to Match</tag>
...to this:
<tag with-attr="something"></tag><tag>Manipulated Text</tag><tag>Manipulated Text</tag>
Now, I have a regular expression that can match each tag, including everything inside its angle brackets:
\<[^>]*\>
But I'm not sure how to invert the expression, so to speak.
EDIT
Also, I'm looking to use the replace / match functions, not split, since I want to retain the tag information and spit a working page back out with the new information.

Using a split() RegExp that includes capturing parentheses (so the matched tags are kept in the resulting array), plus further array methods, makes "stream processing" fairly simple:
'<tag with-attr="something"></tag><tag>Text to match</tag>Text to Match<tag>'
.split(/(<[^>]+>)/).map(function(x,i){
if(!(i%2) && x){ x= escape(x); }
return x;
}).join("");
example output:
"<tag with-attr="something"></tag><tag>Text%20to%20match</tag>Text%20to%20Match<tag>"
the escape() is just to show that the textContent has indeed been altered...
i only vouch for input close to your example. deeply nested or invalid HTML might fool any RegExp, but i'm sure someone else will bring that up...
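As a hedged sketch of the same split() idea, the helper below applies a caller-supplied function to the even-indexed pieces (the text runs) and leaves the odd-indexed pieces (the captured tags) untouched. The name transformTextOutsideTags and the upper-casing callback are made up for illustration, and the sketch assumes simple markup with no ">" inside attribute values.

```javascript
// Hypothetical helper: an illustration of the split() approach, not a full HTML parser.
function transformTextOutsideTags(html, transform) {
  // The capturing group keeps the tags in the split() result,
  // so the pieces alternate: text, tag, text, tag, ...
  return html
    .split(/(<[^>]+>)/)
    .map(function (piece, i) {
      // Even indexes hold text between tags; empty strings are left alone.
      return i % 2 === 0 && piece ? transform(piece) : piece;
    })
    .join("");
}

const splitResult = transformTextOutsideTags(
  '<tag with-attr="something"></tag><tag>Text to match</tag>',
  function (text) { return text.toUpperCase(); }
);
console.log(splitResult);
```

Passing the transformation in as a function keeps the tag handling separate from whatever you want to do with the text.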

Something like this
/>([^<>]*\w)</
demo here : http://rubular.com/r/2QPLjOeMAu
Now you just need to replace the content like this :
var str = '<tag with-attr="something"></tag><tag>Text to match</tag><tag>Text to Match</tag>';
var res = str.replace(/>([^<>]*\w)</g, '>Manipulated text<');
console.log(res);
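If the new text depends on the old text, replace() also accepts a function as its second argument. The sketch below is one way to do that; the helper name replaceTextBetweenTags and the bracket-wrapping callback are assumptions for illustration. It uses a lookahead for the closing "<" so that character is not consumed, and unlike the pattern above it does not require the text to end in a word character.

```javascript
// Hypothetical helper: replace() with a callback instead of a fixed string.
function replaceTextBetweenTags(html, transform) {
  // ">" followed by a run of non-bracket characters, with "<" next (not consumed).
  return html.replace(/>([^<>]+)(?=<)/g, function (match, text) {
    return ">" + transform(text);
  });
}

const sample = '<tag with-attr="something"></tag><tag>Text to match</tag><tag>Text to Match</tag>';
const bracketed = replaceTextBetweenTags(sample, function (text) {
  return "[" + text + "]";
});
console.log(bracketed);
```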

Related

Extract inner text from anchor tag string using a regular expression in JavaScript

I am new to AngularJS. I have a regex that matches all the anchor tags:
/<a[^>]*>([^<]+)<\/a>/g
I am using it with the match function on this string:
var str = 'abc.jagadale#gmail.com'
like so:
var value = str.match(/<a[^>]*>([^<]+)<\/a>/g);
I expect the output to be abc.jagadale#gmail.com, but I get back exactly the same string as the input. Can anyone please help me with this? Thanks in advance.
Why are you trying to reinvent the wheel?
You are trying to parse an HTML string with a regex, which is a very complicated task. Just use the DOM or jQuery to get the links' contents; they are made for this.
Put the HTML string in as the HTML of a jQuery/DOM element.
Then search this created DOM element for all the a elements
inside it and return their contents in an array.
This is how your code should look:
var str = 'abc.jagadale#gmail.com';
var results = [];
$("<div></div>").html(str).find("a").each(function(l) {
results.push($(this).text());
});
Demo:
var str = 'abc.jagadale#gmail.com';
var results = [];
$("<div></div>").html(str).find("a").each(function(l) {
results.push($(this).text());
});
console.log(results);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
You need to capture the group inside the anchor tags. The regular expression already contains the inner group ([^<]+), but there are different ways to extract that inner text.
When called without the g flag, match returns an array whose first element is the text matched by the whole regular expression and whose following elements are the capture groups.
Try this:
var reg = /<a[^>]*>([^<]+)<\/a>/g;
reg.exec(str)[1];
Note that with the g flag present, match returns only the full matches and leaves out the capture groups, which is why you get the whole string back.
Check https://javascript.info/regexp-groups for further documentation.
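A small sketch of the difference, using a made-up sample string (the original input is not shown in full, so anchorHtml is an assumption). It relies on String.prototype.matchAll, which needs an ES2020-capable environment.

```javascript
// Hypothetical sample input with two anchors.
const anchorHtml = '<a href="#one">first</a> and <a href="#two">second</a>';
const anchorPattern = /<a[^>]*>([^<]+)<\/a>/g;

// With the g flag, match() returns the full matches only (no capture groups).
const fullMatches = anchorHtml.match(anchorPattern);

// matchAll() yields one match object per hit, capture groups included.
const innerTexts = Array.from(anchorHtml.matchAll(anchorPattern), function (m) {
  return m[1];
});

console.log(fullMatches);
console.log(innerTexts);
```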
Brief
Don't use regex for this. Regex is a great tool, don't get me wrong, but it's not what you're looking for. Regex cannot properly parse HTML and should only be used to do so if it's a limited, known set of HTML.
Try, for example, adding content:">" to your style attribute. You'll see your pattern now fails or gives you an incorrect result. I don't like to use this quote all the time, but I think it's necessary to use it in this case:
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.
Use builtin functions. jQuery makes this super easy to accomplish. See my Code section for a demonstration. It's way more legible than any regex variant.
Code
DOM from page
The following snippet gets all anchors on the actual page.
$("a").each(function() {
console.log($(this).text())
})
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
abc.jagadale#gmail.com
abc2.jagadale#gmail.com
DOM in string
The following snippet gets all anchors in the string (converted to DOM element)
var s = `email3#domain.com
email4#domain.com`
$("<div></div>").html(s).find("a").each(function() {
console.log($(this).text())
})
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
email1#domain.com
email2#domain.com
Given the use case of parsing a string, instead of having an actual DOM to work with, it does seem like regex is the way to go, unless you want to load the HTML into a document fragment and parse that.
One way to get all of your matches is to make use of split:
var htmlstr = "<p><a href='url'>asdf#bsdf.com</a></p>"
var matches = htmlstr.split(/<a.+?>([A-Za-z.#]+)<\/a>/).filter((t, i) => i % 2)
Using a regex with split returns all of the matches along with the text around them, then filtering by index % 2 will pare it down to just the regex matches.

comparing and replacing using regex in javascript : leaving a word in between

I am trying to replace a pattern as below:
Original :
welocme
Need to be replaced as :
welcome
Tried the below approach:
String text = "welocme";
Pattern linkPattern = Pattern.compile("a href=\"#");
text = linkPattern.matcher(text).replaceAll("a href=\"javascript:call()\"");
But I am not able to add the idvalue in between. Kindly help me out.
Thanks in advance.
how about a simple
text.replaceAll("#idvalue","javascript:call('idvalue')")
for this case only. If you are looking to do something more comprehensive, then as suggested in the other answer, an XML parser would be ideal.
Try getting the part that might change and you want to keep as a group, e.g. like this:
text = text.replaceAll( "href=\"#(.*?)\"", "href=\"javascript:call('$1')\"" );
This matches href="#whatever", with whatever being caught by capturing group 1 and reinserted in the replacement string by using $1 as a reference to the content of group 1.
Note that applying regex to HTML and Javascript might be tricky (single or double quotes allowed, comments, nested elements etc.) so it might be better to use a html parser instead.
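Since the title mentions JavaScript, here is a sketch of the same capture-group idea with String.prototype.replace. The input markup is an assumption (the original anchor is not shown), and call is simply the function name used in the question.

```javascript
// Hypothetical input: the original markup is not shown, so this anchor is assumed.
const linkHtml = '<a href="#idvalue">welcome</a>';

// Group 1 captures the id after "#"; $1 re-inserts it in the replacement.
const rewrittenLink = linkHtml.replace(
  /href="#([^"]*)"/g,
  "href=\"javascript:call('$1')\""
);

console.log(rewrittenLink);
```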
Add a capture group to the matcher regex and then reference the group in the replacement. I found in the JavaDoc for Matcher that you need to use '$' instead of '\' to access the capture group in the replacement.
Code:
String text = "welcome";
System.out.println("input: " + text);
Pattern linkPattern = Pattern.compile("a href=\"#([^\"]+)\"");
text = linkPattern.matcher(text).replaceAll("a href=\"javascript:call('$1')\"");
System.out.println("output: " +text);
Result:
input: welcome
output: welcome

Javascript: replace() all but only outside html tags

I have an autocomplete form and when showing the results matching the user's search string, I want to highlight the search string itself. I plan to do this by wrapping any occurrence of the search string in a tag such as <strong>, or in an element with a given class. Now, the problem is that when using a regEx I run into trouble if the pattern occurs within an HTML tag.
For instance
var searchPattern = 'pa';
var originalString = 'The pattern to <span class="something">be replaced is pa but only outside the html tag</span>';
var regEx = new RegExp(searchPattern, "gi")
var output = originalString.replace(regEx, "<strong>" + searchPattern + "</strong>");
alert(output);
(Demo: http://jsfiddle.net/cumufLm3/7/ )
This is going to replace also the occurrence of "pa" within the tag
<span class="something">
breaking the code. I'm not sure how to deal with this. I've been checking various similar questions, and I've understood that in general I shouldn't use regular expressions to parse html. But I'm not sure if there is any quick way to parse smoothly the html string, alter the text of each node, and "rebuild" the string with the text altered?
Of course I suppose I could use $.parseHTML(), iterate over each node, and somehow rewrite the string, but this seems to me to be too complex and prone to errors.
Is there a smart way to parse the html string somehow to tell "do this only outside of html tags"?
Please notice that the content of the tag itself must be handled. So, in my example above, the replace() should act also on the part "be replaced is pa but only outside the html tag".
Any idea of either a regular expression solid enough to deal with this, or (better, I suppose) to elegantly handle the text parts within the html string?
Your code should look like this:
var searchWord = 'pa';
var originalString = 'The pattern to <span class="something">be replaced is pa but only outside the html tag</span>';
var regEx = new RegExp("(" + searchWord + ")(?!([^<]+)?>)", "gi");
var output = originalString.replace(regEx, "<strong>$1</strong>");
alert(output);
Source: http://pureform.wordpress.com/2008/01/04/matching-a-word-characters-outside-of-html-tags/
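One caveat with building the RegExp from user input is that characters such as "." or "(" in the search string are treated as regex syntax. A hedged sketch that escapes the search string first is shown below; the helper names escapeRegExp and highlightOutsideTags are made up for illustration, and the lookahead is a slightly simplified form of the one above.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: wrap matches that sit outside tags in <strong>.
function highlightOutsideTags(html, searchWord) {
  if (!searchWord) {
    return html; // an empty search string would otherwise match everywhere
  }
  // The negative lookahead rejects a match when a ">" comes before the
  // next "<", which means the match is inside a tag.
  const pattern = new RegExp("(" + escapeRegExp(searchWord) + ")(?![^<>]*>)", "gi");
  return html.replace(pattern, "<strong>$1</strong>");
}

const highlighted = highlightOutsideTags(
  'The pattern to <span class="something">be replaced is pa but only outside the html tag</span>',
  "pa"
);
console.log(highlighted);
```

Using $1 in the replacement keeps the original casing of each match instead of substituting the search string as typed.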
Parse the HTML and find all text nodes in it, doing the replace in all of them. If you are using jQuery you can do this by just passing the snippet to $() which parses it in a Document Fragment, which you can then query or step over all elements and find all the .text() to replace.

Javascript regular expression - get all text after [stuff in here]

I have strings in my program that are like so:
var myStrings = [
"[asdf] thisIsTheText",
"[qwerty] andSomeMoreText",
"noBracketsSometimes",
"[12345]someText"
];
I want to capture the strings "thisIsTheText", "andSomeMoreText", "noBracketsSometimes", "someText". The pattern of inputs will always be the same, square brackets with something in them (or maybe not) followed by some spaces (again, maybe not), and then the actual text I want.
How can I do this?
Thanks
One approach:
var actualTextYouWant = originalString.replace(/^\[[^\]]+\]\s*/, '');
This will return a copy of originalString with the initial [...] and whitespace removed.
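Applied to the whole array from the question, this could look like the following sketch (the variable name strippedStrings is an assumption):

```javascript
const myStrings = [
  "[asdf] thisIsTheText",
  "[qwerty] andSomeMoreText",
  "noBracketsSometimes",
  "[12345]someText"
];

// Remove a leading "[...]" and any whitespace after it; strings without
// a bracketed prefix are returned unchanged.
const strippedStrings = myStrings.map(function (s) {
  return s.replace(/^\[[^\]]+\]\s*/, "");
});

console.log(strippedStrings);
```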
This should get you started:
/^(?:\[[^\]]*\])?\s*(\w+)/

Using exec() to Match a String

I have a string, which I want to extract the value out. The string is something like this:
cdata = "![CDATA[cu1hcmod6rbg3eenmk9p80c484ma9B]]";
And I want cu1hcmod6rbg3eenmk9p80c484ma9B. In other words, I want anything inside the ![CDATA[*]].
I tried to use the following javascript snippet:
cdata = "![CDATA[cu1hcmod6rbg3eenmk9p80c484ma9B]]";
rePattern = new RegExp("![?:\\s+]]","m");
arrMatch = rePattern.exec( cdata );
result = arrMatch[0];
But the code is not working. I'm pretty sure that the way I specify the matching string is what's causing the problem. Any idea how to fix it?
Your pattern should be something like...
/^!\[CDATA\[(.+?)\]\]$/
Which is...
Match literal starting ![CDATA[.
Lazy match everything up until the closing ] and save it in capturing group $1 (thanks Phrogz for his excellent suggestion).
Match extra ]].
Your string should be available as arrMatch[1].
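Put together, a minimal sketch might look like this. The variable names cdataMatch and cdataValue are assumptions; the null check is there because exec() returns null when the pattern does not match.

```javascript
const cdata = "![CDATA[cu1hcmod6rbg3eenmk9p80c484ma9B]]";

// Literal "![CDATA[", a lazily matched capture group, then the closing "]]".
const cdataMatch = /^!\[CDATA\[(.+?)\]\]$/.exec(cdata);

// exec() returns null when nothing matches, so check before indexing.
const cdataValue = cdataMatch ? cdataMatch[1] : null;

console.log(cdataValue);
```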
Try this:
var cdata = "![CDATA[cu1hcmod6rbg3eenmk9p80c484ma9B]]";
var regPattern = /(.*CDATA\[)(.*)(\]\].*)/gm;
alert(cdata.replace(regPattern, "$2"));
