Regex to put quotes for html attributes - javascript

I have a scenario like this
In HTML tags, if an attribute value is not surrounded by either single or double quotes, I want to wrap it in double quotes.
How do I write a regex for that?

If you repeat this replacement once for each unquoted attribute there might be in a tag, it should work, so long as the text is fairly normal and does not contain lots of special characters that might give false positives.
"<a href=www.google.com title = link >".replace(/(<[^>]+?=)([^"'\s][^\s>]+)/g,"$1'$2'")
Regex says: an open tag (<) followed by one or more characters that are not a close tag ([^>]+), matched ungreedily (?), followed by equals (=), all captured as the first group ((...)). The second group ((...)) captures one character that is not a single quote, double quote, or whitespace ([^"'\s]), followed by one or more (+) characters that are not whitespace or a close tag ([^\s>]). The match is then replaced with the first captured group ($1) followed by the second captured group in single quotes ('$2').
For example with looping:
var html = "<a href=www.google.com another=something title = link >";
var newhtml = null;
// Repeat the single (non-global) replacement until the string stops changing.
while (html != newhtml) {
  if (newhtml)
    html = newhtml;
  newhtml = html.replace(/(<[^>]+?=)([^"'\s][^\s>]+)/, "$1'$2'");
}
alert(html);
But this is a bad way to go about your problem. It is better to use an HTML parser to parse the HTML, then re-format it as you want it. That would ensure well-formatted HTML, whereas regular expressions can only ensure well-formatted HTML if the input is exactly as expected.

Very helpful! I made a slight change to allow it to match attributes with a single character value:
/(<[^>]+?=)([^"'\s>][^\s>]*)/g (changed one or more + to zero or more * and added > to the first match in second group).
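A minimal sketch of the looping approach combined with the adjusted pattern from this comment. The helper name quoteAttributes is hypothetical, and double quotes are used here only because the question asked for them; the answer above uses single quotes.

```javascript
// Hypothetical helper, not part of the original answers.
function quoteAttributes(html) {
  // Non-global on purpose: each pass quotes at most one attribute value,
  // and the loop ends once a pass leaves the string unchanged.
  const unquoted = /(<[^>]+?=)([^"'\s>][^\s>]*)/;
  let previous;
  do {
    previous = html;
    html = html.replace(unquoted, '$1"$2"');
  } while (html !== previous);
  return html;
}

const quoted = quoteAttributes("<a href=www.google.com another=something title = link >");
console.log(quoted);
```

Note that an attribute written with spaces around the equals sign (title = link in the sample) is left untouched, because the pattern requires the value to start immediately after the =.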

Related

Extracting and replacing html link tag with regex

I am trying to do some html scraping with JavaScript, and would like to take the a href link and replace it into a hyperlink on a Discord embed. I am having trouble with regex, I am finding it very difficult to learn.
I assume I will also need another regex to capture it all so I can replace it with my desired target?
This is an example raw html that I have:
An **example**, also known as a <a href="https://www.example.com/example%20type">example type</a>
to make this readable within a Discord embed, I am looking for a desired output of:
An **example**, also known as a [**example type**](https://www.example.com/example%20type)
I have tried extracting the URL via regex, which I can match; however, I am having issues with extracting both the link and the link text (I think it's called the target? The 'example type' in the example) and then replacing the string with my desired output.
I have the following: (https://regexr.com/73574)
/href="[^"]+/g
This matches href="https://www.example.com/example%20type, and feels like a very early step, it includes 'href' in the match, and it does not capture the target.
EDIT:
I apologise, I did not think about additional checks, what if the string has multiple links? and text after them, for example:
An **example**, also known as a <a href="https://www.example.com/example%20type">example type</a> is the first example, and now I have <a href="https://www.example.com/second">second</a> example
with a desired output of:
An **example**, also known as a [**example type**](https://www.example.com/example%20type) is the first example, and now I have [**second**](https://www.example.com/second) example
Try this: (?<=href=")[^"]*
By using a lookbehind, you can now verify that the text behind is equal to href=" without capturing it
Demo: https://regex101.com/r/2qMnPt/1
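As a rough illustration of the lookbehind suggestion, the sketch below collects every href value from a string. The sample string is made up for the example, and lookbehind needs a reasonably recent JavaScript engine.

```javascript
// Made-up sample input with two links.
const html = 'An <a href="https://www.example.com/example%20type">example type</a> and a <a href="https://www.example.com/second">second</a> link';

// (?<=href=") asserts that href=" comes right before the match without including it.
const urls = html.match(/(?<=href=")[^"]*/g) ?? [];
console.log(urls);
```

This only yields the URLs; the link text still has to be captured separately, which is what the group-based answers address.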
You can use regular expression groups to capture things that interest you. My regular expression here might be far from perfect but I don't think that's important here - it shows you a way and you can always improve it if needed.
Things you have to do:
prepare regex that captures groups that you need (anchor tag, anchor text, anchor url),
remove the anchor tag completely from the text
inject anchor text and anchor href into the final string
Here's a quick code example of that:
const anchorRegex = /(<a\shref="([^"]+)">(.+?)<\/a>)/i;
const textToBeParsed = `An **example**, also known as a <a href="https://www.example.com/example%20type">example type</a>`;
const parseText = (text) => {
  // Match against the function argument, not the outer variable.
  const matches = anchorRegex.exec(text);
  if (!matches) {
    console.warn("Something went wrong...");
    return;
  }
  const [, fullAnchorTag, anchorUrl, anchorText] = matches;
  const textWithoutAnchorTag = text.replace(fullAnchorTag, '');
  return `${textWithoutAnchorTag}[**${anchorText}**](${anchorUrl})`;
};
console.log(parseText(textToBeParsed));
Solution:
const input = 'An **example**, also known as a example type first and second here no u and then done noice';
const output = input.replace(/<a href="([^"]+)">([^<]+)<\/a>/g, '[**$2**]($1)')
console.log(output);
Regex breakdown:
<a href=" - Matches the start of the opening HTML tag, up to and including the opening double quote
([^"]+) - This is a capturing group, matches a number of characters that are not double quotes
"> - Matches the closing double quotes, including the closing tag '>'
([^<]+) - Another capturing group, matches a number of characters that are not a less than symbol
<\/a> - Matches the closing HTML tag
I then use the replace method seen in my output variable.
The replace call takes two arguments (regex, replaceWith).
The first argument is the regex. The second argument, [**$2**]($1), uses the capturing groups from the regex: the first group, $1, provides the link within the HTML tag, and $2 provides the target (the text after the link; for example, in my input variable the first target you see is 'example type').
The only important parts of this argument are $2 and $1; however, I wanted to display them in a certain way: [**target**](link).
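The same replacement can also be written with a replacer function instead of a $1/$2 string, which leaves room for extra processing of the URL or text later. This is only a sketch: the helper name anchorsToMarkdown is hypothetical, and the sample input is modelled on the multi-link example from the question.

```javascript
// Hypothetical helper; the regex is the one broken down above.
function anchorsToMarkdown(text) {
  return text.replace(
    /<a href="([^"]+)">([^<]+)<\/a>/g,
    // The callback receives the whole match, then each capture group in order.
    (whole, url, label) => `[**${label}**](${url})`
  );
}

const sample =
  'An **example**, also known as a <a href="https://www.example.com/example%20type">example type</a>' +
  ' is the first example, and now I have <a href="https://www.example.com/second">second</a> example';
const markdown = anchorsToMarkdown(sample);
console.log(markdown);
```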

Is it possible to combine those two regex or improve my code?

I would like to know if there is a way to combine those two regex below OR a way to combine my two tasks in another way.
1) /(<\/[a-z]>|<[a-z]*>)/g
2) /\s{2,}/g
Specifically, they are used to replace this:
This is <b>a test</b> and this <i> is also a test</i>
Into this:
This is <b> a test </b> and this <i> is also a test </i>
The first regex adds a space before and after every opening and closing tag, and the second regex matches every occurrence of two or more whitespace characters so they can be collapsed into a single space.
Here is the code
var inputString = 'This is <b>a test</b> and this <i> is also a test</i>',
spacedTags = inputString.replace(/(<\/[a-z]>|<[a-z]*>)/g, ' $1 '),
sanitizedSting = spacedTags.replace(/\s{2,}/g, ' ')
console.log(sanitizedSting);
and the jsfiddle.
I know those can be done using DOM manipulation which will probably be even faster but I'm trying to avoid this.
Thank you
If you look for trailing and preceding spaces, then use the inner capture group as the replacement value you can achieve something similar.
var inputString = 'This is <b>a test</b> and this <i> is also a test</i>',
spacedTags = inputString.replace(/(\s*(<\/[a-z]>|<[a-z]*>)\s*)/g, ' $2 ');
console.log(spacedTags);
JS Fiddle
This looks for anything that matches an opening or closing tag, optionally surrounded by whitespace. It then uses the inner match as the replacement, with a space added on either side.
Both implementations, though, always leave a trailing space after any closing tag: "</i> "
I haven't looked into the performance impact of this, but it addresses the goal of using a single regular expression.
Is your problem that you may add a space where there already is one? In that case, discard all spaces before and after your tag:
sanitizedSting = inputString.replace(/\s*(<\/?[a-z]*>)\s*/g, ' $1 ');
This also adds a space at the end if you end with a tag (frankly, there are other problems with this exact code).
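One way to deal with the leftover space at the end, sketched under the assumption that tags never sit directly next to each other (adjacent tags would still produce a double space), is to trim the result:

```javascript
const inputString = 'This is <b>a test</b> and this <i> is also a test</i>';

// Swallow any whitespace around each simple tag, re-insert exactly one space
// on each side, then trim the space left at the very end of the string.
const spaced = inputString.replace(/\s*(<\/?[a-z]+>)\s*/g, ' $1 ').trim();
console.log(spaced);
```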

Javascript replace matched group

I'm trying to build a text formatter that will add p and br tags to text based on line breaks. I currently have this:
s.replace(/\n\n/g, "\n</p><p>\n");
Which works wonderfully for creating paragraph ends and beginnings. However, finding instances of a single line break isn't working so well. Attempting a matched group replacement isn't working, as it ignores the parentheses and replaces the entire regex match:
s.replace(/\w(\n)\w/g, "<br />\n");
I've tried removing the g option (still replaced entire match, but only on first match). Is there another way to do this?
Thanks!
You can capture the parts you don't want to replace and include them in the replacement string with $ followed by the group number:
s.replace(/(\w)\n(\w)/g, "$1<br />\n$2");
See this section in the MDN docs for more info on referring to parts of the input string in your replacement string.
Catch the surrounding characters also:
s.replace(/(\w)(\n\w)/g, "$1<br />$2");
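A small comparison, using a made-up input string: because the capture-group patterns consume the character after the newline, two single line breaks separated by just one character cannot both match. Lookarounds (an alternative not mentioned in the answers above, and one that needs lookbehind support) check the neighbours without consuming them.

```javascript
const s = "a\nb\nc\n\nnext";

// Capture groups: the "b" is consumed by the first match, so the
// newline right after it has no preceding \w left to pair with.
const withGroups = s.replace(/(\w)\n(\w)/g, "$1<br />\n$2");

// Lookarounds: only the newline itself is matched and replaced.
const withLookarounds = s.replace(/(?<=\w)\n(?=\w)/g, "<br />\n");

console.log(JSON.stringify(withGroups));
console.log(JSON.stringify(withLookarounds));
```

In both versions the double newline is left alone, so the paragraph replacement from the question can still handle it.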

Remove image elements from string

I have a string that contains HTML image elements that is stored in a var.
I want to remove the image elements from the string.
I have tried: var content = content.replace(/<img.+>/,"");
and: var content = content.find("img").remove(); but had no luck.
Can anyone help me out at all?
Thanks
var content = content.replace(/<img[^>]*>/g,"");
[^>]* means any number of characters other than >. If you use .+ instead and there are multiple tags, the replace operation removes them all at once, including any content between them. Quantifiers are greedy by default, meaning they take the largest possible valid match.
/g at the end means replace all occurrences (by default, it only removes the first occurrence).
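To make the difference concrete, here is a short sketch with a made-up string containing two images:

```javascript
const content = '<p>Before <img src="a.png"> middle <img src="b.png"> after</p>';

// Greedy .+ runs to the last ">" on the line, taking the text and the closing </p> with it.
const greedy = content.replace(/<img.+>/, "");

// [^>]* stops at the first ">" after "<img", and /g repeats this for every tag.
const perTag = content.replace(/<img[^>]*>/g, "");

console.log(greedy);
console.log(perTag);
```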
$('<p>').html(content).find('img').remove().end().html()
The following Regex should do the trick:
var content = content.replace(/<img[^>"']*((("[^"]*")|('[^']*'))[^"'>]*)*>/g,"");
It first matches the <img. Then [^>"']* matches any character except for >, " and ' any number of times. Then (("[^"]*")|('[^']*')) matches two " with any character in between (except " itself, which is this part [^"]*) or the same thing, but with two ' characters.
An example of this would be "asf<>!('" or 'akl>"<?'.
This is again followed by any character except for >, " and ' any number of times. The Regex concludes when it finds a > outside a set of single or double quotes.
This would then account for having > characters inside attribute strings, as pointed out by #Derek 朕會功夫 and would therefore match and remove all four image tags in the following test scenario:
<img src="blah.png" title=">:(" alt=">:)" /> Some text between <img src="blah.png" title="<img" /> More text between <img /><img src='asdf>' title="sf>">
This is of course inspired by #Matt Coughlin's answer.
Use the text() function, it will remove all HTML tags!
var content = $("<p>"+content+"</p>").text();
I'm in IE right now... this worked great, but my tags come out in upper case (after using innerHTML, I think), so I added "i" to make it case insensitive. Now Chrome and IE are happy.
var content = content.replace(/<img[^>]*>/gi,"");
Does this work for you?:
var content = content.replace(/<img[^>]*>/g, '')
You could load the text as a DOM element, then use jQuery to find all images and remove them. I generally try to treat XML (html in this case) as XML and not try to parse through the strings.
var element = $('<p>My paragraph has images like this <img src="foo"/> and this <img src="bar"/></p>');
element.find('img').remove();
newText = element.html();
console.log(newText);
To do this without regex or libraries (read jQuery), you could use DOMParser to parse your string, then use plain JS to do any manipulations and re-serialize to get back your string.

Javascript match() function returning full matched tag

console.log( html.match( /<a href="(.*?)">[^<]+<\/a>/g ));
Instead of returning just the urls like:
http://google, http://yahoo.com
It's returning the entire tag:
<a href="http://google">Google.com</a>, <a href="http://yahoo.com">Yahoo.com</a>
Why is that the case?
You want RegExp#exec and a loop accessing the element at the match result's 1 index, rather than String.match. String.match doesn't return the capture groups when there's a g flag, just an array of the elements at index 0 of each match, which is the whole matching string. (See Section 15.5.4.10 of the spec.)
So in essence:
var re, match, html;
re = /<a href="(.*?)">[^<]+<\/a>/g;
html = 'Testing one two three one two three foo';
re.lastIndex = 0; // Work around literal bug in some implementations
for (match = re.exec(html); match; match = re.exec(html)) {
display(match[1]);
}
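In engines that support it, String.prototype.matchAll is another way to reach the capture groups of every match without managing the exec loop by hand. A minimal sketch, with a made-up html string:

```javascript
// Made-up sample input.
const html = '<a href="http://google.com">Google.com</a>, <a href="http://yahoo.com">Yahoo.com</a>';
const re = /<a href="(.*?)">[^<]+<\/a>/g;

// With the g flag, match() returns only the whole matched strings.
const wholeTags = html.match(re);

// matchAll() yields one match object per hit, so index 1 is the captured URL.
const urls = Array.from(html.matchAll(re), (m) => m[1]);

console.log(wholeTags);
console.log(urls);
```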
Live example
But this is parsing HTML with regular expressions. Here There Be Dragons.
Update re dragons, here's a quick list of things that will defeat this regexp, off the top of my head:
Anything other than exactly one space between the a and href, such as two spaces rather than one, a line break, class='foo', etc., etc.
Using single quotes rather than double quotes around the href attribute.
Not using quotes around the href attribute at all.
Anything after the href attribute that also uses double quotes, e.g.:
<a href="http://google.com" class="foo">
This is not to be down on your regexp, it's just to highlight that regular expressions can't be reliably used on their own to parse HTML. They can form part of the solution, helping you scan for tokens, but they can't reliably do the whole job.
While it is true you cannot reliably _parse_ HTML using regular expressions, this is not what the OP is asking.
Rather, the OP requires a way to extract anchor links from an HTML document which is easily and admirably handled using regular expressions.
Of the four problems listed by the previous responder:
multiple spaces between parts of the anchor
using single rather than double quotation marks
not using quotation marks at all to delimit the href attribute
having other leading or trailing attributes other than href
Only number 3 poses significant problems for a single regular expression solution, but also happens to be completely non-standard HTML which should never appear in an HTML document. (Note if you find HTML that contains non-delimited tag properties, there is a regular expression that will match them, but I maintain they aren't worth extracting. YMMV - Your mileage may vary.)
To extract anchor links (hrefs) using regular expressions from HTML, you would use this regular expression (in commented form):
< # a literal '<'
a # a literal 'a'
[^>]+? # one or more chars which are not '>' (non-greedy)
href= # literal 'href='
('|") # either a single or double-quote captured into group #1
([^\1]+?) # one or more chars, matched lazily up to the closing quote, captured into group #2 (inside [...], \1 is not a backreference)
\1 # whatever capture group #1 matched
which, without comments, is:
<a[^>]+?href=('|")([^\1]+?)\1
(Note that we do not need to match anything past the final delimiter, including the rest of the tag, since all we are interested in is the anchor link.)
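One caveat about the [^\1] part, sketched below: inside a character class, JavaScript (without the u flag) reads \1 as the character U+0001 rather than as a backreference, so the class accepts quote characters too. The pattern still works for typical input because the lazy +? stops at the first closing quote. The variable names and the "explicit" variant with a negative lookahead are illustrative assumptions, not part of the original answer.

```javascript
// [^\1] means "anything except U+0001", so a double quote is accepted.
const classAcceptsQuote = /^[^\1]$/.test('"');
const classAcceptsU0001 = /^[^\1]$/.test("\u0001");

const original = /<a[^>]+?href=('|")([^\1]+?)\1/i;
// Illustrative variant: (?!\1) really does refuse the opening quote character.
const explicit = /<a[^>]+?href=('|")((?:(?!\1).)+?)\1/i;

// An empty href shows the difference: the original swallows the closing quote.
const emptyHref = '<a href="" class="foo">';
const originalResult = original.exec(emptyHref);
const explicitResult = explicit.exec(emptyHref);

console.log(classAcceptsQuote, classAcceptsU0001);
console.log(originalResult && originalResult[2], explicitResult);
```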
In JavaScript and assuming 'source' contains the HTML from which you wish to extract anchor links:
var source='<a href="double-quote test">\n'+
'<a href=\'single-quote test\'>\n'+
'<a class="foo" href="leading prop test">\n'+
'<a href="trailing prop test" class="foo">\n'+
'<a style="bar" link="baz" '+
'name="quux" '+
'href="multiple prop test" class="foo">\n'+
'<a class="foo"\n href="inline newline test"\n style="bar"\n />';
which, when printed to the console, reads as:
<a href="double-quote test">
<a href='single-quote test'>
<a class="foo" href="leading prop test">
<a href="trailing prop test" class="foo">
<a style="bar" link="baz" name="quux" href="multiple prop test" class="foo">
<a class="foo"
href="inline newline test"
style="bar"
/>
you would write the following:
var RE=new RegExp(/<a[^>]+?href=('|")([^\1]+?)\1/gi),
match;
while(match=RE.exec(source)) {
console.log(match[2]);
}
which prints the following lines to the console:
double-quote test
single-quote test
leading prop test
trailing prop test
multiple prop test
inline newline test
Notes:
Code tested in nodejs v0.5.0-pre but should run under any modern JavaScript.
Since the regular expression uses capture group #1 to note the leading delimiting quote, the resulting link appears in capture group #2.
You might wish to validate the existence, type and length of match using:
if(match && typeof match === 'object' && match.length > 2) {
console.log(match[2]);
}
but it really shouldn't be necessary since RegExp.exec() returns 'null' on failure. Also, note that the correct typeof match is 'object', not 'Array'.
