VB.Net remove whitespace with regex excluding content inside <script> - javascript

I want to reduce the size of my HTML output stream by removing all empty lines and whitespace. However, I'm not very good at regex, and the pattern I have seems to remove more than wanted, e.g. whole script blocks. How can I make sure that <script> blocks are kept intact?
This is what I have so far:
html = Regex.Replace(html, ">\s+<", "><", RegexOptions.Compiled)

I think you're looking for conditional regex (if-then-else constructs). Look at the examples on the "If-Then-Else" page of the Regex Tutorial.
Note that regex syntax differs between systems (.NET, Python, etc.).
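One way to keep <script> blocks intact without conditionals is to split the markup on those blocks and only collapse whitespace in the pieces between them. The sketch below is in JavaScript rather than VB.Net, and the helper name collapseOutsideScripts is made up for illustration. It assumes script blocks are not nested and that a literal </script> never appears inside a script string.

```javascript
// Hypothetical helper: collapse whitespace between tags,
// but leave <script>...</script> blocks untouched.
function collapseOutsideScripts(html) {
  // split() with a capturing group keeps the matched script blocks
  // in the result array, at the odd indexes.
  const parts = html.split(/(<script\b[\s\S]*?<\/script>)/i);
  return parts
    .map((part, i) => (i % 2 === 1 ? part : part.replace(/>\s+</g, "><")))
    .join("");
}

const sample =
  '<div>\n  <p>a</p>\n</div>\n<script>var s = "<i>a</i>  <b>b</b>";</script>';
console.log(collapseOutsideScripts(sample));
```

Whitespace that sits directly next to a script block is left alone by this sketch, because the ">" and "<" around it end up in different pieces.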

Related

Regex replace with multiple wildcards works in PHP, not in JavaScript

I'm attempting to implement center alignment for two Markdown parsers:
In PHP for Parsedown (successfully)
In JavaScript for Bootstrap Markdown (without success)
The idea I'm following and finding the easiest is to work with the final HTML output, and just snap inline styling onto the tags.
The following regex does what I need, it adds style="text-align:center;" to any element so far*, as needed:
$text = preg_replace('/\<(.*?)\>\->(.*?)<\-\<\/(.*?)\>/', '<$1 style="text-align:center;">$2</$3>', $text);
That is, <p>->text<-</p> becomes <p style="text-align:center;">text</p>.
However, when I attempted to port this into JavaScript to also make it available for previewing on client-side, the pattern does not match as it should:
content = content.replace('/\<(.*?)\>\->(.*?)<\-\<\/(.*?)\>/', '<$1 style="text-align:center;">$2</$3>');
The replacement in content does not occur.
I'm aware there are slight differences between Regex of PHP and JavaScript, but I have found examples for all the expected behavior here on both sides, working.
*If someone is wondering by any chance, I'm also successfully adding the center alignment to tags that already have a style attribute - on server side only, so far.
You'll need to use the literal syntax for regular expression in JavaScript, like so:
content = content.replace(/\<(.*?)\>\->(.*?)<\-\<\/(.*?)\>/gi, '<$1 style="text-align:center;">$2</$3>');
Note that the gi at the end of the regular expression enables global searching (that is, it replaces all occurrences matching the pattern) and case-insensitive matching. Both flags are technically optional, but you will almost certainly want the g flag. Whether to keep the i flag is up to you (it depends on whether or not your content contains &GT;, for example). The groups are kept lazy (.*?), as in the PHP version, so that with the g flag one match cannot swallow several elements at once.
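To make the difference concrete, here is a small sketch comparing the quoted-string call with the regex-literal call. The function name centerAlign and the sample input are only for illustration, and the pattern is the PHP one with the redundant escapes dropped.

```javascript
// Same pattern as the PHP version, written as a regex literal.
const centerPattern = /<(.*?)>->(.*?)<-<\/(.*?)>/g;

// Hypothetical helper wrapping the replacement.
function centerAlign(content) {
  return content.replace(centerPattern, '<$1 style="text-align:center;">$2</$3>');
}

const input = "<p>->a<-</p><h1>->b<-</h1>";

// A quoted pattern is searched for as plain text, so nothing is replaced.
const viaString = input.replace(
  "/<(.*?)>->(.*?)<-<\\/(.*?)>/",
  '<$1 style="text-align:center;">$2</$3>'
);

console.log(viaString === input); // the string version leaves the input as-is
console.log(centerAlign(input));
```

With a string as the first argument, replace() looks for that exact sequence of characters, slashes included, which is why the original call appeared to do nothing.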

regex to change text inside a html tag

First of all I'm new to stackoverflow so I'm sorry if I posted this in the wrong section.
I need a regex to search within the html tag and replace the - with a _
e.g:
<TAG-NAME>-100</TAG-NAME>
would become
<TAG_NAME>-100</TAG_NAME>
note that the value inside the tag wasn't affected.
Can anyone help?
Thanks.
Since JavaScript is the language for DOM manipulation, you should generally consider parsing the XML properly and using JavaScript's DOM traversal functions instead of regular expressions.
Here is some example code on how to parse an XML document so that you can use the DOM traversal functions. Then you can traverse all elements and change their names. This automatically excludes text nodes, attributes, comments and all the other annoying things you don't want to change.
If it has to be a regex, here is a makeshift solution. Note that it will fail badly if you have tags (or even just a >) inside attribute values or comments (in fact, it will also apply the replacement to comments):
str = str.replace(/-(?=[^<>]*>)/g, '_');
This will match a - if it is followed by a > without encountering a < before. The concept is called a (positive) lookahead. The g modifier makes sure that all occurrences are replaced.
Note that this will apply the replacement to anything in front of a >. Even attribute values. If you don't want that you could also make sure that there is an even number of quotes between the hyphen and the closing >, like this:
str = str.replace(/-(?=[^<>"]*(?:"[^<>"]*"[^<>"]*)*>)/g, '_');
This will still change attribute names though.
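A quick way to see the difference between the two patterns is to run both on a tag that has a hyphen in its name, in an attribute value, and in its text content. The sample markup and variable names below are made up for illustration.

```javascript
// Hyphens appear in the tag name, an attribute value, and the text content.
const xml = '<a-b title="x-y">-1</a-b>';

// First pattern: replaces any hyphen that is followed by ">" before the next "<".
const simple = xml.replace(/-(?=[^<>]*>)/g, "_");

// Second pattern: additionally requires an even number of quotes
// between the hyphen and the closing ">".
const quoteAware = xml.replace(/-(?=[^<>"]*(?:"[^<>"]*"[^<>"]*)*>)/g, "_");

console.log(simple);     // <a_b title="x_y">-1</a_b>
console.log(quoteAware); // <a_b title="x-y">-1</a_b>
```

In both cases the -1 in the text content is untouched, because a < is reached before any >.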
Here is a regexpal demo that shows what works and what doesn't work. Especially the comment behavior is quite horrible. Of course this could be taken care of with an even more complex regex, but I guess you see where this is going? You should really, really use an XML parser!
s/(\<[^\>]+\>)\-([^\<]+\<\/)/\1_\2/
I am not familiar with JS libraries, but I am pretty sure there are libraries better suited to parsing HTML.

Regex will not match

This is my string:
<link href="/post?page=4&tags=example" rel="last" title="Last Page">
From there I am trying to obtain the 4 out of that page parameter, using this regular expression:
link href="/post?page=(.*?)&tags=(.*?)" rel="last"
I will then collect the 4 from the first group. The tags parameter has a wildcard because its contents can change. However, I don't seem to be getting a match with this. Can anyone help?
And I know I shouldn't be using regex to parse HTML, but this is just a small thing and it would be a waste to import a huge module for this.
Assuming you are using a /regex literal/, you will need to escape the / in that path as \/.
Alternatively, it depends on how you are getting this string. Is it really typed that way, or is it part of an innerHTML that you are then reading out again? If that's the case, then the innerHTML won't be what you expect it to be, because the browser will "normalise" it.
If it is an innerHTML, then it'd be far easier to get the tag, then get the tag's href attribute, then regex that.
link href="/post\?page=(.*?)&tags=(.*?)" rel="last"
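You forgot the backslash before the ?. Unescaped, ? is a quantifier that makes the preceding t optional instead of matching a literal question mark.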
I think it might be better to change your capture groups to something a little different, but will catch everything up to the terminating character:
link href="/post\?page=([^&]+)&tags=([^\"]+)" rel="last"
Putting the negation character ^ first in a character class tells the regex engine to match any character except the ones listed. This makes it very easy to capture everything up to a terminating character, such as the ampersand or the double quote. (The ? is escaped here as well, so that it matches a literal question mark.) Assuming you're using PHP or Java, this should also slightly improve regex performance.
If the page parameter always comes first, try the PCRE /\?page=(\d+)/. Match group 1 will contain the page number.
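Putting these answers together, a minimal JavaScript sketch might look like the following. The function name extractLastPage and the returned object shape are assumptions for illustration. The / is escaped because a regex literal is used, and the ? is escaped so it is matched literally.

```javascript
const tag = '<link href="/post?page=4&tags=example" rel="last" title="Last Page">';

// Hypothetical helper: returns the page number and tags, or null if there is no match.
function extractLastPage(html) {
  const match = html.match(
    /link href="\/post\?page=([^&]+)&tags=([^"]+)" rel="last"/
  );
  return match ? { page: Number(match[1]), tags: match[2] } : null;
}

// The pattern from the question, with "?" left unescaped, for comparison.
const unescaped = new RegExp('link href="/post?page=(.*?)&tags=(.*?)" rel="last"');

console.log(extractLastPage(tag)); // { page: 4, tags: 'example' }
console.log(unescaped.test(tag));  // false
```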

How to add special HTML chars without using innerHTML

So I'm working on a micro lib, html.js, which basically creates text nodes with document.createTextNode. But when I want to create a text node from a&nbsp;b, the entity is not decoded and I get the literal text a&nbsp;b. So I'm wondering how to escape the & char, ideally without using innerHTML.
Javascript supports the \uXXXX notation, so in the case of a non-breaking space, that would be \u00A0.
document.createTextNode('a\u00A0b');
That's as far as you can get. It's a text node, consisting only of text, and there's no difference between texts created from entity references or from normal characters.
If that's not what you want, you should take a second look at innerHTML. Can't you read it, modify it and put it back?
There's not much functionality in JS to encode/decode HTML entities. It seems there are some libraries out there, though, that can help you achieve this. Here is one I found on Google. I haven't tried it, but you can check it out, or look for others.
http://www.strictly-software.com/htmlencode
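If only a handful of entities are involved, a small lookup table may be enough before handing the text to document.createTextNode. The sketch below is an assumption-laden illustration, not a full decoder: the names NAMED_ENTITIES and decodeBasicEntities are made up, and only five named entities are covered.

```javascript
// Hypothetical lookup table covering only a few named entities.
const NAMED_ENTITIES = { nbsp: "\u00A0", amp: "&", lt: "<", gt: ">", quot: '"' };

// Replaces the listed entities in a single pass, so "&amp;nbsp;"
// becomes the text "&nbsp;" and is not decoded a second time.
function decodeBasicEntities(text) {
  return text.replace(/&(nbsp|amp|lt|gt|quot);/g, (whole, name) => NAMED_ENTITIES[name]);
}

const decoded = decodeBasicEntities("a&nbsp;b");
console.log(decoded === "a\u00A0b"); // true

// In a browser, the decoded string can then be used directly:
// document.createTextNode(decoded);
```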

Cleaning whitespace from HTML with RegEx

Is it possible for a RegEx to clean up whitespace in HTML?
For example:
<p><b>foo</b> <i>bar</i></p>
<p>foo</p> <p>bar</p>
On the first line, the space between the closing b and opening i tag is valid (although it could be a &nbsp;). On the second line, however, it is whitespace that I wish to clean up, as it shouldn't have any semantic value.
Perhaps this would be better solved with DOM traversal?
Seems like something like HTML Tidy would be a better bet for what you're looking for - rather than needing to re-create all the potentially complex rules (such as your first whitespace in the example being significant, but not the 2nd, etc.)
Otherwise, I agree - DOM traversal would be a much better approach than regular expressions - especially if your HTML is already XHTML compliant and can be easily traversed as XML.
First I have to quote ;)
"asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system"
Then back to the business.
You could try a separate regex for each tag (although I doubt this is a valid method):
sed -e 's/<p>\ </<p></g'
That removes <p>(whitespace)<(whatever_tag) whitespace.
Otherwise, I too agree with the DOM traversal.
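As a rough regex-only illustration of the distinction in the question, the sketch below removes whitespace only when it sits between a closing and an opening block-level tag. The tag list, the names BLOCK, betweenBlocks and stripBlockGaps, and the choice of JavaScript are all assumptions for illustration. It does not handle comments, pre-formatted content, or whitespace between two closing tags.

```javascript
// Assumed (incomplete) list of block-level tag names.
const BLOCK = "(?:p|div|ul|ol|li|h[1-6])";

// Whitespace after a closing block tag, when an opening block tag follows.
const betweenBlocks = new RegExp("(</" + BLOCK + ">)\\s+(?=<" + BLOCK + "\\b)", "gi");

// Hypothetical helper: drops the whitespace and keeps the closing tag.
function stripBlockGaps(html) {
  return html.replace(betweenBlocks, "$1");
}

const html = "<p><b>foo</b> <i>bar</i></p>\n<p>foo</p> <p>bar</p>";
console.log(stripBlockGaps(html));
// <p><b>foo</b> <i>bar</i></p><p>foo</p><p>bar</p>
```

The space between the inline b and i elements survives because neither tag is in the block list.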
