Regexp for matching numbers and units in an HTML fragment?

I'm trying to make a regexp that will match numbers, excluding numbers that are part of other words or numbers inside certain html tags. The part for matching numbers works well but I can't figure out how to find the numbers inside the html.
Current code:
//number regexp part
var prefix = '\\b()';//for future use
var baseNumber = '((\\+|-)?([\\d,]+)(?:(\\.)(\\d+))?)';
var SIBaseUnit = 'm|kg|s|A|K|mol|cd';
var SIPrefix = 'Y|Z|E|P|T|G|M|k|h|ia|d|c|m|µ|n|p|f|a|z|y';
var SIUnit = '(?:('+SIPrefix+')?('+SIBaseUnit+'))';
var generalSuffix = '(PM|AM|pm|am|in|ft)';
var suffix = '('+SIUnit+'|'+generalSuffix+')?\\b';
var number = '(' + prefix + baseNumber + suffix + ')';
//trying to make it match only when not within tags or inside excluded tags
var htmlBlackList = 'script|style|head'
var htmlStartTag = '<[^(' + htmlBlackList + ')]\\b[^>]*?>';
var reDecimal = new RegExp(htmlStartTag + '[^<]*?' + number + '[^>]*?<');

<script>
var htmlFragment = "<script>alert('hi')</script>";
var style = "<style>.foo { font-size: 14pt }</style>";
// ...
</script>
<!-- turn off this style for now
<style> ... </style>
-->
Good luck getting a regular expression to figure that out.
You're using JavaScript, so I'm guessing you're probably running in a browser. Which means you have access to the DOM, giving you access to the browser's very capable HTML parser. Use it.
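A minimal sketch of that idea, assuming the fragment has already been parsed into DOM nodes. The names collectText and SKIPPED_TAGS are made up for illustration and are not from the question. The walker only relies on the standard nodeType, nodeName, nodeValue and childNodes properties, so comments and blacklisted elements are skipped without any regex.

```javascript
// Sketch: gather the text of a DOM subtree, skipping blacklisted elements.
// `collectText` and `SKIPPED_TAGS` are illustrative names (assumptions).
const SKIPPED_TAGS = ['SCRIPT', 'STYLE', 'HEAD'];

function collectText(node, out = []) {
  if (node.nodeType === 3) {
    // 3 === Node.TEXT_NODE: real page text, safe to scan for numbers
    out.push(node.nodeValue);
  } else if (node.nodeType === 1 &&
             !SKIPPED_TAGS.includes(node.nodeName.toUpperCase())) {
    // 1 === Node.ELEMENT_NODE: descend unless the element is blacklisted
    for (const child of Array.from(node.childNodes)) {
      collectText(child, out);
    }
  }
  // Comment nodes (nodeType 8) fall through and are ignored.
  return out;
}

// Assumed browser usage:
//   const holder = document.createElement('div');
//   holder.innerHTML = htmlFragment;
//   const texts = collectText(holder);
```

The number regexp from the question can then be run against each returned string on its own, without any tag handling in the pattern.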

A [^ ... ] negated character class only matches a single character; it cannot negate a compound expression like (script|style|head). What you want is a negative lookahead, (?! ... ):
var htmlStartTag = '<(?!(' + htmlBlackList + ')\\b)[^>]*?>';
(?! ... ) means 'not followed by ...', whereas [^ ... ] means 'a single character not in ...'.
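To make the difference concrete, here is a small sketch comparing the two start-tag patterns on a few tags. The variable names charClassTag and lookaheadTag are just for illustration.

```javascript
const blackList = 'script|style|head';

// Original attempt: [^(...)] is a character class, so it rejects any tag
// whose first letter is one of ( s c r i p t | y l e h a d ).
const charClassTag = new RegExp('<[^(' + blackList + ')]\\b[^>]*?>');

// Negative lookahead: rejects only tags whose name is script, style or head.
const lookaheadTag = new RegExp('<(?!(?:' + blackList + ')\\b)[^>]*?>');

for (const tag of ['<p>', '<div>', '<b>', '<script>', '<scripted>']) {
  console.log(tag, charClassTag.test(tag), lookaheadTag.test(tag));
}
```

With the character class, harmless tags such as <p> and <div> are rejected simply because their first letter appears in the blacklist string, while the lookahead version only rejects the blacklisted names themselves.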

I'm trying to make a regexp that will match numbers, excluding numbers that are part of other words or numbers inside certain html tags.
Regex cannot parse HTML. Do not use regex to parse HTML. Do not pass Go. Do not collect £200.
To ‘only match something not-within something else’ you would need a negative lookbehind assertion (“(?<!”), but JavaScript RegExps did not support lookbehind before ES2018, and many other regex implementations don't support the complex variable-length lookbehind you'd need to have any hope of matching a context like being inside a tag. Even with variable-length lookbehind, you still could not reliably parse HTML, because, as has been mentioned many times, regex cannot parse HTML.
Use an HTML parser. A browser HTML parser will be able to digest even partial input without complaining.

Related

RegEx working in JavaScript but not in C#

I currently have a working WordWrap function in Javascript that uses RegEx. I pass the string I want wrapped and the length I want to begin wrapping the text, and the function returns a new string with newlines inserted at appropriate locations in the string as shown below:
wordWrap(string, width) {
let newString = string.replace(
new RegExp(`(?![^\\n]{1,${width}}$)([^\\n]{1,${width}})\\s`, 'g'), '$1\n'
);
return newString;
}
For consistency purposes I won't go into, I need to use an identical or similar RegEx in C#, but I am having trouble successfully replicating the function. I've been through a lot of iterations of this, but this is what I currently have:
private static string WordWrap(string str, int width)
{
Regex rgx = new Regex("(?![^\\n]{ 1,${" + width + "}}$)([^\\n]{1,${" + width + "}})\\s");
MatchCollection matches = rgx.Matches(str);
string newString = string.Empty;
if (matches.Count > 0)
{
foreach (Match match in matches)
{
newString += match.Value + "\n";
}
}
else
{
newString = "No matches found";
}
return newString;
}
This inevitably ends up finding no matches regardless of the string and length I pass. I've read that the RegEx used in JavaScript is different than the standard RegEx functionality in .NET. I looked into PCRE.NET but have had no luck with that either.
Am I heading in the right general direction with this? Can anyone help me convert the first code block in JavaScript to something moderately close in C#?
edit: For those looking for more clarity on what the working function does and what I am looking for the C# function to do: What I am looking to output is a string that has a newline (\n) inserted at the width passed to the function. One thing I forgot to mention (but really isn't related to my issue here) is that the working JavaScript version finds the end of the word so it doesn't cut up the word. So for example this string:
"This string is really really long so we want to use the word wrap function to keep it from running off the page.\n"
...would be converted to this with the width set to 20:
"This string is really \nreally long so we want \nto use the word wrap \nfunction to keep it \nfrom running off the \npage.\n"
Hope that clears it up a bit.

The JavaScript and C# regex engines are different. Each language has its own regex implementation, so a pattern that works in one language is not guaranteed to work in another.
For example, C# has long supported named groups, while JavaScript only gained them in ES2018.
There are several such differences between the two regex flavors.

There are issues with the way you've translated the regex pattern from a JavaScript string to a C# string.
You have extra whitespace in the C# version, and you've also left in the $ symbols and curly brackets { } that belong to the template-literal interpolation syntax of the JavaScript version (they are not part of the actual regex pattern).
You have:
"(?![^\\n]{ 1,${" + width + "}}$)([^\\n]{1,${" + width + "}})\\s"
when what I believe you want is:
"(?![^\\n]{1," + width + "}$)([^\\n]{1," + width + "})\\s"

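One way to check what the C# concatenation should produce is to print the pattern text the JavaScript template literal actually generates. This is only a sketch: buildWrapPattern is a hypothetical helper name, and wordWrap mirrors the function from the question.

```javascript
// Hypothetical helper: returns the regex source that the template literal builds.
function buildWrapPattern(width) {
  return `(?![^\\n]{1,${width}}$)([^\\n]{1,${width}})\\s`;
}

// Same behaviour as the question's wordWrap: the whitespace that follows
// each chunk of up to `width` characters is replaced by a newline.
function wordWrap(text, width) {
  return text.replace(new RegExp(buildWrapPattern(width), 'g'), '$1\n');
}

console.log(buildWrapPattern(20)); // (?![^\n]{1,20}$)([^\n]{1,20})\s
console.log(JSON.stringify(wordWrap('aaa bbb ccc ddd', 7))); // "aaa bbb\nccc ddd"
```

Note that the JavaScript version uses replace rather than collecting matches, so the text after the last break is kept as-is.
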
Manipulate Javascript RegExp occurrences in string

I'm converting phone numbers into a clickable url. I have an html string with random phone numbers appearing in a non particular/consistent way:
Fidel Velazquez (834)316-90-90 ↵Libertad (834) 316-2930 ↵
I'm using regex in order to search the occurrences and "linkify" them, this is the function that helps with that:
var regex = new RegExp(
"\\+?\\(?\\d*\\)? ?\\(?\\d+\\)?\\d*([\\s./-]?\\d{2,})+",
"g"
);
return inputText.replace(regex, '$&');
The problem is, that for the example above, I'm getting a white space before the phone number, and that is preventing the html link from working correctly (i.e I'm getting " (834)316-90-90" instead of "(834)316-90-90").
Is there something I can change directly on my regex? Or is there a way to apply a .replace(' ', ''); only to the occurrences?
Thanks!

You can use a function for replacement, allowing you to customise the replacement string.
var inputText = `Fidel Velazquez (834)316-90-90
Libertad (834) 316-2930
`;
var regex = new RegExp(
"\\+?\\(?\\d*\\)? ?\\(?\\d+\\)?\\d*([\\s./-]?\\d{2,})+",
"g"
);
var output = inputText.replace(regex, function(m) {
var match = m.replace(/ /g, '');
return `${m}`;
});
console.log(output);
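As a fuller sketch of the replacement-function approach: the callback below splits off any leading whitespace from the match and wraps only the number itself in a link. The tel: href format, the digits-only dial string, and the names linkifyPhones and PHONE_PATTERN are assumptions for illustration; the question does not show the markup it wants.

```javascript
// Same pattern as in the question, written as a regex literal.
const PHONE_PATTERN = /\+?\(?\d*\)? ?\(?\d+\)?\d*([\s./-]?\d{2,})+/g;

function linkifyPhones(text) {
  return text.replace(PHONE_PATTERN, (m) => {
    const number = m.trimStart();                          // match without leading whitespace
    const leading = m.slice(0, m.length - number.length);  // the whitespace that was stripped
    const dialable = number.replace(/[^\d+]/g, '');        // keep digits and '+' (assumed href format)
    return `${leading}<a href="tel:${dialable}">${number}</a>`;
  });
}

console.log(linkifyPhones('Fidel Velazquez (834)316-90-90'));
// Fidel Velazquez <a href="tel:8343169090">(834)316-90-90</a>
```

Keeping the stripped whitespace outside the link means the surrounding text is unchanged apart from the added markup.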
Or you can construct a more careful regular expression, taking the original apart, and putting it together in a different way, using capture groups. This is more restricted, but sufficient for what you described. (Also, new RegExp is not idiomatic, you'd only use it when you have dynamically generated regular expression.)
For your specific trouble, you could even require the space to be matched only when the previous thing matches... Many different solutions, depending exactly what your requirement is. Note that this regular expression will also match some things that do not look like phone numbers:

match css rules in javascript

I need to create a regular expression to find class inside a css file.
For example I have this css file:
#label-blu{
}
.label-blu, .test{
}
.label-blu-not-match{
}
.label-blu{
}
.label-blu span{
}
In this case I need to return 3 match
This is my regular expression:
var css = data;
var find_css = 'label-blu';
var found = css.match(/([#|\.]?)([\w|:|\s|\.]+)/gmi).length;
console.log('found: ' + found);
Inside var data there is all the css string
How can I solve?
Thanks

There are two points:
("word-does-not-include-hyphen").replace(/\w+/g, 'test')
And are you sure you should be matching against the label text 'label-blu' rather than the full css text itself? Currently you are just finding the pieces on either side of the hyphen in label-blu...
var css = 'label-blu';
var found = css.match(/([#|\.]?)([\w|:|\s|\.]+)/gmi);
/// which gives ['label','blu']
Which is the reason for the returned length of two, rather than three. Were you not hoping to match the three items in the css text i.e
#label-blu
.label-blu-not-match
.label-blu
If so you will need to use a different text to match, the entire css, rather than just the string 'label-blu'.
However if you are trying to match:
#label-blu
.label-blu, .test
.label-blu
.label-blu span
Then you will need a different RegExp and the entire css string. Just need clarification on which route you need?
update
It's still not clear exactly out of the css text what you wish to match, this is the reason why I have outlined exactly. However, on the assumption you want to match the last four items I mention (and assuming you don't wish to match label-blu-not-match) then the following should help:
http://jsfiddle.net/5d7JX/
var found = csstext.match(/[#\.]label-blu([,:\s\.][^\{]*)?\{/gmi);
However the above is not foolproof for all possible css formats, nor does it protect against matches within the css rule-sets themselves. Generally speaking, using only Regular Expressions to scan through code that is usually quite complicated to parse is frowned upon, unless you are solving a very specific use-case.
update 2
Yes excluding the ID selectors just involves removing the # part of the Reg Exp...
var found = csstext.match(/\.label-blu([,:\s\.][^\{]*)?\{/gmi);
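As a quick check, both patterns can be run against the sample stylesheet from the question. The sampleCss variable here is just the question's CSS joined into one string.

```javascript
const sampleCss = [
  '#label-blu{', '}',
  '.label-blu, .test{', '}',
  '.label-blu-not-match{', '}',
  '.label-blu{', '}',
  '.label-blu span{', '}'
].join('\n');

// ID and class selectors
const withIds = sampleCss.match(/[#\.]label-blu([,:\s\.][^\{]*)?\{/gmi) || [];
// Class selectors only
const classesOnly = sampleCss.match(/\.label-blu([,:\s\.][^\{]*)?\{/gmi) || [];

console.log(withIds.length, classesOnly.length); // 4 3
```

In both cases .label-blu-not-match is skipped, because a hyphen can neither start the optional group nor stand in for the opening brace.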
I recommend that you read up on your regular expressions, this site is a good place:
http://www.regular-expressions.info/
update 3
To include a variable as part of a regular expression you will need to make sure you escape its characters so the string is treated literally and any special characters won't interfere. As far as I'm aware there isn't a built-in function to escape or quote for regular expressions in JavaScript; however you can find one here:
How to escape regular expression in javascript?
So if you add this to your code:
RegExp.quote = function(str) {
return (str+'').replace(/([.?*+^$[\]\\(){}|-])/g, "\\$1");
};
You then also need to convert your regexp to the object equivalent:
var reg = new RegExp('\\.label-blu([,:\\s\\.][^\\{]*)?\\{', 'gmi');
var found = csstext.match(reg);
And then add this:
var label = 'label-blu';
var reg = new RegExp('\\.' + RegExp.quote(label) + '([,:\\s\\.][^\\{]*)?\\{', 'gmi');
var found = csstext.match(reg);
http://jsfiddle.net/5d7JX/1/
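The same steps can also be wrapped in a standalone function instead of adding a property to RegExp. This is only a sketch: escapeRegExp and countClassRules are made-up names, and the pattern is the one from the answer above.

```javascript
// Escape every regex metacharacter so the class name is matched literally.
function escapeRegExp(str) {
  return String(str).replace(/[.*+?^${}()|[\]\\-]/g, '\\$&');
}

// Count the rule headers in which `.className` appears as a whole class name.
function countClassRules(cssText, className) {
  const pattern = new RegExp(
    '\\.' + escapeRegExp(className) + '([,:\\s\\.][^\\{]*)?\\{',
    'gmi'
  );
  return (cssText.match(pattern) || []).length;
}

const exampleCss = '#label-blu{\n}\n.label-blu, .test{\n}\n' +
  '.label-blu-not-match{\n}\n.label-blu{\n}\n.label-blu span{\n}';
console.log(countClassRules(exampleCss, 'label-blu')); // 3
```

Falling back to an empty array avoids the TypeError that .length would otherwise throw when match returns null.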

In your example, if you use:
var findClass = /(\.label-blu)(?!-)+/g;
var found = css.match(findClass).length;
it should return 3.
A possibly better solution is:
var findClass = /(\.label-blu)[\s{,]+/g;
var found = css.match(findClass).length;
This covers the case where something other than '-' is appended to the class name you are looking for: it only matches the class when it is followed by whitespace, a '{' or a ','.
Let me know if you have any questions.

Use Greasemonkey to add bold tags to dates on a page?

I have a Greasemonkey script that prints a div -- works! However, I'd like to be able to add bold tags to all dates in this div.
Dates are formatted MM/DD/YYYY
So something like:
var regex = '\d{2}\/\d{2}\/\d{4}';
Then how would I perform the search replace? If the div was called loanTable:
Non-working concept:
$("#loanTable").html().replace( regex, "<b>" regex "</b>" )
Something like the above should work but I'm not sure of the exact syntax for this.

Use a regex capture group:
var loanTable = $("#loanTable")
var loanHTML = loanTable.html ().replace (/(\d{2}\/\d{2}\/\d{4})/g, "<b>$1</b>");
loanTable.html (loanHTML);

This piece of code is not valid JS:
var regex = '\d{2}\/\d{2}\/\d{4}';
$("#loanTable").html().replace( regex, "<b>" regex "</b>" )
The syntax for regex is /regex/, non quoted, or new RegExp('regex') with quotes.
Start by assigning the html to a variable. Also <b> is barely used anymore, <strong> is the new standard. Then, replace() takes a regex and a string or function as parameters. To replace multiple times you have to use the g flag. Finally, to do what you want to accomplish you can use replacement tokens, like $1 etc...
var re = /\d{2}\/\d{2}\/\d{4}/g; // 'g' flag for 'global';
var html = $("#loanTable").html();
$("#loanTable").html(html.replace(re, '<strong>$&</strong>')); // The `$&` token returns the whole match
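The replacement step itself can be tried without jQuery or a page. Below is a small sketch on a plain string, where boldDates is a hypothetical helper name.

```javascript
// Wrap every MM/DD/YYYY date in <strong> tags; `$&` inserts the whole match.
function boldDates(html) {
  return html.replace(/\d{2}\/\d{2}\/\d{4}/g, '<strong>$&</strong>');
}

console.log(boldDates('Due 01/15/2020, renewed 12/31/2021'));
// Due <strong>01/15/2020</strong>, renewed <strong>12/31/2021</strong>
```

Without the g flag only the first date would be wrapped.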

Last time I used GreaseMonkey, it wasn't easy to get jQuery to run in your user scripts.
Use the following code to do it without jQuery:
var loanTable = document.getElementById('loanTable');
loanTable.innerHTML = loanTable.innerHTML.replace(/(\d{1,2}\/\d{1,2}\/\d{4})/g, "<b>$1</b>");

One small aspect of this: you need to concatenate strings with a + operator:
$("#loanTable").html().replace( regex, "<b>" + regex + "</b>" )
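Concatenation fixes the syntax error, but as far as I can tell the result still would not bold anything, because the pattern is a plain string. A sketch of what happens, with variable names taken from the question:

```javascript
// In a string literal, \d and \/ are not regex escapes: the backslashes are dropped.
var regex = '\d{2}\/\d{2}\/\d{4}';
console.log(regex); // d{2}/d{2}/d{4}

var sample = 'Due 01/15/2020';

// A string pattern is searched for literally, so nothing is replaced here.
var unchanged = sample.replace(regex, '<b>' + regex + '</b>');

// A regex literal with a replacement token does what was intended.
var bolded = sample.replace(/\d{2}\/\d{2}\/\d{4}/g, '<b>$&</b>');

console.log(unchanged); // Due 01/15/2020
console.log(bolded);    // Due <b>01/15/2020</b>
```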

jquery / javascript: regex to replace instances of an html tag

I'm trying to take some parsed XML data, search it for instances of the <font> tag, and replace that tag (and anything that may be inside the font tag) with a simple <span> tag.
This is how I've been doing my regexes:
var emailReg = /^([\w-\.]+@([\w-]+\.)+[\w-]{2,4})?$/; //Test against valid email
console.log('regex: ' + emailReg.test(testString));
and so I figured the font regex would be something like this:
var fontReg = /'<'+font+'[^><]*>|<.'+font+'[^><]*>','g'/;
console.log('regex: ' + fontReg.test(testString));
but that isn't working. Anyone know a way to do this? Or what I might be doing wrong in the code above?

I think namuol's answer will serve you better than any RegExp-based solution, but I also think the RegExp deserves some explanation.
JavaScript doesn't allow for interpolation of variable values in RegExp literals.
The quotations become literal character matches and the addition operators become 1-or-more quantifiers. So, your current regex becomes capable of matching these:
# left of the pipe `|`
'<'font'>
'<''''fontttt'>
# right of the pipe `|`
<#'font'>','g'
<#''''fontttttt'>','g'
But, it will not match these:
<font>
</font>
To inject a variable value into a RegExp, you'll need to use the constructor and string concat:
var fontReg = new RegExp('<' + font + '[^><]*>|<.' + font + '[^><]*>', 'g');
On the other hand, if you meant for literal font, then you just needed:
var fontReg = /<font[^><]*>|<.font[^><]*>/g;
Also, each of those can be shortened by using .?, allowing the halves to be combined:
var fontReg = new RegExp('<.?' + font + '[^><]*>', 'g');
var fontReg = /<.?font[^><]*>/g;
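For completeness, here is a sketch of the constructor approach used for an actual replacement on a string. It is a variation on the patterns above, not the same regex: it uses (/?) instead of .? so the optional slash can be carried over to the new tag, and \b so that longer tag names are left alone. replaceTag is a hypothetical helper, and both tag names are assumed to be plain names without regex metacharacters.

```javascript
// Replace opening and closing `fromTag` tags (attributes are dropped) with `toTag`.
function replaceTag(markup, fromTag, toTag) {
  const pattern = new RegExp('<(/?)' + fromTag + '\\b[^<>]*>', 'gi');
  return markup.replace(pattern, '<$1' + toTag + '>');
}

console.log(replaceTag('<font color="red">hi</font>', 'font', 'span'));
// <span>hi</span>
```

This still shares the general limits of regex on markup mentioned in the other answers, so it is best kept to simple, well-formed input.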

If I understand your problem correctly, this should replace all font tags with simple span tags using jQuery:
$('font').replaceWith(function () {
return $('<span>').append($(this).contents());
});
Here's a working fiddle: http://jsfiddle.net/RhLmk/2/
