Regex character count, but some count for three - javascript

I'm trying to build a regular expression that places a limit on the input length, but not all characters count equal in this length. I'll put the rationale at the bottom of the question. As a simple example, let's limit the maximum length to 12 and allow only a and b, but b counts for 3 characters.
Allowed are:
aa (anything less than 12 is fine).
aaaaaaaaaaaa (exactly 12 is fine).
aaabaaab (6 + 2 * 3 = 12, which is fine).
abaaaaab (still 6 + 2 * 3 = 12).
Disallowed is:
aaaaaaaaaaaaa (13 a's).
bbbba (1 + 4 * 3 = 13, which is too much).
baaaaaaab (7 + 2 * 3 = 13, which is too much).
I've made an attempt that gets fairly close:
^(a{0,3}|b){0,4}$
This matches on up to 4 clusters that may consist of 0-3 a's or one b.
However, it fails to match on my last positive example: abaaaaab, because that forces the first cluster to be the single a at the beginning, consumes a second cluster for the b, then leaves only 2 more clusters for the rest, aaaaab, which is too long.
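For reference, the rule itself can be written as a plain function, which makes it easy to see where the attempted pattern goes wrong. This is only a sketch: weightedLength and the sample list are illustrative names, not part of the Qt validator.

```javascript
// Weighted length as described above: 'a' costs 1, 'b' costs 3.
function weightedLength(s) {
  let total = 0;
  for (const ch of s) total += ch === 'b' ? 3 : 1;
  return total;
}

const attempt = /^(a{0,3}|b){0,4}$/;
const samples = ['aa', 'aaaaaaaaaaaa', 'aaabaaab', 'abaaaaab',
                 'aaaaaaaaaaaaa', 'bbbba', 'baaaaaaab'];

// Collect the samples where the attempted regex disagrees with the rule.
const disagreements = samples.filter(
  (s) => attempt.test(s) !== (weightedLength(s) <= 12)
);
console.log(disagreements); // [ 'abaaaaab' ]
```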
Constraints
Must run in JavaScript. This regex is supplied to Qt, which apparently uses JavaScript's syntax.
Doesn't really need to be fast. In the end it'll only be applied to strings of up to 40 characters. I hope it validates within 50ms or so, but slightly slower is acceptable.
Rationale
Why do I need to do this with a regular expression?
It's for a user interface in Qt via PyQt and QML. The user can type a name in a text field here for a profile. This profile name is url-encoded (special characters are replaced by %XX), and then saved on the user's file system. We encounter problems when the user types a lot of special characters, such as Chinese, which then encode to a very long file name. Turns out that at somewhere like 17 characters, this file name becomes too long for some file systems. The URL-encoding encodes as UTF-8, which has up to 4 bytes per character, resulting in up to 12 characters in the file name (as each of these gets percent-encoded).
16 characters is too short for profile names. Even some of our default names exceed that. We need a variable limit based on these special characters.
Qt normally allows you to specify a Validator to determine which values are acceptable in a text box. We tried implementing such a validator, but that resulted in a segfault upstream, due to a bug in PyQt. It can't seem to handle custom Validator implementations at the moment. However, PyQt also exposes three built-in validators. Two apply only to numbers. The third is a regex validator that allows you to put a regular expression that matches all valid strings. Hence the need for this regular expression.

There is no really straightforward way to do this, given the limitations of regexp. You're going to have to test for all combinations; for a limit of 40, that means thirteen b with up to one a, twelve b with up to four a, and so on. We will build a little program to generate these for us. The basic format for testing for up to four a will be
/^(?=([^a]*a){0,4}[^a]*$)/
We'll write a little routine to create these lookaheads for us, given some letter and a minimum and maximum number of occurrences:
function matchLetter(c, m, n) {
return `(?=([^${c}]*${c}){${m},${n}}[^${c}]*$)`;
}
> matchLetter('a', 0, 4)
< "(?=([^a]*a){0,4}[^a]*$)"
We can combine these to test for three b with up to three a:
/^(?=([^b]*b){3}[^b]*$)(?=([^a]*a){0,3}[^a]*$)/
We will write a function to create such combined lookaheads which matches exactly m occurrences of c1 and up to n occurrences of c2:
function matchTwoLetters(c1, m, c2, n) {
return matchLetter(c1, m, m) + matchLetter(c2, 0, n);
}
We can use this to match exactly twelve b and up to four a, for a total of forty or less:
> matchTwoLetters('b', 12, 'a', 4)
< "(?=([^b]*b){12,12}[^b]*$)(?=([^a]*a){0,4}[^a]*$)"
It remains to simply create versions of this for each count of b, and glom them together (for the case of a max count of 12):
function makeRegExp() {
const res = [];
for (let bs = 0; bs <= 4; bs++)
res.push(matchTwoLetters('b', bs, 'a', 12 - bs*3));
return new RegExp(`^(${res.join('|')})`);
}
> makeRegExp()
< "^((?=([^b]*b){0,0}[^b]*$)(?=([^a]*a){0,12}[^a]*$)|(?=([^b]*b){1,1}[^b]*$)(?=([^a]*a){0,9}[^a]*$)|(?=([^b]*b){2,2}[^b]*$)(?=([^a]*a){0,6}[^a]*$)|(?=([^b]*b){3,3}[^b]*$)(?=([^a]*a){0,3}[^a]*$)|(?=([^b]*b){4,4}[^b]*$)(?=([^a]*a){0,0}[^a]*$))"
Now you can do the test with
makeRegExp().test("baabaaa");
For the case of length=40, the regexp is 679 characters long. A very rough benchmark shows that it executes in under a microsecond.
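As a rough check, the generator can be parameterised and run against the examples from the question. The name makeWeightedRegExp and its max and weight parameters are assumptions made for this sketch; the lookahead construction is the one shown above.

```javascript
// Lookahead: between m and n occurrences of c in the whole string.
function matchLetter(c, m, n) {
  return `(?=([^${c}]*${c}){${m},${n}}[^${c}]*$)`;
}

// Hypothetical generalisation: `max` is the weighted limit, each b costs `weight`.
function makeWeightedRegExp(max, weight) {
  const alternatives = [];
  for (let bs = 0; bs * weight <= max; bs++) {
    // Exactly `bs` b's, and at most as many a's as the remaining budget allows.
    alternatives.push(
      matchLetter('b', bs, bs) + matchLetter('a', 0, max - bs * weight)
    );
  }
  return new RegExp(`^(${alternatives.join('|')})`);
}

const limit12 = makeWeightedRegExp(12, 3);
const allowedResults = ['aa', 'aaaaaaaaaaaa', 'aaabaaab', 'abaaaaab']
  .map((s) => limit12.test(s));
const disallowedResults = ['aaaaaaaaaaaaa', 'bbbba', 'baaaaaaab']
  .map((s) => limit12.test(s));
console.log(allowedResults);    // [ true, true, true, true ]
console.log(disallowedResults); // [ false, false, false ]
```

Note that the lookaheads only count a and b; if other characters must be rejected outright, that needs a separate check.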

If you want to count bytes when multibyte encoding is present, you can use this function:
function bytesLength(str) {
var s = str.length;
for (var i = s-1; i > -1; i--) {
var code = str.charCodeAt(i);
if (code > 0x7f && code <= 0x7ff) {s++;}
else if (code > 0x7ff && code <= 0xffff) {s+=2;}
if (code >= 0xDC00 && code <= 0xDFFF) {i--;}
}
return s;
}
console.log(bytesLength('敗')); // length 3
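Since the profile name in the question is percent-encoded before it becomes a file name, it may be simpler to measure the encoded string directly. The sketch below assumes an environment with TextEncoder and encodeURIComponent (modern browsers and Node.js); the function names are illustrative.

```javascript
// UTF-8 byte count, using the built-in encoder instead of manual code-unit checks.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

// Length of the percent-encoded form, which is what ends up in the file name.
// Note: encodeURIComponent throws a URIError on lone surrogates.
function encodedLength(str) {
  return encodeURIComponent(str).length;
}

console.log(utf8ByteLength('敗'));        // 3
console.log(encodeURIComponent('敗'));    // %E6%95%97
console.log(encodedLength('敗'));         // 9
```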

Try using something like this:
^((a{1,3}|b){1,4}|(a{1,4}|a?b|ba){1,3}|((a{2,3}|b){2}|aaba|abaa){2})$
Example: https://regex101.com/r/yTTiEX/6
This breaks it up into the logical possibilities:
4 parts, each with a value up to 3.
3 parts, each with a value up to 4.
2 parts, each with a value up to 6.
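A hand-built alternation like this is hard to verify by eye, so one option is to compare it exhaustively against the counting rule for every short string over a and b. The helper names below (allStrings, findMismatches) are made up for this sketch.

```javascript
// Yield every string over {a, b} with length 0..maxChars.
function* allStrings(maxChars) {
  let level = [''];
  for (let len = 0; len <= maxChars; len++) {
    yield* level;
    level = level.flatMap((s) => [s + 'a', s + 'b']);
  }
}

// Return the strings on which `re` disagrees with the rule "cost <= limit".
function findMismatches(re, limit, maxChars) {
  const mismatches = [];
  for (const s of allStrings(maxChars)) {
    // Each character costs 1, and each b costs 2 more (3 in total).
    const cost = s.length + 2 * (s.split('b').length - 1);
    if (re.test(s) !== (cost <= limit)) mismatches.push(s);
  }
  return mismatches;
}

const candidate =
  /^((a{1,3}|b){1,4}|(a{1,4}|a?b|ba){1,3}|((a{2,3}|b){2}|aaba|abaa){2})$/;
console.log(findMismatches(candidate, 12, 13));
```

Because every alternative in the pattern requires at least one character, the empty string is reported as a mismatch; whether that matters depends on whether an empty name should be valid.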

Related

JavaScript Regex not matching mobile number with international code

I'm trying to validate a mobile number, 254777123456, against the regex /^((254|255)[0-9]+){9,15}$/. The mobile number should be prefixed with one of the specified country codes, and its total length should be no more than 15 characters. Doing this in JavaScript I get null. Can anyone point out what I'm doing wrong?
PS. I'm using many more country codes than the two shown; I just put those two in as a test before adding the others, since they will all be separated by pipes.
Your regex ^((254|255)[0-9]+){9,15}$ means: match at least 4 digits (the first 3 being either 254 or 255), and repeat that whole group at least 9 and at most 15 times. The shortest string that can match is therefore 36 characters long, which is obviously not what you want. The fix is to take the [0-9] part out of the repeated group and give it its own {9,12} quantifier. The corrected regex is:
^(?:(?:254|255)[0-9]{9,12})$
This regex matches 254 or 255 once and then restricts the remaining digits to between 9 and 12 (you want a maximum total length of 15, and 3 digits are already taken by the country code).
Demo
var nums = ['254777123456','255777123456','255777123456123','2557771234561231']
for (const n of nums) {
console.log(n + " --> " + /^(?:(?:254|255)[0-9]{9,12})$/.test(n));
}
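If the real list contains country codes of different lengths, a fixed {9,12} quantifier no longer bounds the total length correctly. One possible approach, sketched here with a hypothetical helper and an illustrative code list, is to check the total length in a lookahead and the prefix separately.

```javascript
// Hypothetical helper: `codes` are digit-only country codes of any length.
function buildMobileRegExp(codes, minTotal, maxTotal) {
  // The lookahead bounds the total digit count; the rest checks the prefix.
  return new RegExp(
    `^(?=[0-9]{${minTotal},${maxTotal}}$)(?:${codes.join('|')})[0-9]+$`
  );
}

const mobile = buildMobileRegExp(['254', '255', '1'], 12, 15);
console.log(mobile.test('254777123456'));     // true  (12 digits, prefix 254)
console.log(mobile.test('120255501234'));     // true  (12 digits, prefix 1)
console.log(mobile.test('2557771234561231')); // false (16 digits)
console.log(mobile.test('256777123456'));     // false (prefix not in the list)
```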

Javascript - Validation on input - Can I check if character x is number and character y is letter

I have a SAPUI5 project and have an input which I'm wanting to do some validation on.
It is inputting an 8 character length entry with a mixture of numbers and letters. I have the input max length at 8 characters already.
What I need is to create a function to only allow entry of either a number or a letter at certain characters in the entry (see below).
1 - Letter
2 - Letter
3 - Number
4 - Letter
5 - Number
6 - Number
7 - Number
8 - Number
eg, BP1A8123
Is there a way to do that in JavaScript? I've come across examples of letters only or number only but can't find an example of number / letters at certain characters within the entry.
Any direction would be appreciated.
From a JavaScript perspective, you could do the check like this:
const pattern = RegExp('^[A-Z]{2}[0-9]{1}[A-Z]{1}[0-9]{4}$')
pattern.test('BP1A8123') // true
pattern.test('B1111111') // false
From a SAPUI5 perspective, input validation is generally handled via the data binding. See, for example, this SAPUI5 Demokit Sample. Also see the validationError and validationSuccess events of class sap.ui.core.Core.
However, for your specific requirement, the sap.m.MaskInput control may be what you're after. See this Demokit example.
So we have the template:
template = ['letter', 'letter', 'number', 'letter',
'number', 'number', 'number', 'number']
... and some input:
input = 'BP1A8123'
We can then use isNaN() to decide whether each character is a letter or a number, and compare that with the template:
input.split('').every((c, i) => template[i] == (isNaN(c) ? 'letter' : 'number'))
which gives true in this instance.
You can split the string using .split('') (see here), then check whether each character is a number using isNaN (see here).
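The template idea and the regex idea can also be combined by generating the pattern from the template. This is a sketch only; templateToRegExp and the CLASSES table are illustrative names, and uppercase-only letters are an assumption taken from the BP1A8123 example.

```javascript
// Character class for each kind of position (uppercase letters assumed).
const CLASSES = { letter: '[A-Z]', number: '[0-9]' };

// Build an anchored regex with one character class per template entry.
function templateToRegExp(template) {
  return new RegExp('^' + template.map((kind) => CLASSES[kind]).join('') + '$');
}

const entryPattern = templateToRegExp([
  'letter', 'letter', 'number', 'letter',
  'number', 'number', 'number', 'number',
]);

console.log(entryPattern.test('BP1A8123')); // true
console.log(entryPattern.test('B1111111')); // false
console.log(entryPattern.test('BPXA8123')); // false
```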

Numeric Regex Expression for Detection Using Javascript

I am completely new to regex hence the long question.
I would like to know about the regex expression codes to detect different types of numbers in a html paragraph tag.
Integer number (eg: 0 , 1,000 , 1000 , 028, -1 , etc)
Floating number (eg: 2.3 , 2.13 , 0.18 , .18 , -1.2 , etc)
or regex that can combine both 1. & 2. -- all integer and float number together will be so good! I tried some solution in Stackoverflow but the results are always undefined/null, else not detectable already
Ratio (eg: 1:3:4 detect as a whole if possible)
Fractional number (eg: 0/485 , 1/1006 , 2b/3 , etc)
Percentage number (eg: 15.5% , (15.5%) , 15% , 0.9%, .9%)
Also, would like to know if regex can detect symbols and numbers together in a whole (15.5% , 1:3:4), or must they be split into different parts before the detection of number can be performed (eg: 15.5 + % , 1 + : + 3 + : + 4 ) ?
These different expressions are meant to be written into Javascript code as different exceptions of cases later on. The expressions are planned to be used like the regex that detects basic integer in attached Javascript snippet below:
var paragraphText = document.getElementById("detect").innerHTML;
var allNumbers = paragraphText.match( /\d+/g ) + '';
var numbersArray = allNumbers.split(',');
for (var i = 0; i < numbersArray.length; i++) {
//console.log(numbersArray[i]);
numbersArray[i] = "<span>" + numbersArray[i] + "</span>";
console.log(numbersArray[i]);
}
Thank you very much for your help!
The following are simple implementations:
'2,13.00'.match(/[.,\d]+/g) // 1 & 2
'1:3:4'.match(/[:\d]+/g) // 3
'0/485'.match(/[\/\d]+/g) // 4
'15.5%'.match(/[.%\d]+/g) // 5
You can loop through them using for statement, and check if one is detected and break, or continue otherwise.
For decimals numbers:
-> ((?:\d+|)(?:\.|)(?:\d+))
For percentage numbers : It is the same as decimal numbers followed by % symbol
-> ((?:\d+|)(?:\.|)(?:\d+))%
For whole numbers: the following regex would work and would exclude any decimal numbers as well, returning you just the integers
-> (^|[^\d.])\b\d+\b(?!\.\d)
For the ratio requirement, I have created a more complicated one, but it gives you the entire ratio as a whole.
-> (((?:\d+|)(?:\.|)(?:\d+)):)*((?:\d+|)(?:\.|)(?:\d+))
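To answer the "detect as a whole" part, one possible approach is a single global regex with named groups, ordered so that the more specific forms (ratio, fraction, percentage) are tried before plain numbers. The patterns and the classifyNumbers name are assumptions for this sketch; they do not cover forms like 2b/3 or parenthesised percentages, and they need an engine with named groups and matchAll.

```javascript
// Signed integer or float, with optional thousands separators, or a bare ".18".
const NUMBER = String.raw`-?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?|-?\.\d+`;

const NUMERIC = new RegExp(
  String.raw`(?<ratio>\d+(?::\d+)+)` +
  String.raw`|(?<fraction>\d+\/\d+)` +
  String.raw`|(?<percent>(?:${NUMBER})%)` +
  String.raw`|(?<number>${NUMBER})`,
  'g'
);

// Return [kind, text] pairs for every numeric token found in `text`.
function classifyNumbers(text) {
  return [...text.matchAll(NUMERIC)].map((m) =>
    Object.entries(m.groups).find(([, value]) => value !== undefined)
  );
}

console.log(classifyNumbers(
  'Ratio 1:3:4, growth 15.5%, total 1,000 items, delta -1.2 and .18, odds 1/1006.'
));
```

Each token is matched whole, so 15.5% and 1:3:4 do not need to be split into parts first.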

What does Math.random() do in this JavaScript snippet?

I'm watching this Google I/O presentation from 2011 https://www.youtube.com/watch?v=M3uWx-fhjUc
At minute 39:31, Michael shows the output of the closure compiler, which looks like the code included below.
My question is what exactly is this code doing (how and why)
// Question #1 - floor & random? 2147483648?
Math.floor(Math.random() * 2147483648).toString(36);
var b = /&/g,
c = /</g,d=/>/g,
e = /\"/g,
f = /[&<>\"]/;
// Question #2 - sanitizing input, I get it...
// but f.test(a) && ([replaces]) ?
function g(a) {
a = String(a);
f.test(a) && (
a.indexOf("&") != -1 && (a = a.replace(b, "&amp;")),
a.indexOf("<") != -1 && (a = a.replace(c, "&lt;")),
a.indexOf(">") != -1 && (a = a.replace(d, "&gt;")),
a.indexOf('"') != -1 && (a = a.replace(e, "&quot;"))
);
return a;
};
// Question #3 - void 0 ???
var h = document.getElementById("submit-button"),
i,
j = {
label: void 0,
a: void 0
};
i = '<button title="' + g(j.a) + '"><span>' + g(j.label) + "</span></button>";
h.innerHTML = i;
Edit
Thanks for the insightful answers. I'm still really curious about the reason why the compiler threw in that random string generation at the top of the script. Surely there must be a good reason for it. Anyone???
1) This code is pulled from Closure Library. It simply creates a random string. In later versions it has been replaced by code that creates a large random integer, which is then concatenated to a string:
'closure_uid_' + ((Math.random() * 1e9) >>> 0)
This simplified version is easier for the Closure Compiler to remove so you won't see it leftover like it was previously. Specifically, the Compiler assumes "toString" with no arguments does not cause visible state changes. It doesn't make the same assumption about toString calls with parameters, however. You can read more about the compiler assumptions here:
https://code.google.com/p/closure-compiler/wiki/CompilerAssumptions
2) At some point, someone determined it was faster to test for the characters that might need to be replaced before making the "replace" calls on the assumption most strings don't need to be escaped.
3) As others have stated the void operator always returns undefined, and "void 0" is simply a reasonable way to write "undefined". It is pretty useless in normal usage.
1) I have no idea what the point of number 1 is.
2) Looks to make sure that any symbols are properly converted into their corresponding HTML entities , so yes basically sanitizing the input to make sure it is HTML safe
3) void 0 is essentially a REALLY safe way to make sure it returns undefined . Since the actual undefined keyword in javascript is mutable (i.e. can be set to something else), it's not always safe to assume undefined is actually equal to an undefined value you expect.
When in doubt, check other bases.
2147483648 (base 10) = 0x80000000 (base 16). So it's just making a random number which is within the range of a 32-bit signed int. floor is converting it to an actual int, then toString(36) is converting it to a 36-character alphabet, which is 0-9 (10 characters) plus a-z (26 characters).
The end-result of that first line is a string of random numbers and letters. There will be 6 of them (36^6 = 2176782336), but the first one won't be quite as random as the others (won't be late in the alphabet). Edit: Adrian has worked this out properly in his answer; the first letter can be any of the 36 characters, but is slightly less likely to be Z. The other letters have a small bias towards lower values.
For question 2, if you mean this a = String(a); then yes, it is ensuring that a is a string. This is also a hint to the compiler so that it can make better optimisations if it's able to convert it to machine code (I don't know if they can for strings though).
Edit: OK you clarified the question. f.test(a) && (...) is a common trick which uses short-circuit evaluation. It's effectively saying if(f.test(a)){...}. Don't use it like that in real code because it makes it less readable (although in some cases it is more readable). If you're wondering about test, it's to do with regular expressions.
For question 3, it's new to me too! But see here: What does `void 0` mean? (quick google search. Turns out it's interesting, but weird)
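For comparison, here is roughly what the minified g function looks like when the short-circuit chain is written as an ordinary if statement. This is a hand-written sketch, not the original Closure Library source; escapeHtml and NEEDS_ESCAPE are illustrative names.

```javascript
// Cheap pre-check: most strings contain none of these characters.
const NEEDS_ESCAPE = /[&<>"]/;

function escapeHtml(value) {
  let s = String(value);
  // Same meaning as `f.test(a) && (...)`: only do the replacements when needed.
  if (NEEDS_ESCAPE.test(s)) {
    s = s
      .replace(/&/g, '&amp;') // must run first, or later entities get re-escaped
      .replace(/</g, '&lt;')
      .replace(/>/g, '&gt;')
      .replace(/"/g, '&quot;');
  }
  return s;
}

console.log(escapeHtml('<a href="x">&</a>'));
console.log(void 0 === undefined);  // true
console.log(escapeHtml(void 0));    // "undefined", because String(undefined) is used
```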
There's a number of different questions rolled into one, but considering the question title I'll just focus on the first here:
Math.floor(Math.random() * 2147483648).toString(36);
In actual fact, this doesn't do anything - as the value is discarded rather than assigned. However, the idea of this is to generate a number between 0 and 2 ^ 31 - 1 and return it in base 36.
Math.random() returns a number from 0 (inclusive) to 1 (exclusive). It is then multiplied by 2^31 to produce the range mentioned. The .toString(36) then converts it to base 36, represented by 0 to 9 followed by a to z.
The end result ranges from 0 to zik0zj (2^31 - 1 in base 36).
As to why it's there in the first place ... well, examine the slide. This line appears right at the top. Although this is pure conjecture, I actually suspect that the code was cropped down to what's visible, and there was something immediately above it that this was assigned to.
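The bounds are easy to check directly. In this small sketch, randomBase36 is just an illustrative wrapper around the line from the question.

```javascript
const UPPER = 2147483648; // 2^31, i.e. 0x80000000

// Same expression as in the question, but with the result returned.
function randomBase36() {
  return Math.floor(Math.random() * UPPER).toString(36);
}

console.log((0).toString(36));         // "0"
console.log((UPPER - 1).toString(36)); // "zik0zj", the largest possible result
console.log(randomBase36());           // 1 to 6 characters from 0-9 and a-z
```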

Flag bit computation and detection

In some code I'm working on I should take care of ten independent parameters which can take one of two values (0 or 1). This creates 2^10 distinct conditions. Some of the conditions never occur and can be left out, but those which do occur are still A LOT and making a switch to handle all cases is insane.
I want to use 10 if statements instead of a huge switch. For this I know I should use flag bits, or rather flag bytes, as the language is JavaScript and it's easier to work with a 10-byte string to represent a 10-bit binary number.
Now, my problem is, I don't know how to implement this. I have seen this used in APIs where multiple-selectable options are exposed with the numbers 1, 2, 4, 8, ..., 2^(n-1), which are the decimal equivalents of 1, 10, 100, 1000, etc. in binary. So if we make a call like bar = foo(7), bar will be an object with whatever options the three rightmost flags enable.
I can convert the decimal number into binary and in each if statement check to see if the corresponding digit is set or not. But I wonder, is there a way to determine whether the n-th digit of a decimal number is zero or one in binary form, without actually doing the conversion?
Just use a bitwise-and. In C/C++, this would be:
if (flags & 1) {
// Bit zero is set.
}
if (flags & 2) {
// Bit one is set.
}
if (flags & 4) {
// Bit two is set.
}
...
For production goodness, use symbolic names for the flag masks instead of the magic numbers, 1, 2, 4, 8, etc.
If the flags are homogeneous in some way (e.g., they represent ten spatial dimensions in some geometry problem) and the code to handle each case is the same, you can use a loop:
for (int f = 0; f < 10; ++f) {
if (flags & (1 << f)) {
// Bit f is set.
}
}
You can use a bitwise and:
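10 & 2^1 is true because 10 = 1010b (bit 1 is 1)
8 & 2^1 is false because 8 = 1000b (bit 1 is 0)
10 & 2^3 is true because 10 = 1010b (bit 3 is 1)
Here 2^n means 2 to the power n; in JavaScript, write 1 << n, because ^ is the XOR operator.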
You could get a number that has the n-th bit set and AND it with your number. If the result is zero your number didn't have the bit set. Otherwise, it did. Look here, also.
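In JavaScript the same idea might look like the sketch below. The Flag names are placeholders for whatever the ten parameters actually mean, and isBitSet is an illustrative helper.

```javascript
// Symbolic names for the masks instead of magic numbers (placeholder names).
const Flag = Object.freeze({
  FIRST: 1 << 0,  // 1
  SECOND: 1 << 1, // 2
  THIRD: 1 << 2,  // 4
  FOURTH: 1 << 3, // 8
});

// True when bit n (counting from 0 at the right) is set in `flags`.
function isBitSet(flags, n) {
  return (flags & (1 << n)) !== 0;
}

const flags = Flag.SECOND | Flag.FOURTH; // 10, i.e. 1010 in binary

console.log(isBitSet(flags, 1)); // true
console.log(isBitSet(8, 1));     // false
console.log(isBitSet(flags, 3)); // true

// A string of 0s and 1s can be turned into such a number with parseInt.
console.log(parseInt('1010', 2)); // 10
```

No conversion to a binary string is needed for the test itself; the bitwise AND works on the number directly.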
