Language Code Stripping Regex

I feel like I'm pretty close to the solution here, but I can't quite seem to figure it out. My goal is to take the set of strings one at a time, [ 'en', 'en-us', 'en_us', 'zh-hans-TW' ] and produce [ 'en', 'en', 'en', 'zh-hans' ]. I've tried a few different things, but don't have quite the right solution.
This is the closest I've come, I believe, matching all but 'en'.
/([a-zA-Z-_]+)[-_].+/
(One or more of aA-zZ chars or -_ followed by - or _ and additional chars)
I tried negative lookahead (which I'm not real good at), and came up with this which over matches and captures the whole string
/([a-zA-Z-_]+)(?![-_].+)/
(One or more aA-zZ chars or -_ not followed by - or _ with additional characters)
Could someone point out the right solution here?

Instead of matching the portions of the strings you wish to keep, you could remove the ends of the strings you do not want to keep:
/[-_][a-z]+$/i
Here is an implementation in Javascript:
var array1 = [ 'en', 'en-us', 'en_us', 'zh-hans-TW' ];
var array2 = array1.map(function(str) {
    return str.replace(/[-_][a-z]+$/i, "");
});
console.log(array2);
This outputs:
[ 'en', 'en', 'en', 'zh-hans' ]

You should try to be more general. For instance, de-DE-u-co-phonebk is also a valid language code (the stuff starting with -u... represents Unicode options for collation order etc.). I'm assuming you want to strip off everything starting from the country code, which by the standard is supposed to be uppercase. If you want to do this with a regexp, then
function strip_country_code(lang) { return lang.replace(/[-_][A-Z][A-Z].*$/, ''); }
Of course, this will fail on en-us, which is invalid; it should be en-US. You have to decide if and how to handle invalid language codes such as this.
That's just one reason that you would be better off using available libraries to process language codes if possible. Take a look at the JS internationalization API, which has several ways to parse locale codes and find the "best" one. However, browser support is limited. So you might want to look for something off the shelf. But I can't put my finger on anything at the moment.
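As a rough illustration of the library route: engines that ship Intl.Locale (it postdates this answer, so treat its availability as an assumption about your runtime) can parse the tag for you. The helper name stripRegion is made up for this sketch, and the result uses canonical casing (zh-Hans rather than zh-hans).

```javascript
// Sketch: keep language (and script, if present), drop region and extensions.
// Assumes a runtime with Intl.Locale; stripRegion is a hypothetical helper name.
function stripRegion(tag) {
  // Intl.Locale only accepts hyphen-separated tags, so normalise underscores first.
  const locale = new Intl.Locale(tag.replace(/_/g, '-'));
  return locale.script
    ? `${locale.language}-${locale.script}`
    : locale.language;
}

const tags = ['en', 'en-us', 'en_us', 'zh-hans-TW', 'de-DE-u-co-phonebk'];
console.log(tags.map(stripRegion));
```

A malformed tag makes the Intl.Locale constructor throw a RangeError, so real code would probably wrap the call in try/catch and decide what to do with invalid input.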
The JED library uses the following regexp to extract segments:
str.match(/[a-z]+/gi)
but then assumes that the second segment if present is always the country, so this logic would fail on zh-hans-TW.
You should also consider who is going to be consuming the result of your string manipulation. Are you saying that there is some library, or API, that can only handle the part of the locale string preceding the country code? You should ensure that that is in fact the case. For instance, I believe that moment.js also will handle different locale strings properly.

Related

Parse JS code for comments

I have a small NodeJS program that I use to extract code comments from files I point it to. It mostly works, but I'm having some issues dealing with it misinterpreting certain JS strings (glob patterns) as code comments.
I'm using the regex [^:](\/\/.+)|(\/\*[\W\w\n\r]+?\*\/) to parse the following test file:
function DoStuff() {
    /* This contains the value of foo.
    Foo is used to display "foo"
    via http://stackoverflow.com
    */
    this.foo = "http://google.com";
    this.protocolAgnosticUrl = "//cdnjs.cloudflare.com/ajax/libs/jquery/3.2.1/core.js";
    //Show a message about foo
    alert(this.foo);
    /// This is a triple-slash comment!
    const globPatterns = [
        'path/to/**/*.tests.js',
        '!my-file.js',
        "!**/folder/*",
        'another/path/**/*.tests.js'
    ];
}
Here's a live demo to help visualize what is and is not properly captured by the regex: https://regex101.com/r/EwYpQl/1
I need to be able to only locate the actual code comments here, and not the comment-like syntax that can sometimes appear within strings.

I have to agree with the comments that for most cases it is better to use a parser, even when a RegExp can do the job for a specific and well defined use case.
The problem is not that you can't make it work for that very specific use case, even though there are probably plenty of edge cases that you don't care about (and don't have to) but that may break the solution. The actual problem is that if you start building around a sub-optimal solution and your requirements evolve over time, you will start patching those cases as they appear. Someday you may find yourself with an extensive codebase full of patches that doesn't scale anymore, and the only solution will probably be to start from scratch.
Anyway, you have been warned by a few of us, and it is still possible that your use case really is that simple and will not change in the future. I would still consider moving from RegExp to a parser at some point, but maybe you can use this in the meantime:
(^ +\/\/(.*))|(["'`]+.*["'`]+.*\/\/(.*))|(["'`]+.*["'`]+.*\/\*([\W\w\n\r]+?)\*\/)|(^ +\/\*([\W\w\n\r]+?)\*\/)
Just in case, I have added a few other cases, such as comments that come straight after some valid code.
Edit to prove the first point and what is being said in the comments:
I have just answered this with the previous RegExp that was solving just the issue that you pointed out in your question (your RegExp was misinterpreting strings containing glob patterns as code comments).
So, I fixed that, and I even made it able to match comments that start on the same line as a valid (non-commented) statement. Just a moment after posting, I noticed that this last feature only works if that statement contains a string.
This is the updated version, but please, keep in mind that this is exactly what we are warning you about...:
(^[^"'`\n]+\/\/(.*))|(["'`]+.*["'`]+.*\/\/(.*))|(["'`]+.*["'`]+.*\/\*([\W\w\n\r]+?)\*\/)|(^[^"'`\n]+\/\*([\W\w\n\r]+?)\*\/)
How does it work?
There are 4 main groups that compose the whole RegExp, the first two for single-line comments and the next two for multi-line comments:
(^[^"'`\n]+\/\/(.*))
(["'`]+.*["'`]+.*\/\/(.*))
(["'`]+.*["'`]+.*\/\*([\W\w\n\r]+?)\*\/)
(^[^"'`\n]+\/\*([\W\w\n\r]+?)\*\/)
You will see there are some repeated patterns:
^[^"'`\n]+: From the start of a line, match anything that doesn't include any kind of quote or line break.
` is for ES2015 template literals.
Line breaks are excluded as well to prevent matching empty lines.
Note the + will prevent matching comments that are not padded with at least one space. You can try replacing it with *, but then it will match strings containing glob patterns again.
["'`]+.*["'`]+.*: This matches anything that is between quotes, including anything that looks like a comment but is actually part of a string. Whatever you match after it will be outside that string, so you can use another group to match comments.
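For contrast with the regex, here is a minimal sketch of the parser-style approach recommended at the top of this answer: a single pass that tracks whether it is inside a string, so comment-like text in quotes is never considered. It is not a full JavaScript tokenizer; it assumes there are no regex literals and no ${...} expressions containing quotes, and the name extractComments is invented for the example.

```javascript
// Walk the source once; skip string literals, collect // and /* */ comments.
// Assumptions: no regex literals, no nested quotes inside ${...} in templates.
function extractComments(source) {
  const comments = [];
  let i = 0;
  while (i < source.length) {
    const ch = source[i];
    const next = source[i + 1];
    if (ch === '"' || ch === "'" || ch === '`') {
      // Skip the whole string literal, honouring backslash escapes.
      i++;
      while (i < source.length && source[i] !== ch) {
        if (source[i] === '\\') i++;
        i++;
      }
      i++; // step past the closing quote
    } else if (ch === '/' && next === '/') {
      // Line comment: runs to the end of the line.
      let end = source.indexOf('\n', i);
      if (end === -1) end = source.length;
      comments.push(source.slice(i, end));
      i = end;
    } else if (ch === '/' && next === '*') {
      // Block comment: runs to the first closing */.
      let end = source.indexOf('*/', i + 2);
      end = end === -1 ? source.length : end + 2;
      comments.push(source.slice(i, end));
      i = end;
    } else {
      i++;
    }
  }
  return comments;
}

const sample = [
  'function DoStuff() {',
  '    /* value of foo via http://stackoverflow.com */',
  '    this.url = "//cdnjs.cloudflare.com/core.js";',
  '    //Show a message about foo',
  "    const globs = ['path/to/**/*.tests.js', \"!**/folder/*\"];",
  '}'
].join('\n');

console.log(extractComments(sample));
```

For anything beyond a quick script, an existing JavaScript parser that exposes comments is likely the safer choice, since it also covers the cases this sketch ignores.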

Generating regex programmatically [duplicate]

For example, given the string "2009/11/12" I want to get the regex ("\d{4}/\d{2}/\d{2}"), so I'll be able to match "2001/01/02" too.
Is there something that does that? Something similar? Any ideas as to how to do it?

There is text2re, a free web-based "regex by example" generator.
I don't think this is available in source code, though. I dare to say there is no automatic regex generator that gets it right without user intervention, since this would require the machine knowing what you want.
Note that text2re uses a template-based, modularized and very generalized approach to regular expression generation. The expressions it generates work, but they are much more complex than the equivalent hand-crafted expression. It is not a good tool to learn regular expressions because it does a pretty lousy job at setting examples.
For instance, the string "2009/11/12" would be recognized as a yyyymmdd pattern, which is helpful. The tool transforms it into this 125 character monster:
((?:(?:[1]{1}\d{1}\d{1}\d{1})|(?:[2]{1}\d{3}))[-:\/.](?:[0]?[1-9]|[1][012])[-:\/.](?:(?:[0-2]?\d{1})|(?:[3][01]{1})))(?![\d])
The hand-made equivalent would take up merely two fifths of that (50 characters):
([12]\d{3})[-:/.](0?\d|1[0-2])[-:/.]([0-2]?\d|3[01])\b

It's not possible to write a general solution for your problem. The trouble is that any generator probably wouldn't know what you want to check for, e.g. should "2312/45/67" be allowed too? What about "2009.11.12"?
What you could do is write such a generator yourself that is suited for your exact problem, but a general solution won't be possible.

I've tried a very naive approach:
import java.util.regex.Pattern;

class RegexpGenerator {
    public static Pattern generateRegexp(String prototype) {
        return Pattern.compile(generateRegexpFrom(prototype));
    }

    // Maps each digit to \d and each letter to \w; everything else is copied as-is.
    private static String generateRegexpFrom(String prototype) {
        StringBuilder stringBuilder = new StringBuilder();
        for (int i = 0; i < prototype.length(); i++) {
            char c = prototype.charAt(i);
            if (Character.isDigit(c)) {
                stringBuilder.append("\\d");
            } else if (Character.isLetter(c)) {
                stringBuilder.append("\\w");
            } else { // fall-through: literal (not escaped, so '.' matches any character)
                stringBuilder.append(c);
            }
        }
        return stringBuilder.toString();
    }

    // Checks that the generated pattern at least matches its own prototype.
    private static void test(String prototype) {
        Pattern pattern = generateRegexp(prototype);
        System.out.println(String.format("%s -> %s", prototype, pattern));
        if (!pattern.matcher(prototype).matches()) {
            throw new AssertionError();
        }
    }

    public static void main(String[] args) {
        String[] prototypes = {
            "2009/11/12",
            "I'm a test",
            "me too!!!",
            "124.323.232.112",
            "ISBN 332212"
        };
        for (String prototype : prototypes) {
            test(prototype);
        }
    }
}
output:
2009/11/12 -> \d\d\d\d/\d\d/\d\d
I'm a test -> \w'\w \w \w\w\w\w
me too!!! -> \w\w \w\w\w!!!
124.323.232.112 -> \d\d\d.\d\d\d.\d\d\d.\d\d\d
ISBN 332212 -> \w\w\w\w \d\d\d\d\d\d
As already outlined by others, a general solution to this problem is impossible. This class is applicable only in a few contexts.
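The same naive idea can be sketched in JavaScript with two small changes: runs of the same character class are collapsed into a counted quantifier, and literal characters are escaped. This is only an illustration of the approach above, not a general solution; patternFromExample is a made-up name, and the choice of [A-Za-z] for letters is an assumption.

```javascript
// Sketch: derive an anchored pattern string from one example string.
// Digits become \d, ASCII letters become [A-Za-z], anything else is an escaped literal.
function patternFromExample(example) {
  const escapeLiteral = (c) => c.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const tokens = [];
  for (const c of example) {
    let token;
    if (/[0-9]/.test(c)) token = '\\d';
    else if (/[A-Za-z]/.test(c)) token = '[A-Za-z]';
    else token = escapeLiteral(c);
    // Collapse consecutive identical tokens into one entry with a count.
    const last = tokens[tokens.length - 1];
    if (last && last.token === token) last.count++;
    else tokens.push({ token, count: 1 });
  }
  const body = tokens
    .map(({ token, count }) => (count > 1 ? `${token}{${count}}` : token))
    .join('');
  return `^${body}$`;
}

const datePattern = patternFromExample('2009/11/12');
console.log(datePattern);
console.log(new RegExp(datePattern).test('2001/01/02'));
```

As the other answers point out, the generator cannot know whether "2009.11.12" or a month of 45 should be accepted; it only reproduces the shape of the one example it was given.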

Excuse me, but what you all call impossible is clearly an achievable task. It will not be able to give results for ALL examples, and maybe not the best results, but you can give it various hints, and it will make life easy. A few examples will follow.
Also a readable output translating the result would be very useful.
Something like:
Search for: a word starting with a non-numeric letter and ending with the string "ing".
or: Search for: text that has bbb in it, followed somewhere by zzz
or: Search for: a pattern which looks like "aa/bbbb/cccc" where "/" is a separator, "aa" is two digits, "bbbb" is a word of any length and "cccc" is four digits between 1900 and 2020
Maybe we could make a "back translator" with an SQL type of language to create regex, instead of creating it in geekish.
Here are a few examples that are doable:
class Hint:
Properties: HintType, HintString
enum HintType { Separator, ParamDescription, NumberOfParameters }
enum SampleType { FreeText, DateOrTime, Formatted, ... }
public string RegexBySamples( List<T> samples,
List<SampleType> sampleTypes,
List<Hint> hints,
out string GeneralRegExp, out string description,
out string generalDescription)...
regex = RegexBySamples( {"11/November/1999", "2/January/2003"},
SampleType.DateOrTime,
new HintList( HintType.NumberOfParameters, 3 ));
regex = RegexBySamples( "123-aaaaJ-1444",
SampleType.Formatted, HintType.Separator, "-" );
A GUI where you mark sample text or enter it, adding to the regex would be possible too.
First you mark a date (the "sample"), and choose if this text is already formatted, or if you are building a format, also what the format type is: free text, formatted text, date, GUID or Choose... from existing formats (which you can store in library).
Let's design a spec for this, and make it open source... Anybody want to join?

Loreto pretty much does this. It's an open source implementation that uses the longest common substring(s) to generate the regular expressions. It needs multiple examples, of course.

No, you cannot get a regex that matches what you want reliably, since the regex would not contain semantic information about the input (i.e. it would need to know it's generating a regex for dates). If the issue is with dates only I would recommend trying multiple regular expressions and see if one of them matches all.

I'm not sure if this is possible, at least not without many sample strings and some learning algorithm.
There are many regexes that would match, and it's not possible for a simple algorithm to pick the 'right' one. You'd need to give it some delimiters or other things to look for, so you might as well just write the regex yourself.

Sounds like a machine learning problem. You'll need more than one example on hand (many more), and an indication of whether each example is considered a match.

I don't remember the name, but if my theory of computation cells serve me right, it's impossible in theory :)

I haven't found anything that does it, but since the problem domain is relatively small (you'd be surprised how many people use the weirdest date formats), I've been able to write some kind of "date regular expression generator".
Once I'm satisfied with the unit tests, I'll publish it, just in case someone ever needs something of the kind.
Thanks to everyone who answered (the guy with the (.*) excluded - jokes are great, but this one was sssssssssoooo lame :) )

In addition to feeding the learning algorithm examples of "good" input, you could feed it "bad" input so it would know what not to look for. No letters in a phone number, for example.

Node.js Emoji Parsing

I'm trying to parse an incoming string to determine whether it contains any non-emojis.
I've gone through this great article by Mathias and am leveraging both native punycode for the encoding / decoding and regenerate for the regex generation. I'm also using EmojiData to get my dictionary of emojis.
With that all said, certain emojis continue to be pesky little buggers and refuse to match. For certain emoji, I continue to get a pair of code points.
// Example of a single code point:
console.log(punycode.ucs2.decode('💩'));
>> [ 128169 ]
// Example of a paired code point:
console.log(punycode.ucs2.decode('⌛️'));
>> [ 8987, 65039 ]
Mathias touches on this in his article (and gives an example of punycode working around this) but even using his example I get an incorrect response:
function countSymbols(string) {
    return punycode.ucs2.decode(string).length;
}
console.log(countSymbols('💩'));
>> 1
console.log(countSymbols('⌛️'));
>> 2
What is the best way to detect whether a string contains all emojis or not? This is for a proof of concept so the solution can be as brute force as need be.
--- UPDATE ---
A little more context on my pesky emoji above.
These are visually identical but in fact different unicode values (the second one is from the example above):
⌛ // \u231b
⌛️ // \u231b\ufe0f
The first one works great, the second does not. Unfortunately, the second version is what iOS seems to use (if you copy and paste from iMessage you get the second one, and when receiving a text from Twilio, same thing).

U+FE0F is not a combining mark; it's a variation selector that controls the rendering of the glyph (see this answer). Removing or changing such selectors may change the appearance of the character, for example: U+231B+U+FE0E (⌛︎).
Also, emoji sequences can be made from multiple code points. For example, U+0032 (2) is not an emoji by itself, but U+0032+U+20E3 (2⃣) or U+0032+U+20E3+U+FE0F (2⃣️) is, whereas U+0041+U+20E3 (A⃣) isn't. A complete list of emoji sequences is maintained in the emoji-data.txt file by the Unicode Consortium (the emoji-data-js library appears to have this information).
To check if a string contains emoji characters, you will need to test if any single character is in emoji-data.txt, or starts a substring for a sequence in it.
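A brute-force sketch of that idea, assuming a runtime whose regex engine supports the \p{Extended_Pictographic} Unicode property escape (recent Node.js versions do, as far as I know). It strips the variation selectors first and then checks single code points only, so keycap sequences such as 2⃣ and joined sequences are deliberately not handled. Both function names are invented for this example.

```javascript
// U+FE0E (text style) and U+FE0F (emoji style) only influence rendering.
const VARIATION_SELECTORS = /[\uFE0E\uFE0F]/g;

// Count code points, ignoring variation selectors.
function countVisibleSymbols(str) {
  return Array.from(str.replace(VARIATION_SELECTORS, '')).length;
}

// True only if every remaining code point is a pictographic symbol.
// Assumption: single-code-point emoji only (no keycaps, flags or ZWJ sequences).
function isOnlySimpleEmoji(str) {
  const stripped = str.replace(VARIATION_SELECTORS, '');
  return /^\p{Extended_Pictographic}+$/u.test(stripped);
}

console.log(countVisibleSymbols('\u231B\uFE0F'));          // hourglass + U+FE0F
console.log(isOnlySimpleEmoji('\u231B\uFE0F\u{1F4A9}'));   // hourglass, pile of poo
console.log(isOnlySimpleEmoji('\u231B a'));                // contains a space and a letter
```

Covering the multi-code-point cases would still require the sequence data mentioned above.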

If, hypothetically, you know what non-emoji characters you expect to run into, you can use a little lodash magic via their toArray or split modules, which are emoji aware. For example, if you want to see if a string contains alphanumeric characters, you could write a function like so:
function containsAlphaNumeric(string) {
    // _.toArray splits the string into emoji-aware characters.
    return _(string).toArray().filter(function(char) {
        return /[a-zA-Z0-9]/.test(char);
    }).value().length > 0;
}

Globalize.js not formatting the date only in German culture

I have used Globalize.js to localize and format dates. It all works fine in other cultures, but not in the German culture (de-DE). This is the code I use to format the date:
Globalize.format(new Date(), "MM/yy/dd","de-DE");
It returns "10.14.01", but I am expecting "10/14/01".
What might be the problem? Is this an issue in Globalize? Can anyone please help me get past this headache?
I finally found the cause of the problem, in the globalize.culture.de-DE culture file:
calendars: {
standard: {
"/": ".",
firstDay: 1,
....
.....
}
Some standard values are handled as shown above. Could anyone explain why this code block is used?

The culture de-De is German, use nl-NL instead.

It seems that you are using the old version of Globalize.js, which works rather well but isn’t developed any more, and it can be difficult to find documentation of it except in my book.
The rules for the format argument are somewhat obscure, but when a format like "MM/yy/dd" does not work, put any characters that should appear “as is” inside ASCII apostrophes, in this case
"MM'/'yy'/'dd"
Some punctuation characters can be used inside the format string without such quoting, but when in doubt, quote.

Partial Regexp Match in Javascript

I have a long regex that is generated to match URLs like
/^\/([^\/.?]+)(?:\/([^\/.?]+)(?:\/([^\/.?]+)(?:\.([^\/.?]+))?)?)?$/
Would match:
/foo/bar/1.html
as ['foo', 'bar', '1', 'html']
In Javascript I would like to get the parts that match as the user types the url (like a typeahead). For example if they typed:
/foo
It would tell me that /foo was matched, but the whole regexp hasn't been satisfied. Ruby can return an array with only the matching partial elements like : ['foo', nil, nil, nil] is this possible, or easy to do in Javascript?

@minitech basically gave half the answer: use ? after each group, and then the regex will still match even if those groups are missing. Once you can do that, just check the groups of the regex result to see which bits have been matched and which haven't.
For example:
/^\/([^\/.?]+)?(?:\/([^\/.?]+)?(?:\/([^\/.?]+)?(?:\.([^\/.?]+))?)?)?$/.exec('/ab/c')
Would return:
["/ab/c", "ab", "c", undefined, undefined]
By checking and seeing that the fourth value returned is undefined, you could then figure out which chunks were/were not entered.
As a side note, if you're going to be working with lots of regexes like this, you can easily lose your sanity just trying to keep track of which group is which. For this reason I strongly recommend using "named group" regular expressions. These are otherwise normal regular expressions that you can create if you use the XRegExp library (http://xregexp.com/), like so:
var pattern = XRegExp('^\\/(?<fooPart>[^\\/.?]+)?(?:\\/(?<barPart>[^\\/.?]+)?(?:\\/([^\\/.?]+)?(?:\\.([^\\/.?]+))?)?)?$');
var result = XRegExp.exec('/ab/c', pattern);
var fooPart = result.fooPart;
That library also has other handy features like comments that can similarly help keep regular expression under control. If you're only using this one regex it's probably overkill, but if you're doing lots of JS regexp work I can't recommend that library enough.
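If a library is not an option, engines that implement ES2018 support named groups natively, and the same partial-match check can be sketched without XRegExp. The group names (first, second, third, ext) and the helper matchedParts are hypothetical, chosen only for this example.

```javascript
// Same pattern as above, with every segment optional and natively named.
const urlPattern =
  /^\/(?<first>[^\/.?]+)?(?:\/(?<second>[^\/.?]+)?(?:\/(?<third>[^\/.?]+)?(?:\.(?<ext>[^\/.?]+))?)?)?$/;

// Returns only the parts typed so far, or null if the input cannot
// be the start of a matching URL.
function matchedParts(url) {
  const match = urlPattern.exec(url);
  if (!match) return null;
  return Object.fromEntries(
    Object.entries(match.groups).filter(([, value]) => value !== undefined)
  );
}

console.log(matchedParts('/foo'));
console.log(matchedParts('/foo/bar/1.html'));
```

Groups that did not participate in the match are reported as undefined, which is what the filter relies on to tell typed segments from missing ones.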
