Searching all emails over large document [duplicate]

Searching all emails over large document [duplicate] - javascript

I've seen regex patterns that use explicitly numbered repetition instead of ?, * and +, i.e.:
Explicit Shorthand
(something){0,1} (something)?
(something){1} (something)
(something){0,} (something)*
(something){1,} (something)+
The questions are:
Are these two forms identical? What if you add possessive/reluctant modifiers?
If they are identical, which one is more idiomatic? More readable? Simply "better"?

To my knowledge they are identical. I think there maybe a few engines out there that don't support the numbered syntax but I'm not sure which. I vaguely recall a question on SO a few days ago where explicit notation wouldn't work in Notepad++.
The only time I would use explicitly numbered repetition is when the repetition is greater than 1:
Exactly two: {2}
Two or more: {2,}
Two to four: {2,4}
I tend to prefer these especially when the repeated pattern is more than a few characters. If you have to match 3 numbers, some people like to write: \d\d\d but I would rather write \d{3} since it emphasizes the number of repetitions involved. Furthermore, down the road if that number ever needs to change, I only need to change {3} to {n} and not re-parse the regex in my head or worry about messing it up; it requires less mental effort.
If that criteria isn't met, I prefer the shorthand. Using the "explicit" notation quickly clutters up the pattern and makes it hard to read. I've worked on a project where some developers didn't know regex too well (it's not exactly everyone's favorite topic) and I saw a lot of {1} and {0,1} occurrences. A few people would ask me to code review their pattern and that's when I would suggest changing those occurrences to shorthand notation and save space and, IMO, improve readability.

I can see how, if you have a regex that does a lot of bounded repetition, you might want to use the {n,m} form consistently for readability's sake. For example:
/^
abc{2,5}
xyz{0,1}
foo{3,12}
bar{1,}
$/x
But I can't recall ever seeing such a case in real life. When I see {0,1}, {0,} or {1,} being used in a question, it's virtually always being done out of ignorance. And in the process of answering such a question, we should also suggest that they use the ?, * or + instead.
And of course, {1} is pure clutter. Some people seem to have a vague notion that it means "one and only one"--after all, it must mean something, right? Why would such a pathologically terse language support a construct that takes up a whole three characters and does nothing at all? Its only legitimate use that I know of is to isolate a backreference that's followed by a literal digit (e.g. \1{1}0), but there are other ways to do that.

They're all identical unless you're using an exceptional regex engine. However, not all regex engines support numbered repetition, ? or +.
If all of them are available, I'd use characters rather than numbers, simply because it's more intuitive for me.

They're equivalent (and you'll find out if they're available by testing your context.)
The problem I'd anticipate is when you may not be the only person ever needing to work with your code.
Regexes are difficult enough for most people. Anytime someone uses an unusual syntax, the question
arises: "Why didn't they do it the standard way? What were they thinking that I'm missing?"

Related

The reason to use "-" instead of "_" in the HTML class and id attributes [duplicate]

In the past I've always used underscores for defining class and id attributes in HTML. Over the last few years I changed over to dashes, mostly to align myself with the trend in the community, not necessarily because it made sense to me.
I've always thought dashes have more drawbacks, and I don't see the benefits:
Code completion & Editing
Most editors treat dashes as word separators, so I can't tab through to the symbol I want. Say the class is "featured-product", I have to auto-complete "featured", enter a hyphen, and complete "product".
With underscores "featured_product" is treated as one word, so it can be filled in one step.
The same applies to navigating through the document. Jumping by words or double-clicking on class names is broken by hyphens.
(More generally, I think of classes and ids as tokens, so it doesn't make sense to me that a token should be so easily splittable on hyphens.)
Ambiguity with arithmetic operator
Using dashes breaks object-property access to form elements in JavaScript. This is only possible with underscores:
form.first_name.value='Stormageddon';
(Admittedly I don't access form elements this way myself, but when deciding on dashes vs underscores as a universal rule, consider that someone might.)
Languages like Sass (especially throughout the Compass framework) have settled on dashes as a standard, even for variable names. They originally used underscores in the beginning too. The fact that this is parsed differently strikes me as odd:
$list-item-10
$list-item - 10
Inconsistency with variable naming across languages
Back in the day, I used to write underscored_names for variables in PHP, ruby, HTML/CSS, and JavaScript. This was convenient and consistent, but again in order to "fit in" I now use:
dash-case in HTML/CSS
camelCase in JavaScript
underscore_case in PHP and ruby
This doesn't really bother me too much, but I wonder why these became so misaligned, seemingly on purpose. At least with underscores it was possible to maintain consistency:
var featured_product = $('#featured_product'); // instead of
var featuredProduct = $('#featured-product');
The differences create situations where we have to translate strings unnecessarily, along with the potential for bugs.
So I ask: Why did the community almost universally settle on dashes, and are there any reasons that outweigh underscores?
There is a related question from back around the time this started, but I'm of the opinion that it's not (or shouldn't have been) just a matter of taste. I'd like to understand why we all settled on this convention if it really was just a matter of taste.

Code completion
Whether dash is interpreted as punctuation or as an opaque identifier depends on the editor of choice, I guess. However, as a personal preference, I favor being able to tab between each word in a CSS file and would find it annoying if they were separated with underscore and there were no stops.
Also, using hyphens allows you to take advantage of the |= attribute selector, which selects any element containing the text, optionally followed by a dash:
span[class|="em"] { font-style: italic; }
This would make the following HTML elements have italic font-style:
<span class="em">I'm italic</span>
<span class="em-strong">I'm italic too</span>
Ambiguity with arithmetic operator
I'd say that access to HTML elements via dot notation in JavaScript is a bug rather than a feature. It's a terrible construct from the early days of terrible JavaScript implementations and isn't really a great practice. For most of the stuff you do with JavaScript these days, you'd want to use CSS Selectors for fetching elements from the DOM anyway, which makes the whole dot notation rather useless. Which one would you prefer?
var firstName = $('#first-name');
var firstName = document.querySelector('#first-name');
var firstName = document.forms[0].first_name;
I find the two first options much more preferable, especially since '#first-name' can be replaced with a JavaScript variable and built dynamically. I also find them more pleasant on the eyes.
The fact that Sass enables arithmetic in its extensions to CSS doesn't really apply to CSS itself, but I do understand (and embrace) the fact that Sass follows the language style of CSS (except for the $ prefix of variables, which of course should have been #). If Sass documents are to look and feel like CSS documents, they need to follow the same style as CSS, which uses dash as a delimiter. In CSS3, arithmetic is limited to the calc function, which goes to show that in CSS itself, this isn't an issue.
Inconsistency with variable naming across languages
All languages, being markup languages, programming languages, styling languages or scripting languages, have their own style. You will find this within sub-languages of language groups like XML, where e.g. XSLT uses lower-case with hyphen delimiters and XML Schema uses camel-casing.
In general, you will find that adopting the style that feels and looks most "native" to the language you're writing in is better than trying to shoe-horn your own style into every different language. Since you can't avoid having to use native libraries and language constructs, your style will be "polluted" by the native style whether you like it or not, so it's pretty much futile to even try.
My advice is to not find a favorite style across languages, but instead make yourself at home within each language and learn to love all of its quirks. One of CSS' quirks is that keywords and identifiers are written in lowercase and separated by hyphens. Personally, I find this very visually appealing and think it fits in with the all-lowercase (although no-hyphen) HTML.

Perhaps a key reason why the HTML/CSS community aligned itself with dashes instead of underscores is due to historical deficiencies in specs and browser implementations.
From a Mozilla doc published March 2001 # https://developer.mozilla.org/en-US/docs/Underscores_in_class_and_ID_Names
The CSS1 specification, published in its final form in 1996, did not
allow for the use of underscores in class and ID names unless they
were "escaped." An escaped underscore would look something like this:
p.urgent\_note {color: maroon;}
This was not well supported by browsers at the time, however, and the
practice has never caught on. CSS2, published in 1998, also forbade
the use of underscores in class and ID names. However, errata to the
specification published in early 2001 made underscores legal for the
first time. This unfortunately complicated an already complex
landscape.
I generally like underscores but the backslash just makes it ugly beyond hope, not to mention the scarce support at the time. I can understand why developers avoided it like the plague. Of course, we don't need the backslash nowadays, but the dash-etiquette has already been firmly established.

I don't think anyone can answer this definitively, but here are my educated guesses:
Underscores require hitting the Shift key, and are therefore harder to type.
CSS selectors which are part of the official CSS specifications use dashes (such as pseudo-classes like :first-child and pseudo-elements :first-line), not underscores. Same thing for properties, e.g. text-decoration, background-color, etc. Programmers are creatures of habit. It makes sense that they would follow the standard's style if there's no good reason not to.
This one is further out on the ledge, but... Whether it's myth or fact, there is a longstanding idea that Google treats words separated by underscores as a single word, and words separated by dashes as separate words. (Matt Cutts on Underscores vs. Dashes.) For this reason, I know that my preference now for creating page URLs is to use-words-with-dashes, and for me at least, this has bled into my naming conventions for other things, like CSS selectors.

There are many reasons, but one of the most important thing is maintaining consistency.
I think this article explains it comprehensively.
CSS is a hyphen-delimited syntax. By this I mean we write things like font-size, line-height, border-bottom etc.
So:
You just shouldn’t mix syntaxes: it’s inconsistent.

There's been a clear uptick in hyphen-separated, whole-word segments of URLs over recent years. This is encouraged by SEO best practices. Google explicitly "recommend that you use hyphens (-) instead of underscores (_) in your URLs": http://www.google.com/support/webmasters/bin/answer.py?answer=76329.
As noted, different conventions have prevailed at different times in different contexts, but they typically are not a formal part of any protocol or framework.
My hypothesis, then, is that Google's position anchors this pattern within one key context (SEO), and the trend to use this pattern in class, id, and attribute names is simply the herd moving slowly in this general direction.

I think it's a programmer dependent thing. Someones like to use dashes, others use underscores.
I personally use underscores (_) because I use it in other places too. Such as:
- JavaScript variables (var my_name);
- My controller actions (public function view_detail)
Another reason that I use underscores, is this that in most IDEs two words separated by underscores are considered as 1 word. (and are possible to select with double_click).

point of refactoring only btn to bt
case: btn_pink
search btn in word
result btn
case: btn-pink
search btn in word
result btn | btn-pink
case: btn-pink
search btn in regexp
\bbtn\b(?!-) type to hard
result btn

How to check whether 2 regexp are the same accounting syntax differences?

I'm refactoring a rather large RegExp into a function that returns a RegExp. As a backward-compatibility test, I compared the .source of the returned RegExp with the .source of the old RegExp:
getRegExp(/* in the case requiring backward compatibility there's no arguments */)
.source == oldRegExp.source
However, I've noticed that the old RegExp contains various excessive backslashes like [\.\w] instead of [.\w]. I'd like to refactor such bits, but there's a number of them and it would be nice to have a similar check (backward compability is not broken). The problem is, /[\.\w]/.source != /[.\w]/.source. And identifying which backslashes may be removed automatically is not trivial (\. and . are not the same outside [...] and may be in some other cases).
Are you aware of somewhat simple ways to do so? It seems this can only be done by actual parsing of the .source (compare the example above with /\[\.\w]\/ and /\[.\w]\/), but may be I'm missing some trick of utilizing browser's built-in properties/methods. The point is, '\"' == '"' is true, so strings defined with these different syntaxes are stored as "normalized" values ("), I wonder if such "normalized" pattern is available for a RegExp.

Sadly, comparing two regular expressions to see if they're the same is exactly the same as comparing any other two pieces of code - ie, hard.
The only real way I know of to do this is to create a suite of tests, each one targeting a specific aspect of the regular expression and verifying that it works properly. This is not an easy process-regular expressions are subtle and complex with a lot of potential for unrealized side effects. I recently had to fix some defects in a regex based address parser and it took about a thousand unit tests before I was satisfied with my coverage... but then as soon as I started to change the regex MY TESTS CAUGHT STUFF CONSTANTLY!!
Unit testing sucks and it's just tiring and not fun, but for almost any piece of logic it has real value, and when using powerful tools like regex, I would say it's absolutely crucial.

Comma-separated list of integers with max 5 elements [duplicate]

I've seen regex patterns that use explicitly numbered repetition instead of ?, * and +, i.e.:
Explicit Shorthand
(something){0,1} (something)?
(something){1} (something)
(something){0,} (something)*
(something){1,} (something)+
The questions are:
Are these two forms identical? What if you add possessive/reluctant modifiers?
If they are identical, which one is more idiomatic? More readable? Simply "better"?

To my knowledge they are identical. I think there maybe a few engines out there that don't support the numbered syntax but I'm not sure which. I vaguely recall a question on SO a few days ago where explicit notation wouldn't work in Notepad++.
The only time I would use explicitly numbered repetition is when the repetition is greater than 1:
Exactly two: {2}
Two or more: {2,}
Two to four: {2,4}
I tend to prefer these especially when the repeated pattern is more than a few characters. If you have to match 3 numbers, some people like to write: \d\d\d but I would rather write \d{3} since it emphasizes the number of repetitions involved. Furthermore, down the road if that number ever needs to change, I only need to change {3} to {n} and not re-parse the regex in my head or worry about messing it up; it requires less mental effort.
If that criteria isn't met, I prefer the shorthand. Using the "explicit" notation quickly clutters up the pattern and makes it hard to read. I've worked on a project where some developers didn't know regex too well (it's not exactly everyone's favorite topic) and I saw a lot of {1} and {0,1} occurrences. A few people would ask me to code review their pattern and that's when I would suggest changing those occurrences to shorthand notation and save space and, IMO, improve readability.

I can see how, if you have a regex that does a lot of bounded repetition, you might want to use the {n,m} form consistently for readability's sake. For example:
/^
abc{2,5}
xyz{0,1}
foo{3,12}
bar{1,}
$/x
But I can't recall ever seeing such a case in real life. When I see {0,1}, {0,} or {1,} being used in a question, it's virtually always being done out of ignorance. And in the process of answering such a question, we should also suggest that they use the ?, * or + instead.
And of course, {1} is pure clutter. Some people seem to have a vague notion that it means "one and only one"--after all, it must mean something, right? Why would such a pathologically terse language support a construct that takes up a whole three characters and does nothing at all? Its only legitimate use that I know of is to isolate a backreference that's followed by a literal digit (e.g. \1{1}0), but there are other ways to do that.

They're all identical unless you're using an exceptional regex engine. However, not all regex engines support numbered repetition, ? or +.
If all of them are available, I'd use characters rather than numbers, simply because it's more intuitive for me.

They're equivalent (and you'll find out if they're available by testing your context.)
The problem I'd anticipate is when you may not be the only person ever needing to work with your code.
Regexes are difficult enough for most people. Anytime someone uses an unusual syntax, the question
arises: "Why didn't they do it the standard way? What were they thinking that I'm missing?"

difference with Kleen regex expression

just started to appreciate regex and I am practicing it on regexone.com
my question is given the explanation about kleene "*". I came up with an answer on my own"
[a-c]*
but the solution is:
aa+b*c+ or a*b*c*
is there any differences in terms of behavior with the two? especially if I use it with javascript?
sorry for my bad english.

The problem is insufficiently defined, as there are no negative examples.
For example, if they ask you in medical school "what is the name of the device that amputates", "a car" is technically correct, but probably not what they wanted to hear (as a number of car accidents end up with people with cut off limbs). But had the question been "What is the name of the instrument a medical professional would use to perform an amputation during surgery", the answer can't be "a car" any more.
Similarly, your solution will work for all provided cases, but it is not as precise as theirs. For example, "cba" is recognised by your expression, but is rejected by theirs (at least not as a match of the whole string; a*b*c* trivially matches "cba" as a 0-length match anywhere in the string, and as a 1-length match of the "a" bit). For that matter, .* is also a valid (but completely imprecise) solution to their problem.

There is no difference in the general regex behavior in JavaScript and in other languages. The task basically requires that you give the most restrictive regex that matches the patterns provided. They also provide an alternative answer to show that you can match the pattern with a less restrictive regex as well. The things to look for with the provided patterns are the following:
There are always pairs of a-s: aa, aaaa
There is 0 or more b-s: b, bbbb, no b
There is 1 or more c at the end: cc, c, cc
The letters always come in this order: abc
There is a lot of regex you can come up in order to match these four conditions so you would need to attempt providing the most restrictive one in order to minimize the matches outside of these examples. Still even with the provided answer you will match infinitely many other strings.
An even more restrictive regex would be:
^(aa)+b*c+$
Here the regex requires that the string starts with aa and ends with a c. I assume that the lessons still haven't gotten to ^ and $ and thus the answer provided does not include these.

How To Create This RegExp

I am looking to find this in a string: XXXX-XXX-XXX Where the X is any number.
I need to find this in a string using JavaScript so bonus points to those who can provide me the JavaScript too. I tried to create a regex and came out with this: ^[0-9]{4}\-[0-9]{3}\-[0-9]{3}$
Also, I would love to know of any cheat sheets or programs you guys use to create your regular expressions.

i suppose this is what you want:
\d{4}-\d{3}-\d{3}
in doubt? Google for "RegEx Testers"

With your attempt:
^[0-9]{4}\-[0-9]{3}\-[0-9]{3}$
Since the - is not a metacharacter, there is no need to escape it -- thus you are looking for explicit backslash characters.
Also, you've anchored the match at the beginning and end of the string -- this will match only strings that consist only of your number. (Well, assuming the rest were correct.)
I know most people like the {3} style of counting, but when the thing being matched is a single digit, I find this more legible:
\d\d\d\d-\d\d\d-\d\d\d
Obviously if you wanted to extend this to matching hexadecimal digits, extending this one would be horrible, but I think this is far more legible than alternatives:
\d{4}-\d{3}-\d{3}
[[:digit:]]{4}-[[:digit:]]{3}-[[:digit:]]{3}
[0-9]{4}-[0-9]{3}-[0-9]{3}
Go with whatever is easiest for you to read.
I tend to use the perlre(1) manpage as my main reference, knowing full well that it is far more featureful than many regexp engines. I'm prepared to handle the differences considering how conveniently available the perlre manpage is on most systems.

var result = (/\d{4}\-\d{3}\-\d{3}/).exec(myString);

We Keep Coding

JavaScript is the programming language of the Web.