How do I "normalise" caps & spaces?
Example: how can coreControllerC4a automatically turn into Core Controller C4a when a function is called?
Here is a starting example:
"coreControllerC4a".split(/(?=[A-Z])/)
which results in
["core", "Controller", "C4a"]
Of course, there are many counter-examples, and this doesn't cover Unicode, but it does show using a regular expression to split a string. (Note the use of a look-ahead to avoid consuming any data at the split.) Now it's just a matter of transforming ("map"ing) each element to the correct case.
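The mapping step might look something like the sketch below. The name humanize is made up for illustration, and like the split itself it assumes plain ASCII identifiers.

```javascript
// Hypothetical helper: split before each capital, then capitalise each piece.
function humanize(identifier) {
  return identifier
    .split(/(?=[A-Z])/) // look-ahead keeps the capital letter in the following piece
    .map(function (part) {
      return part.charAt(0).toUpperCase() + part.slice(1);
    })
    .join(" ");
}

console.log(humanize("coreControllerC4a")); // Core Controller C4a
```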
Happy coding.
I have an application which uses a Javascript-based rules engine. I need a way to convert regular straight quotes into curly (or smart) quotes. It’d be easy to just do a string.replace for ["], only this will only insert one case of the curly quote.
The best way I could think of was to replace the first occurrence of a quote with a left curly quote, the next one with a right curly quote, and so on, alternating between the two.
Is there a way to accomplish this using Javascript?
You could replace all quotes that precede a word character with the left quote, and all that follow a word character with the right quote.
str = str.replace(/"(?=\w)/g, "“");
str = str.replace(/(?<=\w)"/g, "”"); // IF the language supports look-behind. Otherwise, see below.
As pointed out in the comments below, this doesn't take punctuation into account, but easily can:
/(?<=[\w,.?!\)])"/g
[Edit:] For environments that don't support look-behind (Javascript only gained it in ES2018), as long as you replace all the front-facing ones first, you have two options:
str = str.replace(/"/g, "”"); // Replace the rest with right curly quotes
// or...
str = str.replace(/\b"/g, "”"); // Replace any quotes after a word
// boundary with right curly quotes
(I've left the original solution above in case this is helpful to someone using a language that does support look-behind)
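Put together, the two-pass version without look-behind could be sketched like this. The function name smartenQuotes is invented for the example, and the sketch assumes every opening quote sits directly before a word character.

```javascript
// Sketch only: pass 1 handles opening quotes, pass 2 treats the rest as closing.
function smartenQuotes(str) {
  return str
    .replace(/"(?=\w)/g, "\u201C") // a quote directly before a word character opens
    .replace(/"/g, "\u201D");      // every remaining straight quote closes
}

console.log(smartenQuotes('She said "hello, world." Then "bye"!'));
// She said “hello, world.” Then “bye”!
```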
You might want to look at what Pandoc does—apparently with the --smart option, it handles quotes properly in all cases (including e.g. ’tis and ’twere).
I recently wrote a Javascript typography prettification engine that does, among other things, quote replacement; I wound up using basically the algorithm suggested by Renesis, but there’s currently a failing test up waiting for a smarter solution.
If you’re interested in cribbing my code (and/or submitting a patch based on work you’ve done), check it out: jsPrettify. jsprettify.prettifyStr does what you’re looking for. If you don’t want to deal with the Closure dependency, there’s an older version that runs on its own—it even works in Rhino.
'foo "foo bar" "bar"'.replace(/"([-a-zA-Z0-9 ]+)"/g, function(wholeMatch, m1){
return "“" + m1 + "”";
});
The following just changes every quote by alternating (this specific example however would leave out the orphaned quotes).
str = str.replace(/"([^"]*)"/g, "“$1”");
Works well, as long as the text you're texturizing isn't already damaged by unbalanced double quotes. In English, double quotes are not nested directly inside other double quotes.
I don't think something like that in general is easy at all, because you'd have to interpret exactly what each double-quote character in your content means. That said, what I'd do is collect all the text nodes I was interested in, and then go through and keep track of the "on/off" (or "odd/even"; whatever) nature of each double quote instance. Then you can know which replacement entity to use.
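As a rough illustration of the "on/off" bookkeeping, here is a sketch that works on one plain string rather than on collected text nodes. The name curlByParity is made up, and the sketch assumes the quotes in the input are balanced.

```javascript
// Sketch: flip a flag on every straight quote and pick the glyph from the flag.
function curlByParity(text) {
  var open = false; // false = next quote opens, true = next quote closes
  return text.replace(/"/g, function () {
    open = !open;
    return open ? "\u201C" : "\u201D";
  });
}

console.log(curlByParity('say "hi" and "bye"')); // say “hi” and “bye”
```

To apply this across several text nodes, the flag would have to live outside the function so that the state carries over from one node to the next.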
I didn't find the logic I wanted here, so here's what I ended up going with.
value = value.replace(/(^|\s)(")/g, "$1“"); // replace quotes that start a line or follow spaces
value = value.replace(/"/g, "”"); // replace rest of quotes with the back smart quote
I have a small textarea in which I need to replace straight quotes with curly (smart) quotes. I just run this logic on keyup. I tried to make it behave like Microsoft Word.
Posting for posterity.
As suggested by @Steven Dee, I went with Pandoc.
I try to use a mature and tested tool whenever I can, rather than baking my own regex. Hand-built regexes can be overly greedy, or not greedy enough, and they may not be sensitive to word boundaries, commas, etc. Pandoc accounts for most of this and more.
From the command line (the --smart parameter turns on smart quotes):
pandoc --smart --standalone -o output.html input.html
I know a command-line script may or may not fit the OP's requirement of using Javascript. (related: How to execute shell command in Javascript)
While writing an API service for my site, I realized that String.split() won't be enough for much longer, and decided to try my luck with regular expressions. I'm almost there, but I can't work out the last bit. Here is what I want to do:
The URL represents a function call:
/api/SECTION/FUNCTION/[PARAMS]
This last part, including the slash, is optional. Some functions display a JSON reply without having to receive any arguments. Example: /api/sounds/getAllSoundpacks prints a list of available sound packs. By contrast, /api/sounds/getPack/8Bit prints the detailed information for a single pack.
Here is the expression I have tried:
req.url.match(/\/(.*)\/(.*)\/?(.*)/);
What am I missing to make the last part optional - or capture it in whole?
This will capture everything after FUNCTION/ in your URL, independent of the appearance of any further / after FUNCTION/:
FUNCTION\/(.+)$
The RegExp will not match if there is no part after FUNCTION.
This regex should work by making the last slash and the part after it optional:
/^\/[^/]*\/[^/]*(?:\/.*)?$/
This matches all of these strings:
/api/SECTION/FUNCTION/abc
/api/SECTION
/api/SECTION/
/api/SECTION/FUNCTION
Your pattern /(.*)/(.*)/?(.*) was almost correct, it's just a bit too short - it allows 2 or 3 slashes, but you want to accept anything with 3 or 4 slashes. And if you want to capture the last (optional) slash AND any text behind it as a whole, you simply need to create a group around that section and make it optional:
/.*/.*/.*(?:/.+)?
should do the trick.
Demo. (The pattern looks different because multiline mode is enabled, but it still works. It's also a little "better" because it won't match garbage like "///".)
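For the original goal of pulling SECTION, FUNCTION and the optional PARAMS out of the URL, a capturing variant might look like the sketch below. The helper name parseApiUrl, the literal /api prefix and the returned field names are assumptions made for the example, not part of the question's code.

```javascript
// Hypothetical helper: two required segments after /api, then an optional rest.
function parseApiUrl(url) {
  var m = /^\/api\/([^\/]+)\/([^\/]+)(?:\/(.*))?$/.exec(url);
  if (!m) return null; // not an /api/SECTION/FUNCTION URL
  return {
    section: m[1],
    func: m[2],
    params: m[3] // undefined when the optional "/PARAMS" part is absent
  };
}

console.log(parseApiUrl("/api/sounds/getAllSoundpacks"));
console.log(parseApiUrl("/api/sounds/getPack/8Bit"));
```

Using [^\/]+ instead of .* for the first two groups keeps each of them inside a single path segment, so only the last group can contain further slashes.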
I'm looking for a regex to match a numbered-pinyin lexical unit (one or more pinyin syllables, each followed by a tone number, with no spaces).
Reading Regex for Matching Pinyin seems like a good start, as I was able to quickly add support for the tone numeral by doing:
/(ORIGINAL_REGEXP)[0-5]/
So essentially wrapping the old regexp in a group and appending the numeral condition.
However, I'm not able to extend this to the case of multiple words. For instance:
jiao4zuo4zhi1wu4 叫座之物
jiao4zu3 教祖
jiao4zong1xuan3ju3 教宗选举
jiao4zi3 教子
jiao4zhun3yi2qi4 校准仪器
jiao4zhun3tiao2 校准条
jiao4zhun3ti1chi3 校准梯尺
jiao4zhun3quan1 校准圈
jiao4zhun3qi4 校准器
jiao4zhun3pu3 校准谱
N.B.: This expression will be used in a Javascript context.
I might be interpreting your question the wrong way, but couldn't you just add a + for one or more pinyins? I.e.
/((ORIGINAL_REGEXP)[0-5])+/
Here is the regexp I'm using, based on @EagleV_Attnam's solution plus one addition of my own:
/^((ORIGINAL_REGEXP)[0-5])+$/
Adding the start ^ and end $ anchors solved my issue :)
Full regex is:
/^((([mM]iu|[pmPM]ou|[bpmBPM](o|e(i|ng?)?|a(ng?|i|o)?|i(e|ng?|a[no])?|u))|([fF](ou?|[ae](ng?|i)?|u))|([dD](e(i|ng?)|i(a[on]?|u))|[dtDT](a(i|ng?|o)?|e(i|ng)?|i(a[on]?|e|ng|u)?|o(ng?|u)|u(o|i|an?|n)?))|([nN]eng?|[lnLN](a(i|ng?|o)?|e(i|ng)?|i(ang|a[on]?|e|ng?|u)?|o(ng?|u)|u(o|i|an?|n)?|ve?))|([ghkGHK](a(i|ng?|o)?|e(i|ng?)?|o(u|ng)|u(a(i|ng?)?|i|n|o)?))|([zZ]h?ei|[czCZ]h?(e(ng?)?|o(ng?|u)?|ao|u?a(i|ng?)?|u?(o|i|n)?))|([sS]ong|[sS]hua(i|ng?)?|[sS]hei|[sS][h]?(a(i|ng?|o)?|en?g?|ou|u(a?n|o|i)?|i))|([rR]([ae]ng?|i|e|ao|ou|ong|u[oin]|ua?n?))|([jqxJQX](i(a(o|ng?)?|[eu]|ong|ng?)?|u(e|a?n)?))|(([aA](i|o|ng?)?|[oO]u?|[eE](i|ng?|r)?))|([wW](a(i|ng?)?|o|e(i|ng?)?|u))|[yY](a(o|ng?)?|e|in?g?|o(u|ng)?|u(e|a?n)?))[0-5])+$/
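The same wrapping can also be done programmatically, which keeps the long syllable pattern in one place. In the sketch below SYLLABLE is a deliberately crude stand-in for ORIGINAL_REGEXP (it accepts any run of ASCII letters), used only so the example runs; the real syllable pattern would be substituted for it.

```javascript
// SYLLABLE is a placeholder for ORIGINAL_REGEXP, NOT a real pinyin matcher.
var SYLLABLE = /[a-zA-Z]+/;

// Wrap: (syllable + tone digit) repeated one or more times, anchored at both ends.
var numberedPinyin = new RegExp("^(?:(?:" + SYLLABLE.source + ")[0-5])+$");

console.log(numberedPinyin.test("jiao4zhun3yi2qi4")); // true
console.log(numberedPinyin.test("jiao4 zhun3"));      // false (contains a space)
console.log(numberedPinyin.test("jiao4zhun"));        // false (last syllable has no tone)
```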
For example, replace the string Yangomo, Congo, DRC with Yangomo, Congo, <acronym>DRC</acronym>. There may potentially be multiple uppercase substrings in each string. I assume some form of regex?
Thanks.
Well, a really simple one might be:
var replaced = original.replace(/\b([A-Z]+)\b/g, '<acronym>$1</acronym>');
Doing this sort of thing always has complications, however; it depends on the source material. (The "\b" thing matches word boundaries, and is an invaluable trick for all sorts of occasions.)
edit: insightful user Buh Buh points out that it might be nice to only affect strings of two or more characters, which would look like /\b([A-Z]{2,})\b/.
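Applied to the string from the question, the two-or-more variant could be sketched as follows. The wrapper name tagAcronyms is made up, and like the pattern itself it only considers ASCII capitals.

```javascript
// Sketch: wrap every run of two or more capitals that stands as a whole word.
function tagAcronyms(text) {
  return text.replace(/\b([A-Z]{2,})\b/g, "<acronym>$1</acronym>");
}

console.log(tagAcronyms("Yangomo, Congo, DRC"));
// Yangomo, Congo, <acronym>DRC</acronym>
```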
Personally I would use PHP to explode the string, use a regex to find all uppercase letters /[A-Z]+/ and then use PHP to insert the tags (using str_replace).
I am hoping that this will have a pretty quick and simple answer. I am using regular-expressions.info to help me get the right regular expression to turn URL-encoded, ISO-8859-1 pound sign ("%A3"), into a URL-encoded UTF-8 pound sign ("%C2%A3").
In other words I just want to swap %A3 with %C2%A3, when the %A3 is not already prefixed with %C2.
So I would have thought the following would work:
Regular Expression: (?!(\%C2))\%A3
Replace With: %C2%A3
But it doesn't and I can't figure out why!
I assume my syntax is just slightly wrong, but I can't figure it out! Any ideas?
FYI - I know that the following will work (and have used this as a workaround in the meantime), but really want to understand why the former doesn't work.
Regular Expression: ([^\%C2])\%A3
Replace With: $1%C2%A3
TIA!
Why not just replace ((%C2)?%A3) with %C2%A3, making the prefix an optional part of the match? It means that you're "replacing" text with itself even when it's already right, but I don't foresee a performance issue.
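Unfortunately, the (?!) syntax is a negative lookahead, not a lookbehind: it is tested at the position where %A3 begins, so it only checks that the next characters are not %C2, which is always true when they are %A3. What you need is a negative lookbehind, (?<!%C2)%A3, which JavaScript only gained in ES2018; older engines do not support it.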
What you could do is go forward with the replacement anyway, and end up with %C2%C2%A3 strings, but these could easily be converted in a second pass to the desired %C2%A3.
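A minimal sketch of that two-pass idea, assuming the escapes are always written in upper case (the function name fixPoundTwoPass is invented here):

```javascript
// Sketch: prefix every %A3, then collapse the doubled prefix that results
// wherever the input was already correct.
function fixPoundTwoPass(s) {
  return s
    .replace(/%A3/g, "%C2%A3")       // pass 1: may produce %C2%C2%A3
    .replace(/%C2%C2%A3/g, "%C2%A3"); // pass 2: undo the doubling
}

console.log(fixPoundTwoPass("price=%A3100&old=%C2%A35"));
// price=%C2%A3100&old=%C2%A35
```

Note that the second pass would also collapse a %C2%C2%A3 sequence that was in the input on purpose, which seems unlikely but is worth knowing.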
You could replace
(^.?.?|(?!%C2)...)%A3
with
$1%C2%A3
I would suggest you use the functional form of Javascript String.replace (see the section "Specifying a function as a parameter"). This lets you put arbitrary logic, including state if necessary, into a regexp-matching session. For your case, I'd use a simpler regexp that matches a superset of what you want, then in the function call you can test whether it meets your exact criteria, and if it doesn't then just return the matched string as is.
The only problem with this approach is that if you have overlapping potential matches, you have the possibility of missing the second match, since there's no way to return a value to tell the replace() method that it isn't really a match after all.
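For the pound-sign case, the functional form might be sketched like this. The name toUtf8Pound is made up, and the sketch assumes upper-case escapes. The regexp matches a superset (any %A3, with or without the prefix) and the callback decides what to do with each match.

```javascript
// Sketch: capture an optional %C2 prefix and let the callback decide.
function toUtf8Pound(s) {
  return s.replace(/(%C2)?%A3/g, function (match, prefix) {
    // prefix is undefined when %A3 was not preceded by %C2
    return prefix ? match : "%C2%A3";
  });
}

console.log(toUtf8Pound("a=%A3&b=%C2%A3")); // a=%C2%A3&b=%C2%A3
```

Because the optional prefix is part of the match itself, the overlapping-match concern does not arise in this particular case.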