Regular Expression for an entire text - javascript

I have been working on a regular expression to split a text into sentences, but I have been having problems with numbers like 13.4 and with email addresses, because of the '.'. Does anyone know how to fix it?
/([^\n\r\.!\?:;]+[\.!\?:;]\s)|([^\.!\?]+$)/g

As @WiktorStribizew said, this task is too hard to complete with regex alone. But you can still get an approximation by matching punctuation followed by whitespace or the end of a line:
/.*?[.!?:;](\s|[\r\n]|$)/gm
You can see an example here.
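To see how the approximation behaves on the inputs from the question, here is a minimal sketch. The sample text and the `splitSentences` helper name are made up for illustration; the pattern is the one given above.

```javascript
// Hypothetical helper: applies the approximate pattern from the answer above.
function splitSentences(text) {
  const pattern = /.*?[.!?:;](\s|[\r\n]|$)/gm;
  // match() returns null when nothing matches, so fall back to an empty list.
  return (text.match(pattern) || []).map((sentence) => sentence.trim());
}

// Made-up sample containing a decimal number and an email address.
const sample = "The price is 13.4 dollars. Mail me at joe@example.com! Thanks.";
console.log(splitSentences(sample));
```

Because a break is only recognized when the punctuation is followed by whitespace or the end of a line, the dots inside `13.4` and `joe@example.com` do not split the text. Abbreviations followed by a space (such as `e.g. `) still do, and a trailing fragment with no closing punctuation is not matched at all, which is part of why this remains an approximation.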

Related

[Javascript][regex] Why is the pipe character allowed by my regex, which uses the pipe character as an alternation? [duplicate]

Using the following regex with the ExtJS library, the intention is to only allow spaces, dollar signs, underscores, alpha and numeric characters. However, for some reason, the vertical bar/ pipe character is allowed as well. I hope someone can tell me what I am missing here. Am I inadvertently escaping one of the vertical bars?
maskRe:/^[a-z|A-Z|0-9|$|_ ]$/
Thank you kindly for your time!
You don't need the vertical bars inside your character class. Try the following instead:
maskRe:/^[a-zA-Z0-9$_ ]$/
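The reason is that inside a character class `|` is an ordinary literal rather than an alternation operator, so listing it adds the pipe to the allowed set. A small sketch comparing the two patterns outside of ExtJS (the variable names are only for illustration):

```javascript
// Pattern from the question: the | characters are literals inside [...].
const withPipes = /^[a-z|A-Z|0-9|$|_ ]$/;
// Pattern from the answer: same set without the pipe.
const withoutPipes = /^[a-zA-Z0-9$_ ]$/;

for (const ch of ["a", "Z", "7", "$", "_", " ", "|", "-"]) {
  console.log(JSON.stringify(ch), withPipes.test(ch), withoutPipes.test(ch));
}
```

Only the `"|"` row differs between the two columns: the first pattern accepts it and the second rejects it.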

Split a long string into chunks using REGEX 'Lookbehind' as optional

I'm working on a regex that lets me split into chunks a long text that could have #variables# inside. The rules to do the splitting basically are:
Split by each #photo# or #childphoto# variable, looking behind or ahead for text so as not to cut the sentence.
Each chunk should have only one #photo# or #childphoto# variable, or not have any of these variables
Also, the chunk should be less than 350 characters
The chunk should not have to cut words or sentences
The chunk should not have to cut any of the possible text variables into the text #anyOtherVariables#
Currently, I have this Regex
/^.*[\S\s]{0,350}[\s\S](?<=(#photo#|#childphoto#)).*/
It currently works with the JavaScript .match() method to extract the chunks of text that contain the variables, using the 'look behind' approach, but it does not match the other chunks that fail the 'look behind' condition. Is there a way to include the other parts?
Here are the regex and the test case: https://regex101.com/r/kdKHkQ/1
I would really appreciate any help with this.
Here is a single JavaScript regex that does what you have specified:
^\b(?=([^]*))[^]{0,350}$(?<=(?![^]{1,}\1$)(?:#(photo|childphoto)#)?[^]*?)(?<!(?=\1$)(?:[^]*?#(photo|childphoto)#){2}[^]*?)
Demo on regex101
It enforces the 350 character limit by taking a snapshot (using lookahead) at the beginning, consuming and capturing up to 350 characters, and then using a lookbehind to look no further back than the snapshotted beginning, to assert that one of the variables in question is inside the just-captured string. Then it uses a negative lookbehind to enforce that there are not two or more of the variables in question in the just-captured string.
I did not understand your rule "The chunk should not have to cut any of the possible text variables into the text #anyOtherVariables#". If by that you mean that the lines containing variables other than #photo# or #childphoto# should be skipped over (not matched), then this regex does not do that, but it could be easily modified to do so.
Now, practically speaking, it would probably be better to implement this in code, or a combination of code and regex, but this demonstrates that exactly what you asked is possible with a pure regex.
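As one possible reading of "a combination of code and regex", here is a minimal sketch of a greedy chunker. It is not the regex above translated into code: `chunkText` and `countPhotoVariables` are hypothetical names, sentences are assumed to end in `.`, `!` or `?`, and a single sentence that is longer than the limit or contains two photo variables is emitted as-is rather than cut.

```javascript
const PHOTO_VARIABLE = /#(?:photo|childphoto)#/g;

// Counts #photo# and #childphoto# occurrences in a string.
function countPhotoVariables(text) {
  return (text.match(PHOTO_VARIABLE) || []).length;
}

// Hypothetical helper: groups whole sentences into chunks that stay within
// maxLength characters and contain at most one photo variable.
function chunkText(text, maxLength = 350) {
  // Rough sentence split: a run of text, its closing punctuation, trailing whitespace.
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) || [];
  const chunks = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current + sentence;
    const fits =
      candidate.trimEnd().length <= maxLength &&
      countPhotoVariables(candidate) <= 1;
    if (current !== "" && !fits) {
      // Adding this sentence would break a rule, so close the current chunk.
      chunks.push(current.trimEnd());
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current.trim() !== "") chunks.push(current.trimEnd());
  return chunks;
}

console.log(chunkText("A #photo# one. B #childphoto# two. C three."));
```

Since the split happens only at sentence ends, words and other `#variables#` are never cut, as long as the variable names themselves contain no `.`, `!` or `?`.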
I would like to point out that calling this "splitting by each #photo# or #childphoto# variable" is disingenuous, and if I actually took that literally, it would be breaking your other rule, that the chunk should not cut sentences. That is probably why you got downvoted.
I'm posting my answer here, despite the fact that you got downvoted, because I already answered this on reddit and you disappeared without commenting.

Improvement of JS Regex to restrict all letters of a word in a specific range

I'm solving the Ranges challenge in RegexGolf, but I'm somewhat stuck in trying to shorten the regex.
Here is a screenshot of the conditions -
My current solution is \b[a-f]+\b. This pattern puts the required range [a-f] between word boundaries. While this works, the regex has 10 characters, and the result list shows submissions with 8, and even 1 character.
Would appreciate any insights on improving this regex.
First, please note that shorter doesn't necessarily mean better, faster, or more readable. But as this is a golfing challenge:
This site seems to handle every input as a separate string. While the word boundaries you are using are fine, using start and end of string anchors (^ and $) will be 1 character shorter each. I don't see how it could be minimized further, so your regex could be
^[a-f]+$
Note: one of the 1-character solutions carries the comment "i dont know regex but i know javascript", so I'd guess that there was some cheating involved.
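A quick way to check that the two patterns agree on single-word inputs is sketched below. The sample words are made up for illustration; the challenge's actual word lists are not reproduced here.

```javascript
const withBoundaries = /\b[a-f]+\b/; // 10 characters
const withAnchors = /^[a-f]+$/;      // 8 characters

// Hypothetical sample words, each tested as a separate string.
const samples = ["abac", "accede", "faded", "beam", "zebra"];
for (const word of samples) {
  console.log(word, withBoundaries.test(word), withAnchors.test(word));
}

// Pattern lengths, measured on the regex source text.
console.log(withBoundaries.source.length, withAnchors.source.length);
```

The two patterns only diverge when an input contains more than one word: `"bad zebra"` matches the word-boundary version (because of `bad`) but not the anchored one, which is why the anchored form relies on each input being handled as a separate string.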

Formatting a number to specific format via Regex in JavaScript

I'm struggling to get a regex working as expected... this is specifically on an ExtJS textfield via regex attribute, but I don't believe it matters, should be a generic JavaScript/Regex question.
Basically, given a string:
1112223334.56
...I want that to be invalid because there's more than 9 digits to the left of the decimal. I've come up with the following regex:
/^(\d{0,9}.{0,1}\d{0,2})$/
That covers almost all the bases: it properly rejects more than one decimal point, or if there's more than two digits to the right of the decimal... and it properly ACCEPTS if there's no decimal portion, or less than 9 digits to the right. So that's all as it should be. The one case it's NOT rejecting that I need it to is more than 9 digits to the left of the decimal.
I've never been a fan of regex so I've been struggling with this for a while, even though I suspect it's a simple answer. Can anyone point out my stupidity? :) Thanks!
In a regex, '.' is a special character that matches any character (other than a line terminator). To match a literal decimal point, you need to escape it with a backslash:
^(\d{0,9}\.{0,1}\d{0,2})$
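The sketch below compares the pattern from the question with the escaped one on a few made-up inputs. One caveat worth checking for your own use: because the escaped dot is still optional, a run of ten or eleven digits with no decimal point passes the escaped pattern, since the extra digits are absorbed by the trailing `\d{0,2}`. The `stricter` variant is an assumption of mine, not part of the answer above; it only allows fractional digits after an actual dot.

```javascript
const unescaped = /^(\d{0,9}.{0,1}\d{0,2})$/; // from the question
const escaped = /^(\d{0,9}\.{0,1}\d{0,2})$/;  // from the answer
// Assumption: a stricter variant that ties the fractional digits to the dot.
const stricter = /^\d{0,9}(?:\.\d{0,2})?$/;

// Made-up inputs: valid, too long with decimals, ten digits, and a non-digit.
const inputs = ["123456789.12", "1112223334.56", "1112223334", "1a2"];
for (const input of inputs) {
  console.log(input, unescaped.test(input), escaped.test(input), stricter.test(input));
}
```

In an ExtJS textfield, whichever pattern you choose would go in the same `regex` config as before.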

Remove Jargon but keep real characters

I'm getting bombarded by spam with posts like the one below, so what would be the best and most efficient way of removing all the jargon from something like this:
<textarea id="comment">ȑ̉̽ͧ̔͆ͦ̊͛̿͗҉̷̢̧̫̗̗͎͈͕e̷̪͓̼̼̣̻̻͙͔̳̘̗͙̬̱͎ͭ̃͗ͩͯͥͬ̂ͧ͐͌̑̅͢͜ͅd̴̦̺̖̣͎̲̥͕̗̺̯̤͗ͬ͌ͧ̓͒ͭ́̋ͩͥ͊̇̓̌ͫ̃́́͠</textarea>
I'm assuming RegEx, but what exactly are those things called and how would they be referenced in RegExp? The problem lies within a <textarea> tag, and upon retrieving the value, I'd like to be able to remove all that jargon from the value and have it only display the real characters, which in this case should be red.
Allowing other kinds of Unicode characters is essential, but not characters that stack on top of each other.
Zalgo waits behind the wall.
You want to filter out combining characters, such as the diacritical marks listed here.
You should be able to get away with a simple character class pattern match, i.e.:
fooString.replace(/[\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]/g, "");
If you want to limit content to one combination per character (not that this really alleviates all negative side-effects), you could simply use
fooString.replace(/([\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f])[\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]*/g, "$1");
EDIT: Added a number of other combining character ranges. This is most likely still not exhaustive.
Removing combining diacriticals will make input of some languages (such as Vietnamese) difficult or impossible, so you should reconsider.
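A minimal sketch of both replacements follows, assuming the `g` flag so that every match is replaced rather than only the first. The function names and the sample string are made up for illustration. The third function is an assumption of mine aimed at the caveat about languages such as Vietnamese: normalizing to NFC first composes a base letter with its marks into a single precomposed character where Unicode defines one, so ordinary accented letters survive and only the leftover marks are stripped.

```javascript
const ALL_MARKS =
  /[\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]/g;
const MARK_RUNS =
  /([\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f])[\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]*/g;

// Removes every combining mark in the listed ranges.
function stripCombiningMarks(text) {
  return text.replace(ALL_MARKS, "");
}

// Keeps only the first mark of each run of consecutive marks.
function keepOneMarkPerRun(text) {
  return text.replace(MARK_RUNS, "$1");
}

// Assumption: compose to NFC first so precomposed letters are preserved.
function stripAfterComposing(text) {
  return text.normalize("NFC").replace(ALL_MARKS, "");
}

// Made-up sample: "red" with two marks on the r and one on the e.
const stacked = "r\u0301\u0302e\u0303d";
console.log(stripCombiningMarks(stacked)); // "red"
console.log(keepOneMarkPerRun(stacked) === "r\u0301e\u0303d"); // true
```

In the page from the question, the input would be the textarea's value, for example `document.getElementById("comment").value`. Note that the NFC variant still lets spam keep whatever accents compose into precomposed letters, so it reduces stacking rather than removing every accent.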
