Complex Regex composition - Regex that match "if" - javascript

I'm making a regex to match hashtags for my project. I want the regex to match hashtags that are separated by a single space, don't contain another hashtag, and only include a space if it is followed by a word (not by another blank space or a #).
I'm really curious to know whether I can do something like an "if" in regular expressions, and I hope you can help me with this.
So, in:
"#hashtag?!-=_" "#hashhash#" "#hash tag" "#hash tag" "#hash #ahuhuhhuasd" "#hash "
The regex must match the following sentences:
"#hashtag?!-=_" "#hashhash" "#hash tag" "#hash" "#hash #ahuhuhhuasd" "#hash"
(all hashtag) (one) (another h.)
Actually, this is my code:
#{1,1}\S+\s{0,1}
You can test this code, but it matches things that aren't desired:
"#ahusdhuas?!__??###hud #ahusdhuads "
It matches the blank space at the end of the string and the three '#' inside the string.
None of that extra content is desired in this string, just "#ahusdhuas?!__??"
Glad if you can help me!

I think this is what you need :
(#(?:\s?[^#\s]+)+)
Here are some tests :

Is any of these what you've been looking for?

Try:
#[^# ]+(?: [^# ]+)*
Match a #, then one or more characters that aren't # or a space, then zero or more instances of (a space followed by one or more characters that aren't # or a space). The ?: makes the group non-capturing.
If you don't want to match ###hud in #ahusdhuas?!__??###hud #ahusdhuads at all because it begins with three #, you can add the negative lookbehind: (?<!#) to the front of the regex:
(?<!#)#[^# ]+(?: [^# ]+)*
However, that works in Ruby but not in older JavaScript engines, since JavaScript had no lookbehind support until ES2018. In that case you'd have to use the #[^# ]+(?: [^# ]+)* pattern and, if the match starts after the first character, test the previous character in the string in your code to see whether it is a #; if so, reject the match the regex returns.
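A minimal sketch of that manual check, assuming an engine without lookbehind. The function name findHashtags is hypothetical and not part of the original answer; it runs the plain pattern and discards any match whose preceding character is a #.

```javascript
// Hypothetical helper: find hashtags without relying on lookbehind.
function findHashtags(text) {
  const pattern = /#[^# ]+(?: [^# ]+)*/g;
  const results = [];
  let match;
  while ((match = pattern.exec(text)) !== null) {
    // Reject the match when the character just before it is another '#'.
    if (match.index > 0 && text[match.index - 1] === '#') continue;
    results.push(match[0]);
  }
  return results;
}

console.log(findHashtags("#ahusdhuas?!__??###hud #ahusdhuads "));
// → [ '#ahusdhuas?!__??', '#ahusdhuads' ]
```

Note that this keeps the second hashtag, #ahusdhuads, because it is preceded by a space rather than a #; whether that is wanted depends on the exact requirements of the question.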

I think I got it, though I'm not accustomed to JavaScript's regular expressions because I only use Python.
I tested the following on regexpal.com, the site given by Monty Wild; it's the only one that showed me all the matched substrings:
(?:^ |^| )(#[^#\s]+(?: [^#\s]+)?)(?:(?=\Z| \Z| \S)| +(?=#))
result
#hashtag?!-=_
#hash tag
#hash
#ahuhuhhuasd
#hash
As JavaScript's regexes don't accept lookbehind assertions, I used a trick so that a hashtag preceded by two or more blanks won't match: those preceding blanks are consumed by the regex engine as trailing blanks of the preceding match. That is the role of the last part of the regex, " +(?=#)": it consumes the trailing blanks of a match when there is more than one. This consumption happens only if the former part, (?=\Z| \Z| \S), didn't match.

Tried this in a standard HTML page and in Firebug as well ...
It works against the inputs you gave.
var hashTags = ["#hashtag?!-=_", "#hashhash#", "#hash tag", "#hash tag", "#hash #ahuhuhhuasd", "#hash ", "#hash #", "#foo bar baz"];
hashTags.forEach(function(el, idx, arr) {
    console.log(el.match(/#([^#\s]|(( [^\s])(?!\s|$)))+/g));
});
// Console output
> ["#hashtag?!-=_"]
> ["#hashhash"]
> ["#hash tag"]
> ["#hash"]
> ["#hash #ahuhuhhuasd"]
> ["#hash"]
> ["#hash"]
> ["#foo bar baz"]

Related

replace '&' character with reg exp in javascript

I'd like to replace the "&" character, along with other characters that may interfere with URL syntax.
So far I tried:
myText = myText.replace(/[^a-zA-Z0-9-. ]/g,'');
That probably works for other characters (I didn't test it), but it didn't remove the "&", which is what I care most about, so I also added the following line, but that didn't get rid of the & either:
myText = myText.replace(/&/g,'');
Neither works. How can I replace this special character?
SOLUTION:
Code was reading &amp; at delivery and not &, so I had to do:
myText = myText.replace(/&amp;/g,'');
and it works.
SNIPPET:
var text = "god & damn it";
console.log(text.replace(/&amp;|&/g,''));
According to your comments, what you are trying to replace is &amp;, the HTML encoding of the & character.
With lodash you can _.unescape the string before replacing:
myText = _.unescape(myText).replace(/&/g, '');
This way you handle both the & and &amp; cases. Then, if you have to append that text to the HTML, you should _.escape it back to prevent weird side effects: _.escape(myText);.
Without lodash you can just search both in your regex:
myText = myText.replace(/&amp;|&/g, '');
But this method can have its side effects when other special characters are present, because it removes the bare & character too. For example, the string "Three is &gt; than two &amp; one" would end up looking like "Three is gt; than two one" (notice the ugly gt; in the middle).
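One way to avoid that side effect, sketched here under the assumption that other entities such as &gt; should be left untouched: only remove a bare & when it does not start an entity. The helper name stripAmpersands and the lookahead are illustrative additions, not part of the original answer.

```javascript
// Hypothetical helper: remove "&amp;" and bare "&" but keep other entities intact.
function stripAmpersands(text) {
  // First alternative: the encoded ampersand.
  // Second alternative: a raw "&" that does not start an entity such as "&gt;" or "&#39;".
  return text.replace(/&amp;|&(?!#?\w+;)/g, '');
}

console.log(stripAmpersands('Three is &gt; than two &amp; one'));
// → "Three is &gt; than two  one"
```

The lookahead is deliberately loose: any & followed by word characters and a semicolon is treated as an entity, which may be too permissive for some inputs.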
console.log("m&yText".replace(/\&/g,''))
I can suggest adding a backslash before the & to escape it, so that the regex finds and replaces any literal & character. (In JavaScript regular expressions & has no special meaning, so /&/g matches the same thing.)

Trying to write a regex where a newline may appear anywhere in a group

I'm trying to make a regex divide text into two parts and ignore everything that comes after these two parts.
The (insufficient) regex I'm trying to use is:
/Artikelnummer(?:(&&&))(.*)(?:\s*.*)\W?(?:Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite&&&)((.*)&&&(.*)&&&(\d)+)*/
The text I'm matching is saved at these links:
https://regex101.com/r/VDnUoe/1
https://regex101.com/r/j62Mw0/2
Part 1) Everything after Artikelnummer and before Dokumentation... (easy to match)
Part 2) Everything after (?:Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite&&&) that follows the pattern:
text&&&text&&&digits
In one of the above links, the above pattern works except for a new line that is thrown in, which causes some text to be left out that should be included.
The first part is matched:
all&&&Vorwort&&&1&&&all&&&Sicherheit&&&2&&&all&&&Richtlinien und Normen&&&3&&&all&&&Produktbeschreibung&&&4&&&all&&&Installation&&&5&&&all&&&Wichtige Informationene zur Inbetriebnahme&&&6&&&all&&&Projektierung - Wichtige Infos&&&7&&&all&&&Anhang 1&&&8&&&all&&&Anhang 2&&&9&&&all&&&Anhang 3&&&10&&&all&&&Anhang 4&&&11&&&all&&&Anhang 5&&&12&&&all&&&Anhang 6&&&13&&&all&&&Anhang 7&&&14&&&all&&&Anhang 8&&&15&&&all&&&Anhang 9&&&16&&&all&&&Anhang 10&&&17&&&all&&&Anhang 11&&&18&&&all&&&Anhang 12&&&19&&&all&&&Anhang 13&&&20&&&all&&&Anhang 14&&&21&&&all&&&Anhang 15&&&22&&&all&&&Anhang 16&&&23&&&all&&&Anhang 17&&&24&&&all&&&Anhang 18&&&25&&&all&&&Anhang 19&&&26&&&all&&&Anhang 20&&&27&&&all&&&Anhang 21&&&28&&&all&&&Anhang 22&&&29&&&all&&&Anhang 23&&&30&&&all&&&Anhang 24&&&31&&&all&&&Anhang 25&&&32&&&all&&&Anhang 26&&&33
And then this isn't matched, because a newline is inserted:
all&&&Anhang 27&&&34&&&all&&&Anhang 28&&&35&&&all&&&Anhang 29&&&36&&&all&&&Anhang 30&&&37&&&all&&&Anhang 31&&&38&&&all&&&Anhang 32&&&39&&&all&&&Anhang 33&&&40&&&all&&&Anhang 34&&&41&&&all&&&Anhang 35&&&42&&&all&&&Anhang 36&&&43&&&all&&&Anhang 37&&&44&&&all&&&Anhang 38&&&45
My question is, how can this regex be rewritten so that a newline could theoretically be placed anywhere within the second part of the text and still match everything I want?
I'm not sure this is what you want, anyway this regex works with newlines too:
Artikelnummer(?:(&&&))(.*)(?:\s*.*)\W?(?:Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite&&&)((.*)&&&(.*)&&&(\d)+(\n?)*)*
\n matches newline
? is the quantifier for zero or one (if newline is found or not)
* I added this one in case more than one newline is encountered
I would try a regex like this:
(Artikelnummer([\n|\r| |\S]*)(?=Dokumentation))(([\n|\r| |\S]*&&&){2}\d+)*
This looks for \n, \r, spaces and all other non-space characters.
Second, I wouldn't use ?: for every group. The positive lookahead ?= should give you what you need for the first group.
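As an alternative to making the pattern itself tolerate newlines, the line breaks can be stripped from the second part before matching. This is a different technique from the regexes above, shown only as a rough sketch: the function name parseEntries, the field names, and the sample input are assumptions, and it presumes a newline never replaces a meaningful character.

```javascript
// Hypothetical helper: extract "text&&&text&&&digits" entries that follow the header,
// ignoring any line breaks inside that part of the text.
function parseEntries(text) {
  const header = 'Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite&&&';
  const start = text.indexOf(header);
  if (start === -1) return [];
  // Drop every line break after the header so it cannot interrupt an entry.
  const body = text.slice(start + header.length).replace(/[\r\n]+/g, '');
  const entryPattern = /([^&]+)&&&([^&]+)&&&(\d+)/g;
  return Array.from(body.matchAll(entryPattern), (m) => ({
    kks: m[1],
    description: m[2],
    page: Number(m[3]),
  }));
}

// Made-up sample input with a newline in the middle of the entry list.
const sample =
  'Artikelnummer&&&12345\n' +
  'Dokumentation&&&KKS-Nummer&&&Beschreibung&&&Seite&&&' +
  'all&&&Vorwort&&&1&&&all&&&Sicherheit&&&2&&&\nall&&&Anhang 27&&&34';
console.log(parseEntries(sample).length); // → 3
```

Text that follows the entry list and happens to fit the same shape would also be picked up, so real input may need an explicit end marker.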

JavaScript's negative look-ahead doesn't work as expected?

I have some data in a textarea :
(yes it is multiline)
"#ObjectTypeID", DbType.In
"#ObjectID", DbType.Int32,
"#ClaimReasonID", DbType.I
"#ClaimReasonDetails", DbTy
"#AccidendDate", DbType.Da
"#AccidendPlaceID", DbType
"#AccidendPlaceDetails", Db
"#TypeOfMedicalTreatment",
"#MedicalTreatmentDate", Db
"#CreatedBy", DbType.Int32
"#Member_ID", DbType.Strin
.ExecuteScalar(command).ToS
In each row, I want to remove this section: from the closing " (inclusive) to the end of the row.
I've managed to do this :
value=value.replace(/\"[a-z,. ]+(?!.*\")/gi,'')
Which means: search for the first " that has characters after it and has no later " following.
This will yield the required results :
"#ObjectTypeID
"#ObjectID32,
"#ClaimReasonID
"#ClaimReasonDetails
"#AccidendDate
"#AccidendPlaceID
"#AccidendPlaceDetails
"#TypeOfMedicalTreatment
"#MedicalTreatmentDate
"#CreatedBy32
"#Member_ID
.ExecuteScalar(command).ToS
Question:
I understand why it is working, but I don't understand why the following is not working:
value=value.replace(/\".+(?!.*\")/gi,'')
http://jsbin.com/fanep/4/edit
I mean: it is supposed to search for a " that has characters after it and has no later " ....
What am I missing ? I really hate to declare [a-z,. ]
+ is greedy. Since "the whole thing" matches your rule of "must not have a " after", it will go with that.
The reason your first regex works is because you are disallowing most characters by explicitly whitelisting certain ones.
To fix, try adding ? after the + - this will make it lazy instead, matching as little as possible while still meeting the rules.
Additionally, you are searching for the stuff you want to keep... and then deleting it.
Try this instead:
val = val.replace(/"[^"]*(?=[\r\n]|$)/g,'');
This will remove everything from the last " to the end of a line (or end of the input).
value=value.replace(/\"[a-z,. ]+(?!.*\")/gi,'')
means: search for the first " that has characters after it and has no later " following
To be exact: It matches the first " that has some of the characters [a-z,. ] after it, which then is not (in any distance) followed by another ".
I don't understand why the following is not working:
value=value.replace(/\".+(?!.*\")/gi,'')
You have removed the restriction of the character class. .+ will now match any character, including quotes. Regardless of whether it is greedy or not, it will now find the first " that is followed by some run of characters (including other quotes) after which no more quotes follow; i.e. it suffices for .+ to match up to and including the last quote.
I really hate to declare [a-z,. ]
You can just use the class of all characters except quotes: [^"]. Indeed, I think the following lookahead-free version matches your intent better:
value = value.replace(/"[^"\n\r]*$/gm, '');
The one that doesn't work fails because the .+ is greedy: it eats up all it can. (Visual tools can help here, such as this one: http://regex101.com/r/eJ5kJ2/1) We can make it clearer that .+ is matching too much by putting it in a capture group: http://regex101.com/r/qF7nR9/1 The group captures everything up to the end of the line.
In your one that does work (http://regex101.com/r/kR8vL6/1), you've changed that to [a-z,. ]+, which means "one or more a to z, comma, period, or space" (note that the . there is just a period, not a wildcard). That's much more limited (in particular, it doesn't include #).
Side note: There's no need to escape the " with a backslash, " isn't a special character in regular expressions.
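The behaviour discussed in these answers can be checked on a single row. The snippet below is only an illustrative sketch: the variable names are made up, and the last pattern is one possible quote-excluding, line-anchored variant rather than any particular answer's exact regex.

```javascript
const line = '"#ObjectID", DbType.Int32,';

// Greedy: .+ runs to the end of the line, where no quote follows, so the whole line matches.
const greedy = line.replace(/".+(?!.*")/g, '');

// Lazy: .+? still has to expand past the second quote before the lookahead passes,
// so the quoted name is removed instead of the tail.
const lazy = line.replace(/".+?(?!.*")/g, '');

// Excluding quotes and anchoring at the end of each line removes only the tail.
const anchored = line.replace(/"[^"\r\n]*$/gm, '');

console.log(JSON.stringify([greedy, lazy, anchored]));
// → ["",", DbType.Int32,","\"#ObjectID"]
```

The m flag makes $ match at the end of every line, so the same replace can be applied to the whole textarea value at once.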
Why is the regex below not working?
\".+(?!.*\")
Answer:
\" matches the first " and the following .+ then matches greedily up to the last character of the line. Because the last character in a line isn't followed by zero or more characters plus a ", the above regex undoubtedly matches the whole line.
For your case, you could simply use the regex below to match from the second " up to the end-of-line anchor.
\"[^"\n]*$

Capture words not followed by symbol

I need to capture all (English) words except abbreviations whose pattern is:
"_any-word-symbols-including-dash."
(so there is an underscore at the beginning, a dot at the end, and any letters and dashes in the middle)
I tried something like this:
/\b([A-Za-z-^]+)\b[^\.]/g
but it seems that I don't understand how to work with negative matches.
UPDATE:
I need not just to match the words but to wrap them in some tags. For
"a some words _abbr-abbr. a here" I should get:
<w>a</w> <w>some</w> <w>words</w> _abbr-abbr. <w>a</w> <w>here</w>
So I need to use replace with correct regex:
test.replace(/correct regex/, '<w>$1</w>')
Negative lookahead is written (?!...).
So you can use:
/\b([^_\s]\w*(?!\.))\b/g
Unfortunately, JavaScript had no lookbehind before ES2018, so in older engines you can't do a similar trick for "not prefixed by _".
Example:
> a = "a some words _abbr. a here"
> a.replace(/\b([^_\s]\w*(?!\.))\b/g, "<w>$1</w>")
"<w>a</w> <w>some</w> <w>words</w> _abbr. <w>a</w> <w>here</w>"
Following your comment about -, the updated regex is:
/\b([^_\s\-][\w\-]*(?!\.))\b/g
> "abc _abc-abc. abc".replace(/\b([^_\s\-][\w\-]*(?!\.))\b/g, "<w>$1</w>")
"<w>abc</w> _abc-abc. <w>abc</w>"

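A different way to get the same wrapping without any lookaround is to match abbreviations and words with one alternation and decide in a replace callback what to do with each token. This is a swapped-in technique rather than the answer's regex, offered as a sketch; the name wrapWords is hypothetical, and a token like _abbr with no trailing dot would have its letters treated as an ordinary word.

```javascript
// Hypothetical helper: wrap plain words in <w> tags, leave "_abbr-abbr." tokens alone.
function wrapWords(text) {
  // First alternative consumes a whole abbreviation so its letters are never seen as words.
  const token = /_[A-Za-z-]+\.|[A-Za-z]+(?:-[A-Za-z]+)*/g;
  return text.replace(token, (m) => (m.startsWith('_') ? m : `<w>${m}</w>`));
}

console.log(wrapWords('a some words _abbr-abbr. a here'));
// → "<w>a</w> <w>some</w> <w>words</w> _abbr-abbr. <w>a</w> <w>here</w>"
```
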
javascript url-safe filename-safe string

Looking for a regex/replace function to take a user inputted string say, "John Smith's Cool Page" and return a filename/url safe string like "john_smith_s_cool_page.html", or something to that extent.
Well, here's one that replaces anything that's not a letter or a number, and makes it all lower case, like your example.
var s = "John Smith's Cool Page";
var filename = s.replace(/[^a-z0-9]/gi, '_').toLowerCase();
Explanation:
The regular expression is /[^a-z0-9]/gi. Well, actually the gi at the end is just a set of options that are used when the expression is used.
i means "ignore upper/lower case differences"
g means "global", which really means that every match should be replaced, not just the first one.
So what we're looking at is really just [^a-z0-9]. Let's read it step-by-step:
The [ and ] define a "character class", which is a list of single-characters. If you'd write [one], then that would match either 'o' or 'n' or 'e'.
However, there's a ^ at the start of the list of characters. That means it should match only characters not in the list.
Finally, the list of characters is a-z0-9. Read this as "a through z and 0 through 9". It's a short way of writing abcdefghijklmnopqrstuvwxyz0123456789.
So basically, what the regular expression says is: "Find every character that is not between 'a' and 'z' or between '0' and '9'".
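Building on that explanation, here is a small sketch that also produces the ".html" name from the question. The function name toSafeFilename and the default extension are assumptions; unlike the one-liner above, it collapses runs of unsafe characters into a single underscore and trims underscores at the ends.

```javascript
// Hypothetical helper: turn a title into a lower-case, underscore-separated filename.
function toSafeFilename(title, extension = '.html') {
  const base = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '_') // collapse each run of other characters into one "_"
    .replace(/^_+|_+$/g, '');    // drop underscores left at either end
  return base + extension;
}

console.log(toSafeFilename("John Smith's Cool Page"));
// → "john_smith_s_cool_page.html"
```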
I know the original poster asked for a simple Regular Expression, however, there is more involved in sanitizing filenames, including filename length, reserved filenames, and, of course reserved characters.
Take a look at the code in node-sanitize-filename for a more robust solution.
For more flexible and robust handling of Unicode characters etc., you could use the slugify package in conjunction with a regex to remove unsafe URL characters:
const urlSafeFilename = slugify(filename, { remove: /["<>#%{}|\\^~\[\]`;?:=&]/g });
This produces nice kebab-case filenames in your URL and allows for more characters outside the a-z0-9 range.
Here's what I did. It works to convert full sentences into a decently clean URL.
First it trims the string, then it converts spaces to dashes (-), then it gets rid of anything that's not a letter/number/dash
function slugify(title) {
    return title
        .trim()
        .replace(/ +/g, '-')
        .toLowerCase()
        .replace(/[^a-z0-9-]/g, '');
}
slug.value = slugify(text.value);
text.oninput = () => { slug.value = slugify(text.value); };
<input id="text" value="Foo: the old #Foobîdoo!! " style="font-size:1.2em">
<input id="slug" readonly style="font-size:1.2em">
I think your requirement is to replace whitespace and the apostrophe in 's with _ and append .html at the end. Try to find a regex that does that.
refer
http://www.regular-expressions.info/javascriptexample.html
