Regex to match words with hyphens and/or apostrophes - javascript

I was looking for a regex to match words with hyphens and/or apostrophes. So far, I have:
(\w+([-'])(\w+)?[']?(\w+))
and that works most of the time, though if there's an apostrophe and then a hyphen, like "qu'est-ce", it doesn't match. I could append more optionals, though perhaps there's another more efficient way?
Some examples of what I'm trying to match: Mary's, High-school, 'tis, Chambers', Qu'est-ce.

use this pattern
(?=\S*['-])([a-zA-Z'-]+)
Demo
(?= # Look-Ahead
\S # <not a whitespace character>
* # (zero or more)(greedy)
['-] # Character in ['-] Character Class
) # End of Look-Ahead
( # Capturing Group (1)
[a-zA-Z'-] # Character in [a-zA-Z'-] Character Class
+ # (one or more)(greedy)
) # End of Capturing Group (1)
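A minimal sketch of how this pattern behaves on the question's sample words. The extra word "plain" and the variable names are assumptions added for illustration; the g flag is added so that every match is returned.

```javascript
// Sample words from the question, plus "plain" (an assumption added here)
// to show a word that should NOT match.
const sampleText = "Mary's High-school 'tis Chambers' Qu'est-ce plain";

// The lookahead requires an apostrophe or hyphen somewhere in the upcoming
// run of non-whitespace characters; the group then captures the word itself.
const punctuatedWord = /(?=\S*['-])([a-zA-Z'-]+)/g;

const punctuatedMatches = sampleText.match(punctuatedWord);
console.log(punctuatedMatches);
// [ "Mary's", "High-school", "'tis", "Chambers'", "Qu'est-ce" ]
```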

[\w'-]+ would match pretty much any occurrence of words with (or without) hyphens and apostrophes, but also in cases where those characters are adjacent.
(?:\w|['-]\w)+ should match cases where the characters can't be adjacent.
If you need to be sure that the word contains hyphens and/or apostrophes and that those characters aren't adjacent maybe try \w*(?:['-](?!['-])\w*)+. But that would also match ' and - alone.
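To compare the three patterns side by side, here is a rough sketch. The ^ and $ anchors are an assumption added so that each test checks a whole word rather than a substring; the variable names are illustrative only.

```javascript
// Anchored copies of the three patterns (anchors are an assumption).
const loose = /^[\w'-]+$/;                        // adjacent marks allowed
const noAdjacent = /^(?:\w|['-]\w)+$/;            // each mark must precede a word char
const mustContain = /^\w*(?:['-](?!['-])\w*)+$/;  // at least one mark, none adjacent

console.log(loose.test("some--thing"));       // true
console.log(noAdjacent.test("some--thing"));  // false
console.log(noAdjacent.test("Qu'est-ce"));    // true
console.log(mustContain.test("plain"));       // false (no hyphen or apostrophe)
console.log(mustContain.test("Qu'est-ce"));   // true
console.log(mustContain.test("'"));           // true (a lone apostrophe still matches)
```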

debuggex.com is a great resource for visualizing these sorts of things
\b\w*[-']\w*\b should do the trick

The problem you're running into is that you actually have three possible sub-patterns: one or more chars, an apostrophe followed by one or more chars, and a hyphen followed by one or more chars.
This presumes you don't wish to accept words that begin or end with apostrophes or hyphens or have hyphens next to apostrophes (or vice versa).
I believe the best way to represent this in a RegExp would be:
/\b[a-z]+(?:['-]?[a-z]+)*\b/
which is described as:
\b # word boundary
[a-z]+ # one or more letters
(?: # start non-capturing group
['-]? # zero or one apostrophe or hyphen
[a-z]+ # one or more letters
)* # end of non-capturing group, zero or more times
\b # word boundary
which will match any word that begins and ends with a letter and can contain zero or more groups of an optional apostrophe or hyphen followed by one or more letters. Note that [a-z] only covers lowercase letters; add the i flag to match uppercase letters too.
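A quick sketch of what the pattern above returns for the question's examples. The g and i flags are assumptions added here (g to collect all matches, i so that capitalised words match); note that leading and trailing apostrophes are left out of the match, as the answer intends.

```javascript
// Same pattern as above, with g and i flags added (an assumption).
const wordPattern = /\b[a-z]+(?:['-]?[a-z]+)*\b/gi;

const words = "Mary's High-school 'tis Chambers' Qu'est-ce".match(wordPattern);
console.log(words);
// [ "Mary's", "High-school", "tis", "Chambers", "Qu'est-ce" ]
```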

How about: \'?\w+([-']\w+)*\'?
demo
I suppose these words shouldn't be matched:
something- or -something: starts or ends with -
some--thing or some'-thing: - or ' not followed by a word character
some'': two apostrophes
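The accept/reject lists above can be checked with a small sketch. The ^ and $ anchors are an assumption so that each candidate is tested as a whole word, and \' is written as a plain ' since the backslash is not needed in a JavaScript regex literal.

```javascript
// Anchored version of the pattern (anchors are an assumption).
const quotedWord = /^'?\w+([-']\w+)*'?$/;

const shouldMatch = ["Mary's", "High-school", "'tis", "Chambers'", "Qu'est-ce"];
const shouldFail = ["something-", "-something", "some--thing", "some'-thing", "some''"];

const accepted = shouldMatch.filter((w) => quotedWord.test(w));
const rejected = shouldFail.filter((w) => !quotedWord.test(w));

console.log(accepted.length === shouldMatch.length); // true
console.log(rejected.length === shouldFail.length);  // true
```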

This worked for me:
([a-zA-Z]+'?-?[a-zA-Z]+(-?[a-zA-Z])?)|[a-zA-Z]

Use
([\w]+[']*[\w]*)|([']*[\w]+)
It will properly parse
"You've and we i've it' '98"
(supports ' in any place in the word but single ' is ignored).
If needed \w could be replaced with [a-zA-Z] etc.
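For reference, a short sketch of this pattern on the sample sentence. The g flag and the second input are assumptions added to show that a lone apostrophe is skipped.

```javascript
// Pattern from the answer above, with the g flag added (an assumption).
const tokenPattern = /([\w]+[']*[\w]*)|([']*[\w]+)/g;

const tokens = "You've and we i've it' '98".match(tokenPattern);
console.log(tokens);
// [ "You've", "and", "we", "i've", "it'", "'98" ]

// A stand-alone apostrophe between words is not returned as a token.
const loneApostrophe = "a ' b".match(tokenPattern);
console.log(loneApostrophe);
// [ "a", "b" ]
```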


regex to replace regular quotes with curly quotes

I have a block of text where the opening and closing quotes are same
"Hey", How are you? "Hey there"... “Some more text” and some more "here".
Please note that the quote character is " and not “ ” these characters
(["'])(?:(?=(\\?))\2.)*?\1
I want to replace the opening " character as “
it will now look as
“Hey", How are you? “Hey there"... “Some more text” and some more “here".
and then, running it again, I can simply find and replace the leftover " occurrences with ”
and that would give the expected output which should look as
“Hey”, How are you? “Hey there”... “Some more text” and some more “here”.
My preference would be for the solution given by @WiktorStribiżew in a comment on the question, but I wish to give an alternative solution that may be of interest to some readers.
The second replacement of the remaining (trailing) double-quotes (i.e., ASCII 34) is straightforward, so I will not discuss that.
You could match leading double-quotes with the following regular expression, and then replace each match with “:
"(?=(?:(?:[^"]*"){2})*[^"]*"[^"]*$)
Demo
This regex is based on the observation that we want to identify all double-quotes that are followed later in the string by an odd number of double-quotes (assuming the string contains an even number of double-quotes).
The regular expression can be broken down as follows.
" # match a double-quote (dq)
(?= # begin a positive lookahead
(?: # begin a non-capture group
(?: # begin a non-capture group
[^"]*" # match 0+ chars other than dq then match dq
){2} # end non-capture group and execute it twice
)* # end non-capture group and execute it 0+ times
[^"]*"[^"]* # match dq preceded and followed by 0+ non-dq chars
$ # match end of string
) # end positive lookahead
If the data set is large it may be advisable to perform some benchmarking to see if execution speed is satisfactory.
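Putting the two replacements together in JavaScript might look like the sketch below. The input is the sample from the question; the variable names are illustrative, and the second step is simply a global replace of whatever straight quotes remain.

```javascript
const original =
  '"Hey", How are you? "Hey there"... “Some more text” and some more "here".';

// Step 1: a straight quote followed by an odd number of straight quotes
// is an opening quote, so replace it with “.
const openingQuote = /"(?=(?:(?:[^"]*"){2})*[^"]*"[^"]*$)/g;
const opened = original.replace(openingQuote, "“");

// Step 2: every straight quote still left is a closing quote.
const curly = opened.replace(/"/g, "”");

console.log(opened);
// “Hey", How are you? “Hey there"... “Some more text” and some more “here".
console.log(curly);
// “Hey”, How are you? “Hey there”... “Some more text” and some more “here”.
```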

How to modify this hashtag regex to check if the second character is a-z or A-Z?

I'm building on a regular expression I found that works well for my use case. The purpose is to check for what I consider valid hashtags (I know there's a ton of hashtag regex posts on SO but this question is specific).
Here's the regex I'm using
/(^|\B)#(?![0-9_]+\b)([a-zA-Z0-9_]{1,20})(\b|\r)/g
The only problem I'm having is I can't figure out how to check if the second character is a-z (the first character would be the hashtag). I only want the first character after the hashtag to be a-z or A-Z. No numbers or non-alphanumeric.
Any help much appreciated, I'm very novice when it comes to regular expressions.
As I mentioned in the comments, you can replace [a-zA-Z0-9_]{1,20} with [a-zA-Z][a-zA-Z0-9_]{0,19} so that the first character is guaranteed to be a letter and then followed by 0 to 19 word characters (alphanumeric or underscore).
However, there are other unnecessary parts in your pattern. It appears that all you need is something like this:
/(?:^|\B)#[a-zA-Z][a-zA-Z0-9_]{0,19}\b/g
Demo.
Breakdown of (?:^|\B):
(?: # Start of a non-capturing group (don't use a capturing group unless needed).
^ # Beginning of the string/line.
| # Alternation (OR).
\B # The opposite of `\b`. In other words, it makes sure that
# the `#` is not preceded by a word character.
) # End of the non-capturing group.
Note: You may also replace [a-zA-Z0-9_] with \w.
References:
Word Boundaries.
Difference between \b and \B in regex.
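A small sketch of the simplified pattern in use. The sample string and variable names are made up for illustration; the last two checks show the 20-character limit at work.

```javascript
const hashtagPattern = /(?:^|\B)#[a-zA-Z][a-zA-Z0-9_]{0,19}\b/g;

// Hypothetical input: only tags whose first character is a letter, and
// whose # is not glued to a preceding word character, should match.
const hashtagSample = "#hello #1abc #a1 foo#bar #_x";
const hashtags = hashtagSample.match(hashtagPattern);
console.log(hashtags);
// [ "#hello", "#a1" ]

// A tag of 20 word characters matches; 21 is too long, so nothing matches.
const twentyChars = ("#" + "a".repeat(20)).match(hashtagPattern);
const twentyOneChars = ("#" + "a".repeat(21)).match(hashtagPattern);
console.log(twentyChars !== null, twentyOneChars); // true null
```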
The below should work.
(^|\B)#(?![0-9_]+\b)([a-zA-Z][a-zA-Z0-9_]{0,19})(\b|\r)
If you only want to accept hashtags of two or more characters, change {0,19} to {1,19}.
You can test it here
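In your pattern you use (?![0-9_]+\b), which asserts that what is directly to the right is not a run of only digits and underscores ending at a word boundary. That rules out hashtags made up entirely of digits and underscores, but it still lets a digit or underscore be the first character as long as a letter appears later.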
If you want you can use this part [a-zA-Z0-9_]{1,20} but then you have to use a positive lookahead instead (?=[a-zA-Z]) to assert what is directly to the right is an upper or lower case a-z.
(?:^|\B)#(?=[a-zA-Z])[a-zA-Z0-9_]{1,20}\b
Regex demo
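As a rough check, the positive-lookahead version and the [a-zA-Z][a-zA-Z0-9_]{0,19} version should agree. The sample string below is hypothetical and the g flag is added to collect all matches.

```javascript
// Two ways of forcing the first character after # to be a letter.
const directPattern = /(?:^|\B)#[a-zA-Z][a-zA-Z0-9_]{0,19}\b/g;
const lookaheadPattern = /(?:^|\B)#(?=[a-zA-Z])[a-zA-Z0-9_]{1,20}\b/g;

// Hypothetical sample input.
const tagSample = "#hello #1abc #a1 foo#bar #_x #Tag_2";

const directTags = tagSample.match(directPattern);
const lookaheadTags = tagSample.match(lookaheadPattern);

console.log(directTags);    // [ "#hello", "#a1", "#Tag_2" ]
console.log(lookaheadTags); // [ "#hello", "#a1", "#Tag_2" ]
```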

Regex match string until whitespace Javascript

I want to be able to match the following examples:
www.example.com
http://example.com
https://example.com
I have the following regex which does NOT match www. but will match http:// https://. I need to match any prefix in the examples above and up until the next white space thus the entire URL.
var regx = ((\s))(http?:\/\/)|(https?:\/\/)|(www\.)(?=\s{1});
Let's say I have a string that looks like the following:
I have found a lot of help off www.stackoverflow.com and the people on there!
I want to run the matching on that string and get
www.stackoverflow.com
Thanks!
You can try
(?:www|https?)[^\s]+
Here is online demo
sample code:
var str="I have found a lot of help off www.stackoverflow.com and the people on there!";
var found=str.match(/(?:www|https?)[^\s]+/gi);
alert(found);
Pattern explanation:
(?: group, but do not capture:
www 'www'
| OR
http 'http'
s? 's' (optional)
) end of grouping
[^\s]+ any character except: whitespace
(\n, \r, \t, \f, and " ") (1 or more times)
You have an error in your regex.
Use this:
((\s))(http?:\/\/)|(https?:\/\/)|(www\.)(?!\s{1})
^--- Change to negative lookaround
Btw, I think you can use:
(?:(http?:\/\/)|(https?:\/\/)|(www\.))(?!\s{1})
MATCH 1
3. [0-4] `www.`
MATCH 2
1. [16-23] `http://`
MATCH 3
2. [35-43] `https://`
Not quite sure what you're trying to do, but this should match any group of non-space characters not immediately preceded with "www." case insensitive.
/(https?:\/\/)?(?<!(www\.))[^\s]*/i
... [edit] but you did want to match www.
/(https?:\/\/)?([^\s\.]{2,}\.?)+/i
First things first, to match any non-whitespace char, use the \S construct (in POSIX, you would use [^[:space:]], but JavaScript regex is not POSIX compliant). Here are some common patterns with \S:
\S* - zero or more non-whitespace chars
\S+ - one or more non-whitespace chars
Matching any text until first whitespace can mean match any zero or more chars other than whitespace, so, the answer to the current OP problem is
(?:www|https?)\S*
// ^^^
See the regex demo. This pattern will match up to the first whitespace or end of string. If there must be a whitespace char on the right use
(?:www|https?)\S*(?=\s)
The (?=\s) positive lookahead requires a whitespace immediately to the right of the current location.
Whenever there is a need to match until the last whitespace, you could match any zero or more chars that are followed by a whitespace (\s) pattern:
/(?:www|https?)[\w\W]*(?=\s)/
/(?:www|https?)[^]*(?=\s)/
// Or even (for ECMAScript 2018+):
/(?:www|https?).*(?=\s)/s
The [\w\W], [^] and . with s flag match any char including line break chars.
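The difference between these variants can be seen in a short sketch. The first input is the sentence from the question; the second ("see www.example.com") is a made-up string that ends right after the URL.

```javascript
const urlText =
  "I have found a lot of help off www.stackoverflow.com and the people on there!";

// Up to the first whitespace (or end of string).
const untilFirstSpace = urlText.match(/(?:www|https?)\S*/)[0];
console.log(untilFirstSpace); // "www.stackoverflow.com"

// Up to the last whitespace: [\w\W]* is greedy and then backtracks
// until the lookahead finds a whitespace char.
const untilLastSpace = urlText.match(/(?:www|https?)[\w\W]*(?=\s)/)[0];
console.log(untilLastSpace); // "www.stackoverflow.com and the people on"

// With (?=\s) a URL at the very end of the string does not match at all.
const needsTrailingSpace = "see www.example.com".match(/(?:www|https?)\S*(?=\s)/);
console.log(needsTrailingSpace); // null
```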

regex pattern to match a type of strings

I need to match the below type of strings using a regex pattern in javascript.
E.g. /this/<one or more than one word with hyphen>/<one or more than one word with hyphen>/<one or more than one word with hyphen>/<one or more than one word with hyphen>
So this single pattern should match both these strings:
1. /this/is/single-word
2. /this/is-more-than/single/word-patterns/to-match
Only the slash (/) and the 'this' string at the beginning are constant, and the segments contain only alphabetic characters.
You can use:
\/this\/[a-zA-Z ]+\/[a-zA-Z ]+\/[a-zA-Z ]+
Working Demo
I think you want something like this maybe?
(\/this\/(\w+\s?){1,}\/\w+\/(\w+\s?)+)
break down:
\/ # divider
this # keyword
\/ # divider
( # begin section
\w+ # one or more word characters
\s? # possibly followed by a space
) # end section
{1,} # match previous section at least once, more if possible.
\/ # divider
\w+ # one or more word characters
\/ # divider
( # begin section
\w+ # one or more word characters
\s? # possible space
)+ # end section, repeated one or more times
Working example
This might be obvious, but to match each pattern as a separate result, I believe you want to place parentheses around the whole expression, like so:
(\/[a-zA-Z ]+\/[a-zA-Z ]+\/[a-zA-Z ]+\/[a-zA-Z ]+)
This makes sure that TWO results are returned, not just one big group.
Also, your question did not state that "this" would be static, as the other answers assumed... it says only the slashes are static. This should work for any text combo (no word this required).
Edit - actually looking back at your attempt, I see you used /this/ in your expression, so I assume that's why others did as well.
Demo: http://rubular.com/r/HGYp2qtmAM
Modified question samples:
/this/is/single-word
/this/is-more-than/single/word-patterns/to-match
Modified again The sections may have hyphen (no spaces) and there may be 3 or 4 sections beyond '/this/'
Modified pattern /^\/this(?:\/[a-zA-Z]+(?:-[a-zA-Z]+)*){3,4}$/
^
/this
(?:
/ [a-zA-Z]+
(?: - [a-zA-Z]+ )*
){3,4}
$
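A quick sketch of the modified pattern against the two sample paths. As written, {3,4} requires three or four segments after /this, so the first sample (which has only two) is rejected; the {2,4} variant below is an assumption added here, not part of the answer, for the case where two segments should also be accepted.

```javascript
const pathPattern = /^\/this(?:\/[a-zA-Z]+(?:-[a-zA-Z]+)*){3,4}$/;

const shortPath = "/this/is/single-word";
const longPath = "/this/is-more-than/single/word-patterns/to-match";

console.log(pathPattern.test(longPath));  // true  (four segments after /this)
console.log(pathPattern.test(shortPath)); // false (only two segments)

// Hypothetical relaxation: allow two to four segments.
const relaxedPattern = /^\/this(?:\/[a-zA-Z]+(?:-[a-zA-Z]+)*){2,4}$/;
console.log(relaxedPattern.test(shortPath)); // true
```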

Regex, replace all words starting with #

I have this regular expression that puts all the words that start with # into span tags.
I've accomplished what is needed but I'm not sure that I completely understand what I did here.
content.replace(/(#\S+)/gi,"<span>$1</span>")
The () means to match a whole word, right?
The # means start with #.
The \S means "followed by anything until a whitespaces" .
But how come, if I don't add the + sign after the \S, it matches only the first letter?
Any input would be appreciated .
\S is any non-whitespace character and a+ means one or more of a. So
#\S -> An # followed by one non-whitespace character.
#\S+ -> An # followed by one or more non-whitespace characters
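To see the difference in practice, here is a tiny sketch with a made-up input string; the variable names are illustrative only.

```javascript
// Hypothetical input.
const tagText = "hi #abc there";

// Without +, the group captures # and exactly one following character.
const withoutPlus = tagText.replace(/(#\S)/g, "<span>$1</span>");
console.log(withoutPlus); // "hi <span>#a</span>bc there"

// With +, the group captures # and everything up to the next whitespace.
const withPlus = tagText.replace(/(#\S+)/g, "<span>$1</span>");
console.log(withPlus); // "hi <span>#abc</span> there"
```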
Sharing code to change hashtags into links
var p = $("p");
var string = p.text();
p.html(string.replace(/#(\S+)/gi,'#$1'));
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<p>Test you code here #abc #123 #xyz</p>
content.replace(/(#\S+)/gi,"<span>$1</span>")
(#\S+) is a capturing group which captures # followed by 1 or more (+ means 1 or more) non-whitespace characters (\S is a non-whitespace character)
g means global, ie replace all instances, not just the first match
i means case insensitive
$1 fetches what was captured by the first capturing group.
So, the i is unnecessary, but won't affect anything.
/(#\S+)/gi
1st Capturing group (#\S+)
# matches the character # literally
\S+ match any non-white space character [^\r\n\t\f ]
Quantifier: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
g - all the matches not just first
i - case insensitive match
The \S means "followed by anything until a whitespaces" .
That's not what \S means. It's "any character that's not a whitespace", that is, one character that's not a whitespace.
