Normalize Special Letter Chars from Various Foreign Languages - javascript

Forgive my ignorance for not knowing the technical term for letters from foreign languages such as
ø, e.g. Helsingør
Ł, e.g. Łeczna
ı, e.g. Altınordu
ł, e.g. Głogow
How could I normalize those with JavaScript (by writing a regex), while also making the match case-insensitive?
const strArr = ['Helsingør', 'Łeczna', 'Altınordu', 'Głogow', "Népoão's"]
With the code below, I was able to replace Latin-based characters (é, ñ, ô, etc.) but not the ones above.
strArr.map(string => string.normalize('NFD')
.replaceAll(/[\u0300-\u036f,"\'"]/g, ''))
My final output should read ['Helsingor', 'Leczna', 'Altinordu', 'Glogow', 'Nepoaos']

To replace non-ASCII characters with their closest ASCII equivalents, you can use the deburr function from the lodash library.
Your final code for replacing the special characters would then be:
const _ = require('lodash');
const strArr = ['Helsingør', 'Łeczna', 'Altınordu', 'Głogow'];
const normalized = strArr.map(string => _.deburr(string));
console.log(normalized);
Output Result :
[ 'Helsingor', 'Leczna', 'Altinordu', 'Glogow' ]

As you discovered, .normalize('NFD') only helps with letters that decompose into a base letter plus combining accents (é, ñ, ô). Letters such as ø, Ł, ł and ı have no such decomposition, so they pass through unchanged. You'd need to use a library that normalizes other languages, possibly https://github.com/walling/unorm.
You can also roll your own. Here is a solution that uses .normalize('NFD') in conjunction with an iMap object that maps international characters to English ones. This iMap is short; you can expand it as needed, for example by taking the mapping from https://github.com/cvan/lunr-unicode-normalizer/blob/master/lunr.unicodeNormalizer.js. You can also override the mapping of .normalize('NFD'): for example, the umlaut ü is arguably better mapped to ue than to u.
function normalizeString(str) {
const iMap = {
'ð': 'd',
'ı': 'i',
'Ł': 'L',
'ł': 'l',
'ø': 'o',
'ß': 'ss',
'ü': 'ue'
};
const iRegex = new RegExp(Object.keys(iMap).join('|'), 'g')
return str
.replace(iRegex, (m) => iMap[m])
.normalize("NFD")
.replace(/[\u0300-\u036f]/g, '');
}
[
'Helsingør', 'Łeczna', 'Altınordu', 'Głogow',
'Áfram með smjörið', 'Crème Brulée', 'Bär Müller Straße'
].forEach(str => {
let result = normalizeString(str);
console.log(str, '=>', result);
});
Output:
Helsingør => Helsingor
Łeczna => Leczna
Altınordu => Altinordu
Głogow => Glogow
Áfram með smjörið => Afram med smjorid
Crème Brulée => Creme Brulee
Bär Müller Straße => Bar Mueller Strasse
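The question also asked for case-insensitive matching, which the function above does not address. One way to get there, as a minimal sketch, is to lower-case the string before applying the map, so that only lower-case keys are needed. The names foldForComparison and looselyEquals are hypothetical, and the map is deliberately tiny and would need extending for real data.

```javascript
// Hypothetical helpers; FOLD_MAP only covers a few letters as an illustration.
const FOLD_MAP = { 'ð': 'd', 'ı': 'i', 'ł': 'l', 'ø': 'o', 'ß': 'ss' };
const FOLD_REGEX = new RegExp(Object.keys(FOLD_MAP).join('|'), 'g');

function foldForComparison(str) {
  return str
    .toLowerCase()                            // 'Ł' becomes 'ł', 'Ø' becomes 'ø'
    .replace(FOLD_REGEX, (m) => FOLD_MAP[m])  // letters without a decomposition
    .normalize('NFD')                         // split base letters from accents
    .replace(/[\u0300-\u036f]/g, '');         // drop the combining accents
}

function looselyEquals(a, b) {
  return foldForComparison(a) === foldForComparison(b);
}

console.log(looselyEquals('ŁECZNA', 'leczna'));
console.log(looselyEquals('Helsingør', 'HELSINGOR'));
```

Lower-casing first keeps the map half the size, at the cost of losing the original casing, so this is suited to comparison or search keys rather than display text.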


Get multiple regex match result all in one match by Javascript

I am trying to make my code look more professional by removing duplicate code. I need to extract some data from a string: specifically, the NUMBER, X, Y, Z, A, B, etc. values. The regular expression is different for each variable, so I end up repeating myself and writing a lot of duplicate code.
let TextString = `DRILL(NUMBER:=20,NAME:='4',PN:=1,X:=10.1,Y:=73.344,Z:=0,A:=-1.435,B:=1.045,M1:=1,M2:=2,M3:=3,M4:=4,M5:=1,S1:=10.5,S2:=2.1,S3:=1.2,S4:=2,S5:=2.4,RS1:=1,RS2:=2);`;
const regNumber = /(?<=NUMBER:=)[0-9]+/gm;
let lineNumber = Number(TextString.match(regNumber));
const regX = /(?<=X:=)(-?[0-9]+)(.[0-9]+)?/gm;
let X = Number(TextString.match(regX)).toFixed(1);
const regY = /(?<=Y:=)(-?[0-9]+)(.[0-9]+)?/gm;
let Y = Number(TextString.match(regY)).toFixed(1);
const regZ = /(?<=Z:=)(-?[0-9]+)(.[0-9]+)?/gm;
let Z = Number(TextString.match(regZ)).toFixed(1);
const regA = /(?<=A:=)(-?[0-9]+)(.[0-9]+)?/gm;
let A = Number(TextString.match(regA)).toFixed(1);
const regB = /(?<=B:=)(-?[0-9]+)(.[0-9]+)?/gm;
let B = Number(TextString.match(regB)).toFixed(1);
// and many more duplicate code.
console.log(lineNumber, X, Y, Z, A, B);
I could only think of a way like the above, to match each variable individually and run .match() multiple times, but as you can see there are 17 variables total and in real situations, there are hundreds of these TextString. I was worried that this matching process will have a huge impact on performance.
Are there any other ways to fetch all variables in one match and store them in an array or object? or any other elegant way of doing this?
Every coordinate has a single-letter identifier, so you can use a more general positive lookbehind, (?<=,[A-Z]:=). This lookbehind matches a comma followed by a single uppercase letter and then the := assignment symbol.
You can then use .match() to get all matches and use .map() to run the conversion you were doing.
let TextString = `DRILL(NUMBER:=20,NAME:='4',PN:=1,X:=10.1,Y:=73.344,Z:=0,A:=-1.435,B:=1.045,M1:=1,M2:=2,M3:=3,M4:=4,M5:=1,S1:=10.5,S2:=2.1,S3:=1.2,S4:=2,S5:=2.4,RS1:=1,RS2:=2);`;
const regNumber = /(?<=NUMBER:=)[0-9]+/gm;
let lineNumber = Number(TextString.match(regNumber));
const regex = /(?<=,[A-Z]:=)(-?[0-9]+)(\.[0-9]+)?/gm;
let coord = TextString.match(regex).map(n => Number(n).toFixed(1));
console.log(lineNumber, coord);
You could write a single pattern:
(?<=\b(?:NUMBER|[XYZAB]):=)-?\d+(?:\.\d+)?\b
Explanation
(?<= Positive lookbehind, assert that to the left of the current position is
\b(?:NUMBER|[XYZAB]):= Match either NUMBER or one of X Y Z A B preceded by a word boundary and followed by :=
) Close the lookbehind
-? Match an optional -
\d+(?:\.\d+)? Match 1+ digits and an optional decimal part
\b A word boundary to prevent a partial word match
const TextString = `DRILL(NUMBER:=20,NAME:='4',PN:=1,X:=10.1,Y:=73.344,Z:=0,A:=-1.435,B:=1.045,M1:=1,M2:=2,M3:=3,M4:=4,M5:=1,S1:=10.5,S2:=2.1,S3:=1.2,S4:=2,S5:=2.4,RS1:=1,RS2:=2);`;
const regNumber = /(?<=\b(?:NUMBER|[XYZAB]):=)-?\d+(?:\.\d+)?\b/g;
const result = TextString
.match(regNumber)
.map(s =>
Number(s).toFixed(1)
);
console.log(result);
One possible approach could be based on a regex pattern which utilizes capturing groups. The matching regex for the OP's sample text would look like this ...
/\b(NUMBER|[XYZAB])\:=([^,]+),/g
The pattern is both simple and generic. The latter is because it always captures both the matching key, like NUMBER, and its related value, like 20, so it doesn't matter where a key-value pair occurs within a drill-data string.
To assign all of the OP's variables at once with an object-based destructuring assignment, the post-processing step reduces the result array of matchAll into an object that holds all the captured keys and values. Within this step one can also control how the values are computed and whether or how the keys get sanitized.
const regXDrillData = /\b(NUMBER|[XYZAB])\:=([^,]+),/g;
const textString =
`DRILL(NUMBER:=20,NAME:='4',PN:=1,X:=10.1,Y:=73.344,Z:=0,A:=-1.435,B:=1.045,M1:=1,M2:=2,M3:=3,M4:=4,M5:=1,S1:=10.5,S2:=2.1,S3:=1.2,S4:=2,S5:=2.4,RS1:=1,RS2:=2);`;
// - processed values via reducing the captured
// groups of a `matchAll` result array of a
// generic drill-data match-pattern.
const {
number: lineNumber,
x, y, z,
a, b,
} = [...textString.matchAll(regXDrillData)]
.reduce((result, [match, key, value]) => {
value = Number(value);
value = (key !== 'NUMBER') ? value.toFixed(1) : value;
return Object.assign(result, { [ key.toLowerCase() ]: value });
}, {})
console.log(
`processed values via reducing the captured
groups of a 'matchAll' result array of a
generic drill-data match-pattern ...`,
{ lineNumber, x, y, z, a, b },
);
Every value matches the pattern :=[value], (or :=[value]) for the last one), so here is my regex:
(?<=:=)-?[\d\w.']+(?=[,)])
Positive Lookbehind (?<=:=) look for match behind :=
-? match - optional (for negative number)
[\d\w.']+: match digit, word character, ., '
Positive Lookahead (?=[,)]) look for match ahead character , or )
Now change your code to
let TextString = `DRILL(NUMBER:=20,NAME:='4',PN:=1,X:=10.1,Y:=73.344,Z:=0,A:=-1.435,B:=1.045,M1:=1,M2:=2,M3:=3,M4:=4,M5:=1,S1:=10.5,S2:=2.1,S3:=1.2,S4:=2,S5:=2.4,RS1:=1,RS2:=2);`;
const regexPattern= /(?<=:=)-?[\d\w.']+(?=[,)])/g;
console.log(TextString.match(regexPattern))
// ['20', "'4'", '1', '10.1', '73.344', '0', '-1.435', '1.045', '1', '2', '3', '4', '1', '10.5', '2.1', '1.2', '2', '2.4', '1', '2']
Edit
I just realized that the positive lookahead is unnecessary, as @Peter Seliger mentioned:
(?<=:=)-?[\d\w.']+
Change your regex pattern to
const regexPattern= /(?<=:=)-?[\d\w.']+/g;
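A flat array of values loses track of which value belongs to which key. As a rough sketch of one way around that, the key can be captured alongside the value in the same single pass and the pairs collected into an object. The function name parseDrill and the rule "treat anything that parses as a number as a number" are assumptions made for this illustration, not part of the answer above.

```javascript
// One pass over the string: capture the key before ':=' and the raw value after it.
const KEY_VALUE = /(\w+):=([^,)]+)/g;

// Hypothetical helper: returns an object such as { NUMBER: 20, NAME: "'4'", X: 10.1, ... }.
function parseDrill(text) {
  const entries = Array.from(text.matchAll(KEY_VALUE), ([, key, raw]) => {
    const num = Number(raw);
    // Keep non-numeric values (such as the quoted NAME) as the raw string.
    return [key, Number.isNaN(num) ? raw : num];
  });
  return Object.fromEntries(entries);
}

const drill = `DRILL(NUMBER:=20,NAME:='4',PN:=1,X:=10.1,Y:=73.344,Z:=0,A:=-1.435,B:=1.045,M1:=1,M2:=2,M3:=3,M4:=4,M5:=1,S1:=10.5,S2:=2.1,S3:=1.2,S4:=2,S5:=2.4,RS1:=1,RS2:=2);`;
const { NUMBER, X, Y, Z, A, B } = parseDrill(drill);
console.log(NUMBER, X, Y, Z, A, B);
```

Rounding with toFixed(1) is left out on purpose so the parsed numbers stay numbers; it can be applied at the point where the values are displayed.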
Here is a solution using a .reduce() on keys of interest and returns an object:
const TextString = `DRILL(NUMBER:=20,NAME:='4',PN:=1,X:=10.1,Y:=73.344,Z:=0,A:=-1.435,B:=1.045,M1:=1,M2:=2,M3:=3,M4:=4,M5:=1,S1:=10.5,S2:=2.1,S3:=1.2,S4:=2,S5:=2.4,RS1:=1,RS2:=2);`;
const keys = [ 'NUMBER', 'X', 'Y', 'Z', 'A', 'B' ];
let result = keys.reduce((obj, key) => {
const regex = new RegExp('(?<=\\b' + key + ':=)-?[0-9.]+');
obj[key] = Number(TextString.match(regex)).toFixed(1);
return obj;
}, {});
console.log(result);
Output:
{
"NUMBER": "20.0",
"X": "10.1",
"Y": "73.3",
"Z": "0.0",
"A": "-1.4",
"B": "1.0"
}
Notes:
The regex is built dynamically from the key
A \b word boundary is added to the regex to reduce the chance of unintended matches
If you need the line number as an integer you could take that out of the keys, and handle it separately.

split on a multi character delimiter using regex expression

I am using this code to split text into segments:
let value = "(germany or croatia) and (denmark or hungary)"
let tokens = value.split(/(?=[()\s])|(?<=[()\s])/g).filter(segment => segment.trim() != '');
This produces the following array:
['(', 'germany', 'or', 'croatia', ')', 'and', '(', 'denmark', 'or', 'hungary', ')']
How should I rewrite regex so that it would be able to split this string:
(germany*or*croatia)*and*(denmark*or*hungary)
into
['(', 'germany', '*or*', 'croatia', ')', '*and*', '(', 'denmark', '*or*', 'hungary', ')']
The problem is that the split parameters *or* and *and* consist of multiple characters, and just using
let tokens = value.split(/(?=[()\s\*or\*\*and\*])|(?<=[()\s\*or\*\*and\*])/g).filter(segment => segment.trim() != '');
will not work.
Instead of matching the space between tokens, you will have an easier time trying to match the tokens themselves.
const tokenizer = /\(|\)|\*[^()*\s]+\*|[^()*\s]+/g;
const value = "(germany*or*croatia)*and*(denmark*or*hungary)";
const tokens = value.matchAll(tokenizer);
console.log(Array.from(tokens, match => match[0]));
Note that this isn't very robust, as any unexpected token is just ignored silently. It's also very generic; you might have an easier time specifically looking for the list of allowed operators, like *or* and *and*, instead of producing a token for any *something* found.
If you want to validate each token further, you can wrap each one in a capture group, and add a capture group for whitespace and for any unexpected leftovers. Keep in mind the order of | alternatives in a regex matters!
const tokenizer = /(\()|(\))|(\*[^()*\s]+\*)|([^()*\s]+)|(\s+)|(.+)/g;
const symbols = [
Symbol("open-paren"),
Symbol("close-paren"),
Symbol("*operator*"),
Symbol("term"),
Symbol("whitespace"),
Symbol("unexpected"),
];
const value = "(germany*or*croatia)*and*(denmark or hungary)***";
const matches = value.matchAll(tokenizer);
for (const match of matches) {
const str = match[0];
const groups = match.slice(1);
const tokenType = symbols[groups.findIndex(capture => capture !== undefined)];
console.log([tokenType.toString(), str]);
}
This kind of tokenizing regex was inspired by one of Douglas Crockford's books, where he uses something similar to transpile his own programming language into JS -- I forget which one.
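The two caveats above (unexpected input being ignored, and any *something* counting as an operator) could be addressed along these lines. This is only a sketch: the function name tokenize, the fixed operator list (or, and) and the choice to throw a SyntaxError are assumptions for illustration. It uses a sticky (y flag) regex so that every character has to be accounted for.

```javascript
// Hypothetical strict tokenizer: only *or* and *and* are accepted as operators.
function tokenize(input) {
  // Sticky regex: each match must start exactly at lastIndex, so nothing is skipped.
  const token = /\s*(\(|\)|\*(?:or|and)\*|[^()*\s]+)\s*/y;
  const text = input.trim();
  const tokens = [];
  let pos = 0;
  while (pos < text.length) {
    token.lastIndex = pos;
    const match = token.exec(text);
    if (match === null) {
      throw new SyntaxError(`Unexpected input at index ${pos}: ${text.slice(pos)}`);
    }
    tokens.push(match[1]);
    pos = token.lastIndex;
  }
  return tokens;
}

console.log(tokenize('(germany*or*croatia)*and*(denmark*or*hungary)'));
```

Failing loudly on leftovers such as a trailing *** tends to be easier to debug than silently dropping them, though whether that is desirable depends on where the input comes from.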
Could you try this expression?
/(\()|(\))|(\*or\*)|(\*and\*)|[\w]*/g

RegExp to match lonely or single uppercase letters

I am trying to create a pipe in angular, but I cannot seem to wrap my head around regular expressions.
I am trying to match only lonely or single uppercase letters and split them.
Let us say for instance I have the following:
thisIsAnApple should return [this, Is, An, Apple]
thisIsAString should return [this, Is, AString]
BB88GGFR should return [BB88GGFR]
So the plan is to match a capital letter if it is not accompanied by another capital letter.
This is what I have come up with:
const regExr = new RegExp(/(?=[A-Z])(?![A-Z]{2,})/g);
let split = string.split(regExr);
You may use
string.split(/(?<=[a-z])(?=[A-Z])/)
Or, to support all Unicode letters:
string.split(/(?<=\p{Ll})(?=\p{Lu})/u)
Basically, you match an empty string between a lowercase and an uppercase letter.
JS demo:
const strings = ['thisIsAnApple', 'thisIsAString', 'BB88GGFR'];
const regex = /(?<=[a-z])(?=[A-Z])/;
for (let s of strings) {
console.log(s, '=>', s.split(regex));
}
JS demo #2:
const strings = ['thisIsĄnApple', 'этоСТрокаТакая', 'BB88GGFR'];
const regex = /(?<=\p{Ll})(?=\p{Lu})/u;
for (let s of strings) {
console.log(s, '=>', s.split(regex));
}
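Both patterns above rely on lookbehind, which some older JavaScript engines do not support. As a possible fallback, sketched under the assumption that the input never contains the NUL character (U+0000), a separator can be inserted between a lowercase and an uppercase letter and the string split on it. The function name splitAtCaseBoundary is hypothetical, and this version only handles ASCII letters.

```javascript
// Assumed never to occur in the input; used as a temporary separator.
const SEP = '\u0000';

// Hypothetical lookbehind-free variant of split(/(?<=[a-z])(?=[A-Z])/).
function splitAtCaseBoundary(input) {
  return input.replace(/([a-z])([A-Z])/g, `$1${SEP}$2`).split(SEP);
}

for (const s of ['thisIsAnApple', 'thisIsAString', 'BB88GGFR']) {
  console.log(s, '=>', splitAtCaseBoundary(s));
}
```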

Split sentence into words according to its language correctly with JavaScript [duplicate]

How do you convert a string to a character array in JavaScript?
I'm thinking getting a string like "Hello world!" to the array
['H','e','l','l','o',' ','w','o','r','l','d','!']
Note: This is not Unicode compliant. "I💖U".split('') results in the
four-element array ["I", "�", "�", "U"], which can lead to dangerous
bugs. See the answers below for safe alternatives.
Just split it by an empty string.
var output = "Hello world!".split('');
console.log(output);
See the String.prototype.split() MDN docs.
As hippietrail suggests, meder's answer can break
surrogate pairs and misinterpret “characters.” For example:
// DO NOT USE THIS!
const a = '𝟘𝟙𝟚𝟛'.split('');
console.log(a);
// Output: ["�","�","�","�","�","�","�","�"]
I suggest using one of the following ES2015 features to correctly handle these
character sequences.
Spread syntax (already answered by insertusernamehere)
const a = [...'𝟘𝟙𝟚𝟛'];
console.log(a);
Array.from
const a = Array.from('𝟘𝟙𝟚𝟛');
console.log(a);
RegExp u flag
const a = '𝟘𝟙𝟚𝟛'.split(/(?=[\s\S])/u);
console.log(a);
Use /(?=[\s\S])/u instead of /(?=.)/u because . does not match
newlines. If you are still in ES5.1 era (or if your browser doesn't
handle this regex correctly - like Edge), you can use the following alternative
(transpiled by Babel). Note, that Babel tries to also handle unmatched
surrogates correctly. However, this doesn't seem to work for unmatched low
surrogates.
const a = '𝟘𝟙𝟚𝟛'.split(/(?=(?:[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]))/);
console.log(a);
for...of loop (already answered by Mark Amery)
const s = '𝟘𝟙𝟚𝟛';
const a = [];
for (const s2 of s) {
a.push(s2);
}
console.log(a);
The spread Syntax
You can use the spread syntax, an Array Initializer introduced in ECMAScript 2015 (ES6) standard:
var arr = [...str];
Examples
function a() {
return arguments;
}
var str = 'Hello World';
var arr1 = [...str],
arr2 = [...'Hello World'],
arr3 = new Array(...str),
arr4 = a(...str);
console.log(arr1, arr2, arr3, arr4);
The first three result in:
["H", "e", "l", "l", "o", " ", "W", "o", "r", "l", "d"]
The last one results in
{0: "H", 1: "e", 2: "l", 3: "l", 4: "o", 5: " ", 6: "W", 7: "o", 8: "r", 9: "l", 10: "d"}
Browser Support
Check the ECMAScript ES6 compatibility table.
Further reading
MDN: Spread operator
ECMAScript 2015 (ES6): 12.2.5 Array Initializer
Spread is also referred to as "splat" (e.g. in PHP or Ruby) or as "scatter" (e.g. in Python).
You can also use Array.from.
var m = "Hello world!";
console.log(Array.from(m))
This method has been introduced in ES6.
Reference
Array.from
There are (at least) three different things you might conceive of as a "character", and consequently, three different categories of approach you might want to use.
Splitting into UTF-16 code units
JavaScript strings were originally invented as sequences of UTF-16 code units, back at a point in history when there was a one-to-one relationship between UTF-16 code units and Unicode code points. The .length property of a string measures its length in UTF-16 code units, and when you do someString[i] you get the ith UTF-16 code unit of someString.
Consequently, you can get an array of UTF-16 code units from a string by using a C-style for-loop with an index variable...
const yourString = 'Hello, World!';
const charArray = [];
for (let i = 0; i < yourString.length; i++) {
charArray.push(yourString[i]);
}
console.log(charArray);
There are also various short ways to achieve the same thing, like using .split() with the empty string as a separator:
const charArray = 'Hello, World!'.split('');
console.log(charArray);
However, if your string contains code points that are made up of multiple UTF-16 code units, this will split them into individual code units, which may not be what you want. For instance, the string '𝟘𝟙𝟚𝟛' is made up of four unicode code points (code points 0x1D7D8 through 0x1D7DB) which, in UTF-16, are each made up of two UTF-16 code units. If we split that string using the methods above, we'll get an array of eight code units:
const yourString = '𝟘𝟙𝟚𝟛';
console.log('First code unit:', yourString[0]);
const charArray = yourString.split('');
console.log('charArray:', charArray);
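To make the arithmetic concrete, the following sketch prints the individual code units of the same string in hexadecimal. The string is written with \u{...} escapes so that it does not depend on how this page renders the characters; the variable names are only illustrative.

```javascript
// The same four characters as '𝟘𝟙𝟚𝟛', spelled out as code point escapes.
const digits = '\u{1D7D8}\u{1D7D9}\u{1D7DA}\u{1D7DB}';

// .length and .split('') both count UTF-16 code units, not code points.
const codeUnits = digits.split('');
const hex = codeUnits.map((unit) => unit.charCodeAt(0).toString(16));

console.log(digits.length);                      // number of code units
console.log(hex);                                // surrogate halves, e.g. 'd835', 'dfd8', ...
console.log(digits.codePointAt(0).toString(16)); // the full first code point
```

Each character shows up as a high surrogate (d835) followed by a low surrogate, which is why the split array has eight entries instead of four.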
Splitting into Unicode Code Points
So, perhaps we want to instead split our string into Unicode Code Points! That's been possible since ECMAScript 2015 added the concept of an iterable to the language. Strings are now iterables, and when you iterate over them (e.g. with a for...of loop), you get Unicode code points, not UTF-16 code units:
const yourString = '𝟘𝟙𝟚𝟛';
const charArray = [];
for (const char of yourString) {
charArray.push(char);
}
console.log(charArray);
We can shorten this using Array.from, which iterates over the iterable it's passed implicitly:
const yourString = '𝟘𝟙𝟚𝟛';
const charArray = Array.from(yourString);
console.log(charArray);
However, unicode code points are not the largest possible thing that could possibly be considered a "character" either. Some examples of things that could reasonably be considered a single "character" but be made up of multiple code points include:
Accented characters, if the accent is applied with a combining code point
Flags
Some emojis
We can see below that if we try to convert a string with such characters into an array via the iteration mechanism above, the characters end up broken up in the resulting array. (In case any of the characters don't render on your system, yourString below consists of a capital A with an acute accent, followed by the flag of the United Kingdom, followed by a black woman.)
const yourString = 'Á🇬🇧👩🏿';
const charArray = Array.from(yourString);
console.log(charArray);
If we want to keep each of these as a single item in our final array, then we need an array of graphemes, not code points.
Splitting into graphemes
JavaScript has no built-in support for this - at least not yet. So we need a library that understands and implements the Unicode rules for what combination of code points constitute a grapheme. Fortunately, one exists: orling's grapheme-splitter. You'll want to install it with npm or, if you're not using npm, download the index.js file and serve it with a <script> tag. For this demo, I'll load it from jsDelivr.
grapheme-splitter gives us a GraphemeSplitter class with three methods: splitGraphemes, iterateGraphemes, and countGraphemes. Naturally, we want splitGraphemes:
const splitter = new GraphemeSplitter();
const yourString = 'Á🇬🇧👩🏿';
const charArray = splitter.splitGraphemes(yourString);
console.log(charArray);
<script src="https://cdn.jsdelivr.net/npm/grapheme-splitter@1.0.4/index.js"></script>
And there we are - an array of three graphemes, which is probably what you wanted.
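Since this answer was written, Intl.Segmenter has become available in current browsers and in recent Node.js versions, so a library may no longer be necessary. The following is a minimal sketch assuming such an environment; the helper name splitGraphemes and the default locale 'en' are choices made for this example. The sample string is the same as above, written with escapes (A plus a combining acute accent, the two regional indicators for the UK flag, and a woman emoji with a skin tone modifier).

```javascript
// Hypothetical helper built on the standard Intl.Segmenter API.
function splitGraphemes(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  // segment() returns an iterable of { segment, index, input } objects.
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

const sample = 'A\u0301\u{1F1EC}\u{1F1E7}\u{1F469}\u{1F3FF}';
console.log(Array.from(sample).length);     // code points
console.log(splitGraphemes(sample).length); // graphemes
```

Where Intl.Segmenter is not available, the library approach described above remains an option.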
This is an old question but I came across another solution not yet listed.
You can use the Object.assign function to get the desired output:
var output = Object.assign([], "Hello, world!");
console.log(output);
// [ 'H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!' ]
Not necessarily right or wrong, just another option.
Object.assign is described well at the MDN site.
It already is:
var mystring = 'foobar';
console.log(mystring[0]); // Outputs 'f'
console.log(mystring[3]); // Outputs 'b'
Or, for a version that is friendlier to older browsers, use:
var mystring = 'foobar';
console.log(mystring.charAt(3)); // Outputs 'b'
The ES6 way to split a string into an array character-wise is by using the spread operator. It is simple and nice.
array = [...myString];
Example:
let myString = "Hello world!";
const array = [...myString];
console.log(array);
// another example:
console.log([..."another splitted text"]);
4 Ways you can convert a String to character Array in JavaScript :
const string = 'word';
// Option 1
string.split(''); // ['w', 'o', 'r', 'd']
// Option 2
[...string]; // ['w', 'o', 'r', 'd']
// Option 3
Array.from(string); // ['w', 'o', 'r', 'd']
// Option 4
Object.assign([], string); // ['w', 'o', 'r', 'd']
You can iterate over the length of the string and push the character at each position:
const str = 'Hello World';
const stringToArray = (text) => {
var chars = [];
for (var i = 0; i < text.length; i++) {
chars.push(text[i]);
}
return chars
}
console.log(stringToArray(str))
simple answer:
let str = 'this is string, length is >26';
console.log([...str]);
Array.prototype.slice will do the work as well.
const result = Array.prototype.slice.call("Hello world!");
console.log(result);
How about this?
function stringToArray(string) {
let length = string.length;
let array = new Array(length);
while (length--) {
array[length] = string[length];
}
return array;
}
One possibility is the next:
console.log([1, 2, 3].map(e => Math.random().toString(36).slice(2)).join('').split('').map(e => Math.random() > 0.5 ? e.toUpperCase() : e).join(''));

How can I set a regex for special condition in JavaScript?

I need help writing a regex pattern for these conditions:
Limitations on Hashtag Characters
Length
You only need to add a # before a word to make it hashtag. However, because a Tweet is only limited to under 140 characters, the best hashtags are those composed of a single word or a few letters. Twitter experts recommend keeping the keyword under 6 characters.
Use only numbers and letters in your keyword. You may use an underscore but do this sparingly for aesthetic reasons. Hyphens and dashes will not work.
No Spaces
Hashtags do not support spaces. So if you're using two words, skip the space. For example, hashtags for following the US election are tagged as #USelection, not #US election.
No Special Characters
Hashtags only work with the # sign. Special characters like "!, $, %, ^, &, *, +, ." will not work. Twitter recognizes the pound sign and then converts the hashtag into a clickable link.
HashTags can start by numbers
Hashtags can be in any language
Hashtags can be emojis or symbols
I came up with the idea below, but it doesn't cover the last two conditions:
const subStr = postText.split(/(?=[\s:#,+/][a-zA-Z\d]+)(#+\w{2,})/gm);
const result = _.filter(subStr, word => word.startsWith('#')).map(hashTag => hashTag.substr(1)) || [];
EDIT:
Example: If I have:
const postText = "#hello12#123 #hi #£hihi #This is #👩 #Hyvääpäivää #Dzieńdobry #जलवायुपरिवर्तन an #example of some text with #hash-tags - http://www.example.com/#anchor but dont want the link,#hashtag1,hi #123 hfg skjdf kjsdhf jsdhf kjhsdf kjhsdf khdsf kjhsdf kjhdsf hjjhjhf kjhsdjhd kjhsdfkjhsd #lasthashtag";
Result should be:
["hello12", "123", "hi", "This", "👩", "Hyvääpäivää", "Dzieńdobry", "जलवायुपरिवर्तन", "example", "hash", "anchor", "hashtag1", "123", "lasthashtag"]
What I have now:
["hello12", "123", "hi", "This", "Hyv", "Dzie", "example", "hash", "anchor", "hashtag1", "123", "lasthashtag"]
Note: I don't want to use JavaScript library.
Thanks
Assuming the characters that are not allowed in a hashtag are !$%^&*+. (the ones you mentioned) and , (based on your example), you can use the following regex pattern:
/#[^\s!$%^&*+.,#]+/gm
Note: To exclude more characters, you can add them in the character class as I did above. Obviously, you can't rely on alphanumeric characters only because you want to support other Unicode symbols and emojis.
JavaScript code sample:
const regex = /#[^\s!$%^&*+.,#]+/gm;
const str = "#hello12#123 #hi #£hihi #This is #👩 #Hyvääpäivää #Dzieńdobry #जलवायुपरिवर्तन an #example of some text with #hash-tags - http://www.example.com/#anchor but dont want the link,#hashtag1,hi #123 hfg skjdf kjsdhf jsdhf kjhsdf kjhsdf khdsf kjhsdf kjhdsf hjjhjhf kjhsdjhd kjhsdfkjhsd #lasthashtag";
let m;
while ((m = regex.exec(str)) !== null) {
if (m.index === regex.lastIndex) {
regex.lastIndex++;
}
m.forEach((match) => {
console.log("Found match: " + match);
});
}
This is one possible solution without a while loop that worked for me. Thanks to @Ahmed Abdelhameed for the pattern:
function getHashTags(postText) {
  const regex = /#[^\s!$%^&*+.,£#]+/gm;
  const selectedHashTag = [];
  // Only words that contain '#' can hold hashtags (native filter, so no library is needed).
  const checkHashTag = postText.split(' ').filter(word => word.includes('#'));
  checkHashTag.forEach((word) => {
    const matches = word.match(regex);
    if (matches) {
      // Drop the leading '#' from every hashtag found in this word.
      matches.forEach(hashTag => selectedHashTag.push(hashTag.slice(1)));
    }
  });
  return selectedHashTag;
}
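The answers above exclude a list of forbidden characters. An alternative worth considering is to allow only what a hashtag may contain, using Unicode property escapes with the u flag. This is a sketch under assumptions: the function name extractHashtags is made up, and the allowed set (letters, combining marks, numbers, underscore and pictographic symbols) is a guess at the rules listed in the question. Emoji made of several code points, such as those with skin tone modifiers, may be cut short by this pattern.

```javascript
// Letters, marks, numbers, underscore and pictographs after a '#'.
const HASHTAG = /#([\p{L}\p{M}\p{N}_\p{Extended_Pictographic}]+)/gu;

// Hypothetical helper: returns the hashtag bodies without the leading '#'.
function extractHashtags(text) {
  return Array.from(text.matchAll(HASHTAG), (match) => match[1]);
}

const post = '#hello12#123 #hi #£hihi #\u{1F469} #Dzie\u0144dobry an #example with #hash-tags,#hashtag1';
console.log(extractHashtags(post));
```

With an allow-list, symbols such as £ or - end a hashtag without having to be enumerated, which matches the "hash" and missing "£hihi" entries in the expected result from the question.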
