Split string by HTML entities?

Split string by HTML entities? - javascript

My string contains a lot of HTML entities, like this:
&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;
And I want to split it by HTML entities into this:
Hello
everybody
there
Can anybody suggest a way to do this, please? Maybe using a regex?

It looks like you can just split on the regex &[^;]*;. That is, the delimiters are strings that start with &, end with ;, and contain anything but ; in between.
If you can have multiple delimiters in a row, and you don't want the empty strings between them, just use (?:&[^;]*;)+ (or, in general, a (?:delim)+ pattern). The group is non-capturing because JavaScript's split adds the text of capturing groups to the result.
If you can have delimiters at the beginning or end of the string, and you don't want the empty strings caused by them, then just trim them away before you split.
Example
Here's a snippet to demonstrate the above ideas (see also on ideone.com):
var s = "&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;";
console.log(s.split(/&[^;]*;/).join(","));
// ,Hello,,everybody,,there,
console.log(s.split(/(?:&[^;]*;)+/).join(","));
// ,Hello,everybody,there,
console.log(
  s.replace(/^(?:&[^;]*;)+/, "")
   .replace(/(?:&[^;]*;)+$/, "")
   .split(/(?:&[^;]*;)+/)
   .join(",")
);
// Hello,everybody,there

var a = str.split(/\&[#a-z0-9]+\;/); should do it, although you'll end up with empty slots in the array when you have two entities next to each other.
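One way the empty slots could be dropped afterwards is to filter the result. This is only a sketch: the helper name splitOnEntities is hypothetical, and the i flag is an addition so that entities containing uppercase letters (such as &Eacute;) are matched as well.

```javascript
// Hypothetical helper: split on HTML entities and drop the empty strings
// that appear when entities are adjacent or sit at either end.
function splitOnEntities(str) {
  // "&", then one or more of "#", letters or digits, then ";"
  return str.split(/&[#a-z0-9]+;/i).filter(Boolean);
}

console.log(splitOnEntities("&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;"));
// [ 'Hello', 'everybody', 'there' ]
```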

split(/&.*?;(?=[^&]|$)/)
and discard the first and last elements of the result:
["", "Hello", "everybody", "there", ""]

>> "&quot;Hello&nbsp;&lt;everybody&gt;&nbsp;there&quot;".split(/(?:&[^;]+;)+/)
['', 'Hello', 'everybody', 'there', '']
The regex is: /(?:&[^;]+;)+/
Matches entities as & followed by 1+ non-; characters, followed by a ;. Then matches at least one of those (or more) as the split delimiter. The (?:expression) non-capturing syntax is used so that the delimiters captured don't get put into the result array (split() puts capture groups into the result array if they appear in the pattern).
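The difference between the two kinds of group can be seen on a small input. This is a minimal sketch with a made-up string, not part of the original answer:

```javascript
const input = "a&amp;b";

// Capturing group: the matched delimiter is spliced into the result.
const withCapture = input.split(/(&[^;]+;)/);
console.log(withCapture); // [ 'a', '&amp;', 'b' ]

// Non-capturing group: only the pieces between delimiters are returned.
const withoutCapture = input.split(/(?:&[^;]+;)/);
console.log(withoutCapture); // [ 'a', 'b' ]
```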

Related

How can I include the delimiter with regex String.split()?

I need to parse the tokens from a GS1 UDI format string:
"(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888"
I would like to split that string with a regex on the "(nnn)" and have the delimiter included with the split values, like this:
[ "(20)987111", "(240)A", "(10)ABC123", "(17)2022-04-01", "(21)888888888888888" ]
Below is a JSFiddle with examples, but in case you want to see it right here:
// This includes the delimiter match in the results, but I want the delimiter included WITH the value
// after it, e.g.: ["(20)987111", ...]
str = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888";
console.log(str.split(/(\(\d{2,}\))/).filter(Boolean))
// Result: ["(20)", "987111", "(240)", "A", "(10)", "ABC123", "(17)", "2022-04-01", "(21)", "888888888888888"]
// If I include a pattern that should (I think) match the content following the delimiter I will
// only get a single result that is the full string:
str = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888";
console.log(str.split(/(\(\d{2,}\)\W+)/).filter(Boolean))
// Result: ["(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888"]
// I think this is because I'm effectively matching the entire string, hence a single result.
// So now I'll try to match only up to the start of the next "(":
str = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888";
console.log(str.split(/(\(\d{2,}\)(^\())/).filter(Boolean))
// Result: ["(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888"]
I've found and read this question, however the examples there are matching literals and I'm using character classes and getting different results.
I'm failing to create a regex pattern that will provide what I'm after. Here's a JSFiddle of some of the things I've tried: https://jsfiddle.net/6bogpqLy/
I can't guarantee the order of the "application identifiers" in the input string and as such, match with named captures isn't an attractive option.

You can split at the positions where a parenthesised element follows, by using a zero-length lookahead assertion:
const text = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888"
const parts = text.split(/(?=\(\d+\))/)
console.log(parts)
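Each part produced this way still starts with its "(nnn)" identifier, so a second step can separate the identifier from the value. A rough sketch, assuming every part has the shape "(digits)value"; the name parseUdi is hypothetical:

```javascript
// Hypothetical helper: split a GS1-style string into [identifier, value] pairs.
function parseUdi(text) {
  return text
    .split(/(?=\(\d+\))/) // zero-length split just before each "(nnn)"
    .map(function (part) {
      const m = part.match(/^\((\d+)\)(.*)$/);
      // Parts that do not start with "(nnn)" are kept with a null identifier.
      return m ? [m[1], m[2]] : [null, part];
    });
}

console.log(parseUdi("(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888"));
// [ [ '20', '987111' ], [ '240', 'A' ], [ '10', 'ABC123' ],
//   [ '17', '2022-04-01' ], [ '21', '888888888888888' ] ]
```

Because the pairs are produced in input order, this does not depend on the identifiers arriving in any particular sequence.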

Instead of split, use match to create the array. The pattern 1) finds digits in parentheses, followed by a run of characters that may be digits, uppercase letters, or hyphens, and then 2) groups that whole expression.
(PS. I often find a site like Regex101 really helps when it comes to testing out expressions outside of a development environment.)
const re = /(\(\d+\)[\d\-A-Z]+)/g;
const str = '(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888';
console.log(str.match(re));
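If the values may contain characters outside digits, uppercase letters and hyphens, a looser variant is to accept everything up to the next opening parenthesis. This sketch assumes that values never contain "(" themselves, which the question does not confirm:

```javascript
const udi = "(20)987111(240)A(10)ABC123(17)2022-04-01(21)888888888888888";

// "(digits)" followed by everything up to the next "(" (assumed absent from values).
// match() with the g flag returns null when nothing matches, hence the fallback.
const tokens = udi.match(/\(\d+\)[^(]+/g) || [];
console.log(tokens);
// [ '(20)987111', '(240)A', '(10)ABC123', '(17)2022-04-01', '(21)888888888888888' ]
```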

Extract numeric and text parts of a string, in varying formats

I'm trying to put together a RegEx to split a variety of possible user inputs, and while I've managed to succeed with some cases, I've not managed to cover every case that I'd like to.
Possible inputs, and expected outputs
"1 day" > [1,"day"]
"1day" > [1,"day"]
"10,000 days" > [10000,"days"]
Is it possible to split the numeric and text parts from the string without necessarily having a space, and to also remove the commas etc from the string at the same time?
This is what I've got at the moment
[a-zA-Z]+|[0-9]+
Which seems to split the numeric and text portions nicely, but is tripped up by commas. (Actually, as I write this, I'm thinking I could use the last part of the results array as the text part, and concatenate all the other parts as the numeric part?)

var test = [
  '1 day',
  '1day',
  '10,000 days',
];
console.log(test.map(function (a) {
  a = a.replace(/(\d),(\d)/g, '$1$2'); // remove the commas
  return a.match(/^(\d+)\s*(.+)$/); // split in two parts
}));

This regular expression works, apart from removing the comma from the matched number string:
([0-9,]+) *(.*)
You cannot "ignore" a character in the string returned by a regular expression match, so you will just have to remove the comma from the matched number afterwards.
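Put together, the match followed by the comma removal might look like the sketch below. The helper name parseQuantity is hypothetical, and returning null for input without a leading number is an assumption about the desired behaviour:

```javascript
// Hypothetical helper: "10,000 days" -> [10000, "days"]
function parseQuantity(input) {
  const m = input.trim().match(/^([0-9,]+)\s*(.*)$/);
  if (!m) return null; // no leading number
  const value = Number(m[1].replace(/,/g, "")); // strip the thousands separators
  return [value, m[2]];
}

console.log(parseQuantity("1 day"));       // [ 1, 'day' ]
console.log(parseQuantity("1day"));        // [ 1, 'day' ]
console.log(parseQuantity("10,000 days")); // [ 10000, 'days' ]
```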

How to write regex for this javascript string

How can I rewrite the string below
"(22.0796251, 82.13914120000004),36", "(22.744108, 77.73696700000005),48",...and so on
Like this:
(22.0796251, 82.13914120000004) 36
(22.744108, 77.73696700000005) 48
...and so on
How can I do this using a regex in JavaScript?
My attempt is this:
substring = test.split(',');
where test contains the data to be formatted. But it's wrong.

You should use the ability of split to split on a regular expression and keep parts of each match in the results. To do this, simply put a capturing group in the regexp. In your case, you will "split" on things in double quote marks, capturing the text between the quotes:
pieces = test.split(/"(.*?)"/)
//                    ^^^^^ CAPTURE GROUP
// ["", "(22.0796251, 82.13914120000004),36", ", ", "(22.744108, 77.73696700000005),48", ""]
The question mark makes the * quantifier "non-greedy", so that it doesn't eat up all the characters up through the last quote in the input.
Now get rid of the junk (empty strings and ", "):
pieces = pieces.filter(function(seg) { return !/^[, ]*$/.test(seg); })
// ["(22.0796251, 82.13914120000004),36", "(22.744108, 77.73696700000005),48"]
Next you can break down each piece with another regexp, as in
arrays = pieces.map(function(piece) { return piece.match(/(.*),(.*)/).slice(1); });
// [["(22.0796251, 82.13914120000004)", "36"], ["(22.744108, 77.73696700000005)", "48"]]
Because the first (.*) is greedy, the comma in the pattern matches the last comma in each piece, which is the one that follows the closing parenthesis. The slice gets rid of the first element of the array returned by match, which is the entire match and is not needed.
Now print out arrays, split its elements further, or do whatever else you want with it.
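The same output could also be produced in a single pass by calling exec in a loop, which is a different technique from the split, filter and map steps above. A sketch under the assumption that every quoted chunk has the shape "(x, y),n"; the names formatPairs and raw are hypothetical:

```javascript
// Hypothetical helper: turn '"(a, b),n", "(c, d),m"' into ["(a, b) n", "(c, d) m"].
function formatPairs(input) {
  const lines = [];
  // Inside each pair of quotes: capture the parenthesised pair, skip the comma,
  // then capture whatever follows up to the closing quote.
  const re = /"(\([^)]*\)),\s*([^"]*)"/g;
  let m;
  while ((m = re.exec(input)) !== null) {
    lines.push(m[1] + " " + m[2]);
  }
  return lines;
}

const raw = '"(22.0796251, 82.13914120000004),36", "(22.744108, 77.73696700000005),48"';
console.log(formatPairs(raw).join("\n"));
// (22.0796251, 82.13914120000004) 36
// (22.744108, 77.73696700000005) 48
```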

Javascript - regular expression to split string on unescaped character, e.g. | but ignore \|

I read a string from file that I split on | character. For example the string is
1|test pattern|prefix|url|postfix
So split must always give me 5 substrings, which in the above case are
["1", "test pattern", "prefix", "url", "postfix"]
The problem comes in when any of these five substrings contains | character. I would store it as escaped \|
1|test pattern|prefix|url \| title |postfix
Now, you can see that string.split('|') won't give me the desired result. The desired result is
["1", "test pattern", "prefix", "url \| title ", "postfix"]
I have tried some regular expressions but none of these gives desired result.
string.split(/[^\\]\|/) //["", "", "prefi", "$url \| $titl", " postfix"]
It looks like this is only possible with a negative lookbehind, but I could not get one to work.

Another solution:
"1|test pattern|prefix|url \\| title |postfix"
.replace(/([^\\])\|/g, "$1$1|")
.split(/[^\\]\|/);
That said, you'll need to escape your backslash in the initial string with another backslash to make it work:
"1|test pattern|prefix|url \\| title |postfix"
                           ^
Working demo available here.

At the time this was written, JavaScript did not support lookbehinds (they were added to the language in ES2018). Where they are unavailable, I see no easy solution, but the following might be suitable as a workaround:
// use two backslashes in your string literal!
var string = '1|test pattern|prefix|url \\| title |postfix';
// pick a substitute character that does not occur anywhere in the data
var sub = "-";
var parts = string.replace(/\\\|/g, sub).split(/\|/);
// replace the substitute character with the escaped pipe again in each element
parts = parts.map(function (s) { return s.split(sub).join("\\|"); });
Alternatively you could use something like this:
string.split(/\|\b/)
However this might fail in some circumstances when there are whitespaces involved.
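In engines that implement ES2018 lookbehind assertions (an assumption about the runtime; older engines throw a SyntaxError on this pattern), the split can be written directly. The sketch also assumes that a backslash is only ever used to escape a pipe, so a field ending in a literal backslash would be misread:

```javascript
const line = "1|test pattern|prefix|url \\| title |postfix";

// Split on "|" only when it is not directly preceded by a backslash.
const fields = line.split(/(?<!\\)\|/);
console.log(fields);
// [ '1', 'test pattern', 'prefix', 'url \\| title ', 'postfix' ]
```

The escaped pipe is left in place, backslash included, which matches the desired result in the question.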

Instead of using split() you could match all occurrences that you're interested in:
var rx = /([^\\\|]|\\\|?)+/gi, item, items = [];
while (item = rx.exec(str)) {
items.push(item[0]);
}
See it in action in the Fiddle

'foo|bar\\|baz'.match(/(\\\||[^|])+/g)
This finds all sequences of characters that comprise the escaped splitting character or any character that isn't the splitting character.
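Wrapped in a function, the same pattern could also turn each escaped pipe back into a plain one. This is a sketch with a hypothetical name, splitEscaped. Note that because match only returns non-empty sequences, empty fields (as in "a||b") are silently dropped:

```javascript
// Hypothetical helper: split on unescaped "|" and turn "\|" back into "|".
function splitEscaped(str) {
  // Either an escaped pipe or any single character that is not a pipe, repeated.
  const parts = str.match(/(\\\||[^|])+/g) || [];
  return parts.map(function (part) {
    return part.replace(/\\\|/g, "|");
  });
}

console.log(splitEscaped("1|test pattern|prefix|url \\| title |postfix"));
// [ '1', 'test pattern', 'prefix', 'url | title ', 'postfix' ]
```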

What does this JS do?

var passwordArray = pwd.replace(/\s+/g, '').split(/\s*/);
I found the above line of code in a rather poorly documented JavaScript file, and I don't know exactly what it does. I think it splits a string into an array of characters, similar to PHP's str_split. Am I correct, and if so, is there a better way of doing this?

It removes any whitespace from the password and then splits the password into an array of characters.
It is a bit redundant to convert a string into an array of characters, because you can already access the characters of a string through brackets (though not in older IE :( ) or through the string method "charAt":
var a = "abcdefg";
alert(a[3]);//"d"
alert(a.charAt(1));//"b"

It does almost the same as: pwd.split(/\s*/).
pwd.replace(/\s+/g, '').split(/\s*/) removes all whitespace (tab, space, CR, LF, etc.) and splits the remainder (the string that is returned from the replace operation) into an array of characters. The split(/\s*/) portion is strange and redundant, because there shouldn't be any whitespace (\s) left in the string by then.
Hence pwd.split(/\s*/) should be sufficient, as long as pwd does not start or end with whitespace (otherwise split alone leaves an empty string at that end of the array). So:
'hello cruel\nworld\t how are you?'.split(/\s*/)
// prints in alert: h,e,l,l,o,c,r,u,e,l,w,o,r,l,d,h,o,w,a,r,e,y,o,u,?
as will
'hello cruel\nworld\t how are you?'.replace(/\s+/g, '').split(/\s*/)

The replace portion removes all whitespace from the password. The \s+ atom matches a non-zero-length run of whitespace. The 'g' flag makes it match all instances of whitespace, and they are all replaced with an empty string.
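As for a simpler way of getting the same array, one option is to split on the empty string once the whitespace is gone. A small sketch with a made-up password value; note that split("") works on UTF-16 code units, so characters outside the Basic Multilingual Plane would be cut in two:

```javascript
const pwd = " hello cruel\nworld\t ";

// Remove all whitespace, then split into single characters.
const chars = pwd.replace(/\s+/g, "").split("");
console.log(chars.join(","));
// h,e,l,l,o,c,r,u,e,l,w,o,r,l,d

// Without the replace step, leading and trailing whitespace
// leave an empty string at each end of the array.
const naive = pwd.split(/\s*/);
console.log(naive[0] === "", naive[naive.length - 1] === "");
// true true
```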
