Splitting end of text at period creates empty string - javascript

Given the following text
var text="unicorns! and rainbows? and, cupcakes.Hello this is splitting by sentences. However, I am not sure.";
I want to split at every period. Because the text ends with a period, the split produces an empty string as the last element, as shown:
(4) ["unicorns! and rainbows? and, cupcakes", "Hello this is splitting by sentences", " However, I am not sure", ""]
What is a good way to split on . while accounting for the period at the end of the text?

You can use .filter(Boolean) to strip out any empty strings, like so:
var text="unicorns! and rainbows? and, cupcakes.Hello this is splitting by sentences. However, I am not sure.";
var splitText = text.split(".");
var nonEmpty = splitText.filter(Boolean);
// var condensed = text.split(".").filter(Boolean);
console.log(nonEmpty);
It may seem like a strange way to do it, but it's easy/efficient, and the concept works like this:
var arr = ["foo", "bar", "", "baz", ""];
var nonEmpty = arr.filter(function (str) {
return Boolean(str);
});
This uses the power of coercion to determine whether the string is empty or not. The only value of a string that will coerce to false is, in fact, an empty string "". All other string values will coerce to true. That is why we can use the Boolean constructor to check whether a string is empty or not.
Additionally, if you want to trim the leading/trailing whitespace off of each sentence, you can use the .trim() method, like so:
var text="unicorns! and rainbows? and, cupcakes.Hello this is splitting by sentences. However, I am not sure.";
var nonEmpty = text.split(".").filter(Boolean).map(str => str.trim());
console.log(nonEmpty);
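One caveat worth illustrating: a string that contains only whitespace is still truthy, so filter(Boolean) keeps it. The sketch below (the helper name splitSentences and the sample text are made up for illustration) trims before filtering so that a trailing space after the last period does not survive as an element:

```javascript
// Hypothetical helper, not part of the original answer.
function splitSentences(text) {
  return text
    .split(".")
    .map((part) => part.trim()) // trim first, so " " becomes ""
    .filter(Boolean);           // then drop the empty strings
}

const sample = "One. Two. "; // note the space after the final period

console.log(sample.split(".").filter(Boolean)); // the whitespace-only piece is kept
console.log(splitSentences(sample));            // only real sentences remain
```

Swapping the order of .map and .filter is the whole point here; everything else is the same as the answer above.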

That's how String#split works (and it's kind of logical that it does). There is nothing after the last . in the string, so the final element is an empty string. If you want to get rid of the empty strings in the array, you can filter them out using Array#filter (with an arrow function to keep it simple):
var result = text.split(".").filter(s => s); // an empty string is falsy so it will be excluded
Or use String#match with a simple regex in one go like:
var result = text.match(/[^.]+/g); // matches any sequence of characters that are not a '.'
Example:
var text="unicorns! and rainbows? and, cupcakes.Hello this is splitting by sentences. However, I am not sure.";
var resultFilter = text.split(".").filter(x => x);
var resultMatch = text.match(/[^.]+/g);
console.log("With filter:", resultFilter);
console.log("With match:", resultMatch);
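The two approaches differ when nothing matches: split followed by filter always returns an array, while String#match with the g flag returns null. A small sketch of that difference, using a made-up input of only periods and a ?? fallback as one possible guard:

```javascript
const onlyDots = "...";

// split + filter: always an array, here an empty one
const viaFilter = onlyDots.split(".").filter((s) => s);

// match: null when there is no match at all
const viaMatch = onlyDots.match(/[^.]+/g);

// One possible guard: fall back to an empty array
const viaMatchSafe = onlyDots.match(/[^.]+/g) ?? [];

console.log(viaFilter, viaMatch, viaMatchSafe);
```

If the result is going to be iterated or have .length read, the guarded form avoids a TypeError on empty or period-only input.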

Adding filter(Boolean) to split is certainly a workaround, but the problem can be handled directly (and flexibly) with a regex.
For example, using match or matchAll, you can extract the sentences with a regex that discards periods completely or one that preserves all periods (or other punctuation marks):
const text = "unicorns! and rainbows? and, cupcakes.Hello this is splitting by sentences. However, I am not sure.";
// discard periods
console.log(text.match(/[^.]+/g));
// discard periods and leading whitespace
console.log([...text.matchAll(/(.+?)(?:\.\s*)/g)].map(e => e[1]));
// keep periods
console.log(text.match(/(.+?)\./g));
// keep periods but trim whitespace
console.log([...text.matchAll(/(.+?\.)\s*/g)].map(e => e[1]));
// discard various sentence-related punctuation
console.log(text.match(/[^.?!]+/g));
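If you would rather stay with split itself, one option is to strip the final period first and then split on a regex that also swallows the whitespace after each period. This is only a sketch; the helper name splitOnPeriods is hypothetical:

```javascript
const text = "unicorns! and rainbows? and, cupcakes.Hello this is splitting by sentences. However, I am not sure.";

// Hypothetical helper, not from the answers above.
function splitOnPeriods(str) {
  return str
    .replace(/\.\s*$/, "") // drop the trailing period (and any whitespace after it)
    .split(/\.\s*/);       // split on a period plus any following whitespace
}

console.log(splitOnPeriods(text));
```

Because the trailing period is removed before splitting, no empty last element is produced, and the whitespace after each period does not end up at the start of the next sentence.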

Related

Regex to match string in a sentence

I am trying to find a strictly declared string in a sentence, the thread says:
Find the position of the string "ten" within a sentence, without using the exact string directly (this can be avoided in many ways using just a bit of RegEx). Print as many spaces as there were characters in the original sentence before the aforementioned string appeared, and then the string itself in lowercase.
I've gotten this far:
let words = 'A ton of tunas weighs more than ten kilograms.'
function findTheNumber(){
let regex=/t[a-z]*en/gi;
let output = words.match(regex)
console.log(words)
console.log(output)
}
console.log(findTheNumber())
The result should be:
input = A ton of tunas weighs more than ten kilograms.
output = ten(ENTER)
You could try a regex replacement approach, with the help of a callback function:
var input = "A ton of tunas weighs more than ten kilograms.";
var output = input.replace(/\w+/g, function(match)
{
if (!match.match(/^[t][e][n]$/)) {
return match.replace(/\w/g, " ");
}
else {
return match;
}
});
console.log(input);
console.log(output);
The above logic matches every word in the input sentence, and then selectively replaces every word which is not ten with an equal number of spaces.
You can use
let text = 'A ton of tunas weighs more than ten kilograms.'
function findTheNumber(words){
console.log( words.replace(/\b(t[e]n)\b|[^.]/g, (x,y) => y ?? " ") )
}
findTheNumber(text)
The \b(t[e]n)\b is basically ten whole word searching pattern.
The \b(t[e]n)\b|[^.] regex will match and capture ten into Group 1 and will match any char but . (as you need to keep it at the end). If Group 1 matches, it is kept (ten remains in the output), else the char matched is replaced with a space.
Depending on which chars you want to keep, you may adjust the [^.] pattern. For example, if you want to keep all non-word chars, you may use \w in its place, so that only word chars are replaced with spaces.
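Read literally, the task asks for the position of the match and then that many spaces followed by the word. A minimal sketch of that route, assuming that writing the pattern as t[e]n is an acceptable way to avoid the exact string (the helper name indentMatch is made up):

```javascript
const sentence = "A ton of tunas weighs more than ten kilograms.";

// Hypothetical helper: pads the first match with as many spaces as its index.
function indentMatch(str, regex) {
  const m = regex.exec(str);
  if (m === null) return ""; // nothing found
  return " ".repeat(m.index) + m[0].toLowerCase();
}

const line = indentMatch(sentence, /\bt[e]n\b/i);

console.log(sentence);
console.log(line); // "ten" lines up under its position in the sentence
```

The regex is not global here, so exec always searches from the start of the string and m.index is the offset of the first match.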

How to match bold markdown if it isn't preceded with a backslash?

I'm looking to match bolded markdown. Here are some examples:
qwer *asdf* zxcv matches *asdf*
qwer*asdf*zxcv matches *asdf*
qwer \*asdf* zxcv does not match
*qwer* asdf zxcv matches *qwer*
A negative look behind like this (?<!\\)\*(.*)\* works.
Except there is no browser support in Firefox, so I cannot use it.
Similarly, I can get very close with (^|[^\\])\*(.*)\*
The issue is that there are two capture groups, and I need the index of the second capture group, and Javascript only returns the index of the first capture group. I can bandaid it in this case by just adding 1, but in other cases this hack will not work.
My reasoning for doing this is that I'm trying to replace a small subset of Markdown with React components. As an example, I'm trying to convert this string:
qwer *asdf* zxcv *123*
Into this array:
[ "qwer ", <strong>asdf</strong>, " zxcv ", <strong>123</strong> ]
Where the second and fourth elements are created via JSX and included as array elements.
You will also need to take into account that a backslash before an asterisk may itself be escaped by a preceding backslash, in which case the asterisk should be considered the start of bold markup. Unless that backslash is in turn preceded by another backslash, and so on.
So I would suggest this regular expression:
((?:^|[^\\])(?:\\.)*)\*((\\.|[^*])*)\*
If the purpose is to replace these with tags, like <strong> ... </strong>, then just use JavaScript's replace as follows:
let s = String.raw`now *this is bold*, and \\*this too\\*, but \\\*this\* not`;
console.log(s);
let regex = /((?:^|[^\\])(?:\\.)*)\*((\\.|[^*])*)\*/g;
let res = s.replace(regex, "$1<strong>$2</strong>");
console.log(res);
If the bolded words should be converted to a React component and stored in an array with the other pieces of plain text, then you could use split and map:
let s = String.raw`now *this is bold*, and \\*this too\\*, but \\\*this\* not`;
console.log(s);
let regex = /((?:^|[^\\])(?:\\.)*)\*((?:\\.|[^*])*)\*/g;
let res = s.split(regex).map((part, i) =>
i%3 === 2 ? React.createElement("strong", { key: i }, part) : part
);
Since there are two capture groups in the "delimiter" for the split call, one having the preceding character(s) and the second the word itself, every third item in the split result is a word to be bolded, hence the i%3 expression.
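To make the shape of that split result easier to see without React, here is a sketch that turns it into plain token objects. The tokenize function and the { type, value } shape are assumptions for illustration; the regex is the one from the snippet above:

```javascript
const boldRegex = /((?:^|[^\\])(?:\\.)*)\*((?:\\.|[^*])*)\*/g;

// Hypothetical helper: groups the split result three items at a time.
function tokenize(s) {
  const parts = s.split(boldRegex);
  const tokens = [];
  for (let i = 0; i < parts.length; i += 3) {
    // parts[i]     : plain text before the match
    // parts[i + 1] : group 1, the character(s) just before the opening *
    // parts[i + 2] : group 2, the text to be bolded
    const text = parts[i] + (parts[i + 1] ?? "");
    if (text) tokens.push({ type: "text", value: text });
    if (i + 2 < parts.length) tokens.push({ type: "bold", value: parts[i + 2] });
  }
  return tokens;
}

console.log(tokenize("qwer *asdf* zxcv *123*"));
```

Gluing group 1 back onto the preceding text gives the four-element layout asked for in the question; each "bold" token could then be mapped to a <strong> element.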
This should do the trick:
/(?:^|[^\\])(\*[^*]+[^\\]\*)/
The only capturing group there is the string surrounded by *'s.
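On the index problem from the question: when the first group sits at the very start of the match, the position of what follows it can be computed as match.index plus the length of group 1, which generalizes the "add 1" hack. A sketch under that assumption (the helper name boldSpans is made up, and a lazy .*? is used instead of the question's greedy .* so that several bold runs on one line are found separately):

```javascript
const boldRe = /(^|[^\\])\*(.*?)\*/g;

// Hypothetical helper: reports where each *...* run starts and ends.
function boldSpans(s) {
  return [...s.matchAll(boldRe)].map((m) => {
    const start = m.index + m[1].length; // index of the opening *
    const end = m.index + m[0].length;   // index just past the closing *
    return { start, end, inner: m[2] };
  });
}

console.log(boldSpans("qwer *asdf* zxcv *123*"));
console.log(boldSpans("qwer \\*asdf* zxcv")); // escaped opening *, so no spans
```

This works whether group 1 matched one character or the empty string at the start of the input.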

RegEx for replacing punctuation excluding negative numbers

Currently, to remove punctuation from a string, I use:
export function scrubPunctuation(text) {
let reg = /\b[-.,()&$#![\]{}"']+\B|\B[-.,()&$#![\]{}"']+\b/g;
return text.replace(reg, "");
}
but this also removes -1, where - is not so much "punctuation" as part of a numerical value.
How do I solve this problem?
Example use case:
I have to take a string from a user that might look like this:
const userStr = " I want something, sort of, that has at least one property < -1.02 ? "
Currently, my approach is to first trim the string to remove the leading / trailing white space.
Then I "scrub" punctuation from the string.
From the example of userStr above, I might eventually parse out (via some logic unrelated to regex):
const relevant = ["something", "at least one", "<", "-1.02"]
In general, non-numeric punctuation is irrelevant.
Split your first character set: remove the hyphen from it and handle the hyphen separately, with a negative lookahead:
[-]+(?![0-9]) // a hyphen not followed by a digit
And the full expression:
\b[-]+(?![0-9])|[-.,()&$#![\]{}"']+\B|\B[.,()&$#![\]{}"']+\b
If you don't want the minus sign, the dot or the comma removed from the digits, one option might be to capture what you want to keep (in this case a number with an optional decimal part) and match what you want to remove.
(-?\d+(?:[.,]\d+)*)|[-.,()&$#![\]{}"']+
let pattern = /(-?\d+(?:[.,]\d+)*)|[-.,()&$#![\]{}"']+/g;
let str = "This is -4, -55 or -4,00.00 (test) 5,00";
let res = str.replace(pattern, "$1");
console.log(res);
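Applied to the string from the question, the same pattern could be used like this. It is only a sketch: the function body is swapped in for the original scrubPunctuation, and the trim step comes from the question's own description:

```javascript
// Same pattern as above: group 1 is a number to keep, the rest is punctuation to drop.
const keepNumbers = /(-?\d+(?:[.,]\d+)*)|[-.,()&$#![\]{}"']+/g;

function scrubPunctuation(text) {
  // "$1" puts back the captured number, or nothing when only punctuation matched
  return text.trim().replace(keepNumbers, "$1");
}

const userStr = " I want something, sort of, that has at least one property < -1.02 ? ";
console.log(scrubPunctuation(userStr));
```

The commas are removed while -1.02 is left intact. Characters that are not in the class, such as < and ?, are untouched, so add them to the class if they should go as well.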
Something like /[,?!.']/g could do the job, and you can add whatever other characters you want to the class. Since - is not in it, negative numbers keep their sign:
const text = "bar........,foo,????!-1'poo!!!?'";
const res = text.replace(/[,?!.']/g, "")
console.log(res)
I would split it into two.
First I would remove everything but alphanumeric and -.
/[^a-z0-9\-\s\n]/gi
It is a little more readable than your method and should give the same result unless there is some character you want to keep (like whitespace \s and newline \n).
To get rid of a "-" that is punctuation rather than a minus sign, I would remove only the hyphens that are not followed by a digit:
/-(?!\d)/g
So altogether:
export function scrubPunctuation(text) {
let reg = /[^a-z0-9\-\s\n]/gi;
let reg2 = /-(?!\d)/g;
text = text.replace(reg, "");
return text.replace(reg2, "");
}
Haven't tested it, but it should work

splitting string is returning full string

I want to create a small script which determines how many words are in a paragraph and then divides the paragraph depending on a certain length. My approach was to split the paragraph using split(), find out how many elements are in the array and then output some of the elements into one paragraph and the rest into another.
var para = document.getElementById('aboutParagraph').innerHTML;
var paraElements = para.split();
var paraLength = paraElements.length;
if(paraLength >= 500){
}
console.log(paraElements);
When I use this code, paraElements is returned as an array whose first element is the entire string.
So, for example, if the paragraph were "this is a paragraph", paraElements is returned as ["this is a paragraph"], with a length of 1. Shouldn't it be ["this", "is", "a", "paragraph"]?
var str = "this is a paragraph";
var ans = str.split(' ');
console.log(ans);
You need to pass a separator: use split(' '), and notice the space between the quotes. You were not passing any argument to split on.
The split() method splits a string at a delimiter you specify (a literal string, a variable holding a string, or a regular expression) and returns an array of all the parts. If you want just one part, index into the resulting array.
You are not supplying a delimiter to split on, so you are getting the entire string back.
var s = "This is my test string";
var result = s.split(/\s+/); // Split wherever there is one or more whitespace characters
console.log(result); // The entire resulting array
console.log("There are " + result.length + " words in the string.");
console.log("The first word is: " + result[0]); // Just the first word
You are missing the split delimiter for a space. Try this:
var para = document.getElementById('aboutParagraph').innerHTML;
var paraElements = para.split(' ');
var paraLength = paraElements.length;
if(paraLength >= 500){
}
console.log(paraElements);
split() returns an array. If you don't pass a delimiter as an argument, the whole string ends up as the single element of that array.
You can break your words on spaces, but you may want to also consider tabs and newlines. For that reason, you could use some regex /\s+/ which will match on any whitespace character.
The + is used so that all consecutive whitespace characters are treated as one delimiter. Otherwise, with /\s/ alone, a string with two consecutive spaces, like "foo  bar", would be treated as three words, one of them an empty string: ["foo", "", "bar"] (the plus makes it ["foo", "bar"], as expected).
var para = document.getElementById('aboutParagraph').innerHTML;
var paraElements = para.split(/\s+/); // <-- need to pass in delimiter to split on
var paraLength = paraElements.length;
if (paraLength >= 500) {}
console.log(paraLength, paraElements);
<p id="aboutParagraph">I want to create a small script which determines how many words are in a paragraph and then divides the paragraph depending on a certain length. My approach was to split the paragraph using split(), find out how many elements are in the array and then output some of the elements into one paragraph and the rest into another.</p>
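For the second half of the question, dividing the paragraph at a word limit, something along these lines might work. It is a sketch only: both function names are hypothetical, and it trims first because leading or trailing whitespace would otherwise produce empty elements even with /\s+/:

```javascript
// Hypothetical helper: words of a text, ignoring leading/trailing whitespace.
function splitWords(text) {
  const trimmed = text.trim();
  return trimmed === "" ? [] : trimmed.split(/\s+/);
}

// Hypothetical helper: the first maxWords words, and whatever is left.
function divideParagraph(text, maxWords) {
  const words = splitWords(text);
  return [
    words.slice(0, maxWords).join(" "),
    words.slice(maxWords).join(" ")
  ];
}

console.log(splitWords("  this is  a paragraph\n").length);
console.log(divideParagraph("this is a paragraph", 3));
```

In the page itself, the text could come from document.getElementById('aboutParagraph').textContent, which avoids counting any markup that innerHTML would include.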

Best way to manipulate and cut a string using character matching?

So in my example, we have strings that look like this:
CP1_ctl05_RCBPAThursdayStartTimePicker_0_dateInput
CP1_ctl05_RCBPAFridayStartTimePicker_3_dateInput
CP1_ctl05_RCBPAMondayStartTimePicker_1_dateInput
The task is to extract the days of the week from the string.
I already figured you can trim the first set of characters CP1_ctl05_RCBPA as they will always have the same length and will always occur in the same position. Using string.substr(15), I was able to reduce the string to FridayStartTimePicker_3_dateInput but I am not sure how to approach deleting the rest of the suffixal garbage text.
I was thinking about trimming the end by finding the first occurring y (as it will only occur in days of the week in this case) and slicing off the end up until that point, but I am not sure about how to approach slicing off a part of a string like this.
You can use regex to extract them. As every day ends with a y, and no day has a y in between, you can simply use that as delimiter
const regex = /\w{15}(\w+y).*/g;
const str = `CP1_ctl05_RCBPAThursdayStartTimePicker_0_dateInput`;
const subst = `\$1`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
Instead of deleting unwanted parts, you could just match what you want.
The regex ^.{15}(\w+?y) matches any 15 characters from the beginning of the string, then matches and captures in group 1 one or more word characters (non-greedy) followed by the letter y. Use the non-greedy +? here; otherwise the match would run up to the last y that exists in the string.
We then just have to take the content of the first group as the day:
var test = [
'CP1_ctl05_RCBPAThursdayStartTimePicker_0_dateInput', 'CP1_ctl05_RCBPAFridayStartTimePicker_3_dateInput', 'CP1_ctl05_RCBPAMondayStartTimePicker_1_dateInput'
];
console.log(test.map(function (a) {
return a.match(/^.{15}(\w+?y)/)[1]
}));
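If relying on the prefix length or on the letter y feels fragile, another option is to match the day names themselves. A minimal sketch (the helper name extractDay is made up):

```javascript
// Matches any English weekday name, wherever it appears in the string.
const DAY = /(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day/;

// Hypothetical helper: returns the day name, or null if none is present.
function extractDay(id) {
  const m = id.match(DAY);
  return m ? m[0] : null;
}

const ids = [
  "CP1_ctl05_RCBPAThursdayStartTimePicker_0_dateInput",
  "CP1_ctl05_RCBPAFridayStartTimePicker_3_dateInput",
  "CP1_ctl05_RCBPAMondayStartTimePicker_1_dateInput"
];

console.log(ids.map(extractDay));
```

This keeps working if the prefix changes length, and it returns null instead of throwing when an id contains no day at all.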
