Regex: How to (better) optimize text from/for messages - javascript

Since this question is not about one specific regex but rather about its design and approach, it might take a while to understand the requirements and their dependencies. I have done everything I can to make it as easy as possible with this fully working yet not elegant solution (deadlink).
I need to optimize text in a messaging platform that is created and edited by others and may have to be sanitized with regex. All optimizations need to be done with one single regex, since they happen often and are quite expensive (or am I wrong about this?). Furthermore, the regex needs to be language-agnostic (at least compatible with Javascript and Php). Last but not least, the optimized text must not contain (additional) Html, as it is used in a text-only environment.
Requirements
Optimize lines
Remove single lines
Do not remove single lines that end with two|no spaces (thus allow editors to force a newline)
Do not remove empty lines (double line breaks)
Do not remove single lines that start with symbol|char|digit|entity+space (raw lists)
Condense multiple consecutive empty lines (double line breaks) to one double line break
Optimize spaces
Remove excess spaces
Do not remove spaces at the end of a sentence
Optimize comments
Remove single line comments
Do not remove trailing comments
Overall
Preserve Html and do not add Html
Intermediate solution
So far, my solution is to combine 4 regexes which 'match' my requirements and get replaced by a single space:
Matches single lines while leaving empty lines intact and preserving raw lists: \n(?!\n|[-_.○•♥→›>+%\/*~=] |[a-zA-Z_1-9+][\.|\)|\:|\*]) (the length is due to several list-style-types I want to support)
Matches excess empty lines: (\n+)(?=\n\n)
Matches excess spaces: +
Matches single line comments (while ignoring trailing comments): ^\n?\/\/ .+\n
To make the optimization rather inexpensive, I concatenate them with | into one single regex which I can use in Javascript (as well as Php).
r = new RegExp(" \n(?!\n|[-_.○•♥→›>+%\/*~=] |[a-zA-Z_1-9+][.):*] )|(\n+)(?=\n\n)| + |^\n?\/\/ .+\n", "gm");
i = document.getElementById("input").innerHTML;
p = " ";
o = i.replace(r, p);
document.getElementById("output").innerHTML = o;
#input, #output { width: 100%; height: 88vh; }
#input { display: none; } #output { border: none; }
<textarea id="input">
MAKE PARAGRAPHS
This is the first paragraph.
Some sentences end with newlines.
Some don't. We need to cope with that.
This is the second paragraph.
It contains some unnecessary spaces.
Even at the end of a line.
This is the third paragraph.
Some sentences end with question- and exclamation-marks.
I hope that is ok for you. Is it? That's great! Really.
KEEP LISTS
This is an unordered list, starting with a minus+space:
- This is the first item.
- This is the second item.
- This is the third item.
Here is an unordered list, starting with entity|symbol+space:
• This is the second item.
> This is the third item. // Works in php only
* This is the fifth item.
This is a (manually) ordered lists, starting with char|digit+entity+space:
1. This is the first item.
b) This is the second item.
3: This is the third item.
Here is a mathematical list, starting with operators:
+ Plus
- Minus
% Percentage
/ Division
* Multiply
~ Like
= Equal
These are (manually) ordered lists, which are not summed up because they do not end with a space:
1 This is the first item.
b This is the second item.
I like the third item.
First: This works.
Second: It works great.
Third: That is nice!
KEEP HTML
The input text may contain Html.
The output text must simply keep it for further processing.
The output must not add Html as it is processed in a text-only environment.
I know this sounds stupid, but it isn't.
REMOVE COMMENTS
Single/whole line comments are being removed.
// Sources
// Removing single lines: https://regex101.com/r/qU1eP8/5
// Removing comments: https://www.perlmonks.org/?node_id=996552
// Tests
// Dialog: https://api.sefzig.net/dialog/test/regex/
// Jsbin: https://jsbin.com/goromad/edit?output
// Regex101: https://regex101.com/r/Xz5atA/2
// Regexr: https://regexr.com/45svm
Thank you, regex ♥ // Problem solved
~Fin~
</textarea>
<textarea id="output"><!-- Press "Run" --></textarea>
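For readability, the combined expression could also be assembled from its four parts at runtime, so each requirement stays documented next to its pattern. This is only a sketch of that idea; the patterns are copied verbatim from the snippet above and the variable names are illustrative:
// Build the combined regex from its four parts (patterns copied from the snippet above).
var parts = [
  " \n(?!\n|[-_.○•♥→›>+%\/*~=] |[a-zA-Z_1-9+][.):*] )", // single line breaks, preserving empty lines and raw lists
  "(\n+)(?=\n\n)",                                       // excess empty lines
  " + ",                                                 // excess spaces
  "^\n?\/\/ .+\n"                                        // whole-line comments
];
var combined = new RegExp(parts.join("|"), "gm");
var optimized = document.getElementById("input").innerHTML.replace(combined, " ");
Functionally this is identical to the hand-written string; it only trades a little construction cost for maintainability.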
My request
Since I am not a regex-expert and my approach feels cumbersome, I'd like to hear your suggestions. I know Regex is expensive and everything can be done better.
You might wonder about a few details I haven't mentioned here for the sake of clarity. You also might want to test my Regexes. This is why I have set up a sandbox, isolating the requirements (Regexes), containing an example text with all use-cases as well as a detailed description:
https://api.sefzig.net/dialog/test/regex/(deadlink)
In case you want to use the features of great tools out there, here you go:
Regexr: https://regexr.com/45svm
Regex101: https://regex101.com/r/Xz5atA/2
Jsbin: https://jsbin.com/goromad/edit?output
Thank you
for helping me straighten this important feature of my messaging platform! Please feel free to enhance my approach, suggest an alternative or use the results in your own project ♥
This is my first question on stack overflow. I have researched a lot. Please bear with me if I have done anything wrong and help me fix that.

Related

Regex for finding element tagname and attributes "skips" attributes

I'm trying to make a regular expression that finds the tagnames and attributes of elements. For example, if I have this:
<div id="anId" class="aClass">
I want to be able to get an array that looks like this:
["(full match)", "div", "id", "anId", "class", "aClass"]
Currently, I have the regex /<(\S*?)(?: ?(.*?)="(.*?)")*>/, but for some reason it skips over every attribute except for the last one.
var str = '<div id="anId" class="aClass">'
console.log(str.match(/<(\S*)(?: ?(.*?)="(.*?)")*>/));
Regex101: https://regex101.com/r/G0ncwF/2
Another odd thing: if I remove the * after the non-capture group, the capture group in quotes seems to somehow "forget" that it's lazy. (Regex101: https://regex101.com/r/C0UwI8/2)
Why does this happen, and how can I avoid it? I couldn't find any questions/answers that helped me (Python re.finditer match.groups() does not contain all groups from match looked promising, but didn't seem to help me at all)
(note: I know there are better ways to get the attributes, I'm just experimenting with regex)
UPDATE:
I've figured out at least why the quantifiers seem to "forget" that they're lazy. It's actually just that the regex is trying to match all the way to the angle brackets. I suppose I must have been thinking that the non-capturing group was "insulating" everything and preventing that from happening, and I didn't see it was still lazy because there was only one angle bracket for it to find.
var str = '"foo" "bar"> "baz>"'
console.log("/\".*?\"/ produces ", str.match(/".*?"/), ", finds first quote, finds text, lazily stops at second quote");
console.log("/\".*?\">/ produces ", str.match(/".*?">/), ", finds first quote, finds text, sees second quote but doesn't see angle bracket, keeps going until it sees \">, lazily stops");
So at least that's solved. But I still don't understand why it skips over every attribute but the last one.
And note: Other regexes using different tricks to find the attributes are nice and all, but I'm mostly looking to learn why my regex skips over the attributes, so I can maybe understand regex a bit better.
Playing along with your experimentation, you could do this: instead of scanning for what you want, scan for what you don't want, and then filter it out:
const html = '<div id="anId" class="aClass">';
const regex = /[<> ="]/;
let result = html.split(regex).filter(Boolean);
console.log('result: '+JSON.stringify(result));
Output:
result: ["div","id","anId","class","aClass"]
Explanation:
regex /[<> ="]/ lists all chars you don't want
.split(regex) splits your text along the unwanted chars
.filter(Boolean) gets rid of the empty strings that the split leaves behind
Mind you, this has flaws; for example it will split incorrectly for html like <div id="anId" class="aClass anotherClass">, i.e. a space inside an attribute value. To support that you could preprocess the html with another regex to escape spaces inside quotes, then postprocess with another regex to restore the spaces...
Yes, an HTML parser is more reliable for this kind of task.
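For what it's worth, the reason the attributes get skipped in the first place is that a capturing group repeated by a quantifier only keeps the text of its last repetition; the earlier repetitions are simply discarded. A sketch of one way around that in modern JavaScript, matching the tag name once and then iterating the attributes with a second, global regex (illustrative only, not a robust HTML parser):
const str = '<div id="anId" class="aClass">';
const tag = str.match(/<(\S+)/)[1];                // "div"
const attrs = [...str.matchAll(/(\S+)="(.*?)"/g)]  // one match per name="value" pair
  .flatMap(m => [m[1], m[2]]);
console.log([tag, ...attrs]);                      // ["div", "id", "anId", "class", "aClass"]
In a single match, by contrast, the repeated group runs several times but only its final name/value pair survives, which is exactly what the question observes.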

Parse JS code for comments

I have a small NodeJS program that I use to extract code comments from files I point it to. It mostly works, but I'm having some issues dealing with it misinterpreting certain JS strings (glob patterns) as code comments.
I'm using the regex [^:](\/\/.+)|(\/\*[\W\w\n\r]+?\*\/) to parse the following test file:
function DoStuff() {
  /* This contains the value of foo.
     Foo is used to display "foo"
     via http://stackoverflow.com
  */
  this.foo = "http://google.com";
  this.protocolAgnosticUrl = "//cdnjs.cloudflare.com/ajax/libs/jquery/3.2.1/core.js";
  //Show a message about foo
  alert(this.foo);
  /// This is a triple-slash comment!
  const globPatterns = [
    'path/to/**/*.tests.js',
    '!my-file.js',
    "!**/folder/*",
    'another/path/**/*.tests.js'
  ];
}
Here's a live demo to help visualize what is and is not properly captured by the regex: https://regex101.com/r/EwYpQl/1
I need to be able to only locate the actual code comments here, and not the comment-like syntax that can sometimes appear within strings.
I have to agree with the comments that for most cases it is better to use a parser, even when a RegExp can do the job for a specific and well defined use case.
The problem is not that you can't make it work for that very specific use case, even though there are probably plenty of edge cases that you don't really care about, nor have to, but that may break that solution. The actual problem is that if you start building around your sub-optimal solution and your requirements evolve over time, you will start patching it as issues appear. Some day you may find yourself with an extensive codebase full of patches that doesn't scale anymore, and the only solution will probably be to start from scratch.
Anyway, you have been warned by a few of us, and it is still possible that your use case really is that simple and will not change in the future. I would still consider moving from RegExp to a parser at some point, but maybe you can use this meanwhile:
(^ +\/\/(.*))|(["'`]+.*["'`]+.*\/\/(.*))|(["'`]+.*["'`]+.*\/\*([\W\w\n\r]+?)\*\/)|(^ +\/\*([\W\w\n\r]+?)\*\/)
Just in case, I have added a few other cases, such as comments that come straight after some valid code:
Edit to prove the first point and what is being said in the comments:
I had just answered this with the previous RegExp, which solved only the issue you pointed out in your question (your RegExp was misinterpreting strings containing glob patterns as code comments).
So I fixed that, and I even made it able to match comments that start on the same line as a valid (non-commented) statement. Just a moment after posting that, I noticed that this last feature would only work if that statement contains a string.
This is the updated version, but please keep in mind that this is exactly what we are warning you about...:
(^[^"'`\n]+\/\/(.*))|(["'`]+.*["'`]+.*\/\/(.*))|(["'`]+.*["'`]+.*\/\*([\W\w\n\r]+?)\*\/)|(^[^"'`\n]+\/\*([\W\w\n\r]+?)\*\/)
How does it work?
There are 4 main groups that compose the whole RegExp, the first two for single-line comments and the next two for multi-line comments:
(^[^"'`\n]+\/\/(.*))
(["'`]+.*["'`]+.*\/\/(.*))
(["'`]+.*["'`]+.*\/\*([\W\w\n\r]+?)\*\/)
(^[^"'`\n]+\/\*([\W\w\n\r]+?)\*\/)
You will see there are some repeated patterns:
^[^"'`\n]+: From the start of a line, match anything that doesn't include any kind of quote or line break.
` is for ES2015 template literals.
Line breaks are excluded as well to prevent matching empty lines.
Note the + will prevent matching comments that are not padded with at least one space. You can try replacing it with *, but then it will match strings containing glob patterns again.
["'`]+.*["'`]+.*: This matches anything that is between quotes, including anything that looks like a comment but is actually part of a string. Whatever you match after it will be outside that string, so using another group you can match comments.
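For illustration, here is a minimal sketch of how the updated expression might be applied in JavaScript. The g and m flags are an assumption (m so that the ^ anchors apply per line), and the sample line combines the statement and the comment from the test file onto one line to exercise the "comment after valid code" case:
// Capture groups 2, 4, 6 and 8 hold the comment bodies of the four alternatives.
const commentRegex = /(^[^"'`\n]+\/\/(.*))|(["'`]+.*["'`]+.*\/\/(.*))|(["'`]+.*["'`]+.*\/\*([\W\w\n\r]+?)\*\/)|(^[^"'`\n]+\/\*([\W\w\n\r]+?)\*\/)/gm;
const source = 'this.foo = "http://google.com"; //Show a message about foo';
for (const m of source.matchAll(commentRegex)) {
  console.log(m[2] ?? m[4] ?? m[6] ?? m[8]); // "Show a message about foo"
}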

How to detect and remove unwanted lines from a string?

I am working on a project in which I have to extract text data from a PDF.
I am able to extract text from the PDF, but the extracted text sometimes contains lines which I would like to strip out.
Here's an example of unwanted lines -
ISBN 0-7225-3293-8. = CONTENTS = Part One Part Two Epilogue
Page 1 / 94
And here's an example of a good line (which I'd like to keep) -
Dusk was falling as the boy arrived with his herd at an abandoned church.
I wanted to sleep a little longer, he thought. He had had the same dream that night as a week ago
Different PDFs can give out different unwanted lines.
How can I detect them?
Option 1 - Give the computer a rule: If you are able to narrow down what content you would like to keep, the obvious criterion that sticks out to me is the exclusion of special characters, and you can filter your results based on this.
So let's say you agree that all "good lines" will be without special characters ('/', '-', and '='), for example. If a line DOES contain one of these items, you know you can remove it from the content you are keeping. This could be done in a for loop containing an if-then condition that looks something like this:
var lineArray = text.split(/\r?\n/); // make each line of the extracted text an element of the array ("text" is assumed to hold the PDF text)
for (var cnt = 0; cnt < lineArray.length; cnt++)
{
    var line = lineArray[cnt];
    if (line.includes("/") || line.includes("-") || line.includes("="))
        lineArray[cnt] = ""; // blank out the unwanted line
}
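As an aside, the same rule can be written more compactly with a regular expression test and Array.prototype.filter; this is just a sketch that reuses the lineArray and the three characters assumed above:
// Keep only the lines that contain none of the unwanted characters, then rejoin them.
var cleaned = lineArray
    .filter(function (line) { return !/[\/=-]/.test(line); })
    .join("\n");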
At the end of this code you could simply take all the text left in the array, and it would no longer contain the unwanted lines. However, if there are unwanted lines that are virtually indistinguishable by characters, length, positioning etc., the previous approach begins to break down on some of the trickier lines.
This is because there is no rule you can give the computer to distinguish between the good and the bad without giving it a brain such as yours that recognizes parts of speech and sentence structure. In that case you might consider Option 2, which is just that.
Option 2 - Give the computer a brain: Given that the text you want to remove will, based on what you have shown us, more or less be incoherent documentation, an open-source (or purchased) natural language processor may be what you are looking for.
I found a good beginner's intro at http://myreaders.info/10_Natural_Language_Processing.pdf with some information that might be of use to you. From the source,
"Linguistics is the science of language. Its study includes:
sounds (phonology),
word formation (morphology),
sentence structure (syntax),
meaning (semantics), and understanding (pragmatics) etc.
Syntactic Analysis : Here the analysis is of words in a sentence to know the grammatical structure of the sentence. The words are transformed into structures that show how the words relate to each others. Some word sequences may be rejected if they violate the rules of the language for how words may be combined. Example: An English syntactic analyzer would reject the sentence say : 'Boy the go the to store.' "
Using some sort of NLP, you can discover whether a given section of text contains a sentence or some incoherent rambling. This test could then be used as a filter in your program for what you would like to keep or remove.
Side note - As your sample text appears to be not just plain sentences but literature, characters will sometimes speak in sentence fragments, as written by the author. In this case, you could add a separate condition: if the text is contained within two quotation marks and has no special characters, keep it regardless.
In the end NLP may be more work than you require or that you want to do, in which case Option 1 is likely going to be your best bet. On the other hand, it may be just the thing you are looking for. Whatever the case or if you decide you need some combination of the two, best of luck! I hope this answer helps.

Markup regular expression help, double vs single symbols

Background
I have burned myself out looking for this answer. The closest code I could find that works was from StackEdit, specifically the Markdown.Converter.js script, copied below. This is a pretty heavy-hitting regular expression though; my regex for finding ** for example happens in almost 1/5 of the steps, and I don't need this much extra support.
function _DoItalicsAndBold(text) {
// <strong> must go first:
text = text.replace(/([\W_]|^)(\*\*|__)(?=\S)([^\r]*?\S[\*_]*)\2([\W_]|$)/g,"$1<strong>$3</strong>$4");
text = text.replace(/([\W_]|^)(\*|_)(?=\S)([^\r\*_]*?\S)\2([\W_]|$)/g,"$1<em>$3</em>$4");
return text;
}
Question
I'm trying to make my own very simple markdown script that makes these transformations:
* ---> Italics
** ---> Bold
__ ---> Underline
So far I can find all uses of ** (two stars, bold text) with this regex:
/(\*\*)(?:(?=(\\?))\2.)*?\1/g
However I can not for the life of me figure out how to match only * (single star, italicized text) with one regular expression. If I decide to go further I may have to distinguish between _ and __ as well.
Can someone point me in the right direction on how to properly write the regular expressions that will do this?
Update / Clarity of OP's Question
I am aware of parsers, and I am afraid that this question is going to be derailed from the point. I am not asking for parser help (though I do welcome and appreciate it); I am looking specifically for regular expression help. If this helps people get away from parser answers, here is another example. Let's say I have an app that looks for strings inside double quotes and pulls them out to make tags or something. I want to avoid troll users trying to mess things up or sneak things by me, so if they use double double quotes I should just ignore it and not bother making a tag out of it. Example:
In this "sentence" my regex would match "sentence" and use other code I'm not showing you to pull out only the word: sentence.
Now if someone does double double quotes I just ignore it because no match was found. Meaning the inner word should not be found as a match in this instance.
In this ""sentence"" I have two double quotes around the word sentence and it should be completely ignored now. I don't even care about ignoring the outer double quotes and matching on the inner ones. I want no match in this case.
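Taking just the double-quote example, one way to match a word in single double quotes while ignoring double double quotes in a single regular expression is to reject an adjacent quote on either side with lookarounds. This is only a sketch and assumes an engine with lookbehind support (e.g. modern JavaScript):
// Match "word" only when the quotes are not doubled up.
const re = /(?<!")"([^"]+)"(?!")/g;
console.log('In this "sentence" my regex would match'.match(re));   // ['"sentence"']
console.log('In this ""sentence"" it should be ignored'.match(re)); // null (no match)
The same lookaround idea can be used to tell a single * apart from ** when tokenizing the Markdown markers.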

Is there a way to automatically control orphaned words in an HTML document?

I was wondering if there's a way to automatically control orphaned words in an HTML file, possibly by using CSS and/or Javascript (or something else, if anyone has an alternative suggestion).
By 'orphaned words', I mean singular words that appear on a new line at the end of a paragraph. For example:
"This paragraph ends with an undesirable orphaned
word."
Instead, it would be preferable to have the paragraph break as follows:
"This paragraph no longer ends with an undesirable
orphaned word."
While I know that I could manually correct this by placing an HTML non-breaking space (&nbsp;) between the final two words, I'm wondering if there's a way to automate the process, since manual adjustments like this can quickly become tedious for large blocks of text across multiple files.
Incidentally, the CSS2.1 properties orphans (and widows) only apply to entire lines of text, and even then only for the printing of HTML pages (not to mention the fact that these properties are largely unsupported by most major browsers).
Many professional page layout applications, such as Adobe InDesign, can automate the removal of orphans by automatically adding non-breaking spaces where orphans occur; is there any sort of equivalent solution for HTML?
You can avoid orphaned words by replacing the space between the last two words in a sentence with a non-breaking space (&nbsp;).
There are plugins out there that do this, for example jqWidon't or this jquery snippet.
There are also plugins for popular frameworks (such as typogrify for django and widon't for wordpress) that essentially do the same thing.
I know you wanted a javascript solution, but in case someone finds this page while looking for a solution for emails (where Javascript isn't an option), I decided to post my solution.
Use CSS white-space: nowrap. What I do is surround the last two or three words (or wherever I want the "break" to be) in a span and add the style inline (remember, I deal with email; make it a class as needed):
<td>
I don't <span style="white-space: nowrap;">want orphaned words.</span>
</td>
In a fluid/responsive layout, if you do it right, the last few words will break to a second line until there is room for those words to appear on one line.
Read more about the white-space property at this link: http://www.w3schools.com/cssref/pr_text_white-space.asp
EDIT: 12/19/2015 - Since this isn't supported in Outlook, I've been adding a non-breaking space between the last two words in a sentence. It's less code, and supported everywhere.
EDIT: 2/20/2018 - I've discovered that the Outlook App (iOS and Android) doesn't support the &nbsp; entity, so I've had to combine both solutions, e.g.:
<td>
I don't <span style="white-space:nowrap;">want orphaned&nbsp;words.</span>
</td>
In short, no. This is something that has driven print designers crazy for years, but HTML does not provide this level of control.
If you absolutely positively want this, and understand the speed implications, you can try the suggestion here:
detecting line-breaks with jQuery?
That is the best solution I can imagine, but that does not make it a good solution.
I see there are 3rd party plugins suggested, but it's simpler to do it yourself. If all you want to do is replace the last space character with a non-breaking space, it's almost trivial:
const unorphanize = (str) => {
  let iLast = str.lastIndexOf(' '); // position of the last space
  let stArr = str.split('');        // split into individual characters
  stArr[iLast] = '&nbsp;';          // swap the last space for a non-breaking space
  return stArr.join('');
}
I suppose this may miss some unique cases, but it's worked for all my use cases. The caveat is that you can't just plug the output in where plain text would go; you have to set innerHTML = unorphanize(text) or otherwise parse it.
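For example, a minimal usage sketch (the selector is made up; any collection of elements would do):
// Rewrite each paragraph's HTML so its last space becomes a non-breaking space.
document.querySelectorAll("p").forEach(function (p) {
  p.innerHTML = unorphanize(p.innerHTML);
});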
If you want to handle it yourself, without jQuery, you can write a javascript snippet to replace the text, if you're willing to make a couple assumptions:
A sentence always ends with a period.
You always want to replace the whitespace before the last word with &nbsp;.
Assuming you have this html (which is styled to break right before "end" in my browser...monkey with the width if needed):
<div id="articleText" style="width:360px;color:black; background-color:Yellow;">
This is some text with one word on its own line at the end.
<p />
This is some text with one word on its own line at the end.
</div>
You can create this javascript and put it at the end of your page:
<script type="text/javascript">
reformatArticleText();
function reformatArticleText()
{
var div = document.getElementById("articleText");
div.innerHTML = div.innerHTML.replace(/\s(\S*)\./g, "&nbsp;$1.");
}
</script>
The regex simply finds all instances (using the g flag) of a whitespace character (\s) followed by any number of non-whitespace characters (\S*) followed by a period. It creates a back-reference to the non-whitespace characters so they can be kept in the replacement text, right after the non-breaking space.
You can use a similar regex to include other end punctuation marks.
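For instance, a variation that treats '.', '!' and '?' as sentence endings might look like this (a sketch building on the snippet above, not tested against every edge case):
// Same idea, but with a character class for the end punctuation marks.
div.innerHTML = div.innerHTML.replace(/\s(\S*)([.!?])/g, "&nbsp;$1$2");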
If third-party JavaScript is an option, one can use typogr.js, a JavaScript "typogrify" implementation. This particular filter is called, unsurprisingly, Widont.
<script src="https://cdnjs.cloudflare.com/ajax/libs/typogr/0.6.7/typogr.min.js"></script>
<script>
document.body.innerHTML = typogr.widont(document.body.innerHTML);
</script>
</body>
