I have a paragraph of some texts. I want to replace some words in that using wildcard.
The below is my Paragraph.
0.7% lower on the prospect of fresh restrictions that would deal a blow to hopes of a swift economic
recovery. <Origin Href=\"StoryRef\">urn:newsml:reuters.com:*:nL4N2EC04Z</Origin>\n The 2,000-plus cases reported on Sunday was a shocker ,said Nicholas Mapa, ING
In this Para, I want to remove <Origin Href=\"StoryRef\">urn:newsml:reuters.com:*:nL4N2EC04Z</Origin>\n
There are multiple paragraphs. But only the uncommon one is nL4N2EC04Z
All other words are common in those paragraphs .
<Origin Href=\"StoryRef\">urn:newsml:reuters.com:*:(need_to_use_wild_card_here)</Origin>\n
I tried to replace one half.
My code
storyRef="<Origin Href=\"StoryRef\">urn:newsml:reuters.com:*:";
storyRef.replace(storyRef," ")
But am stuck in replacing other parts.
It seems encoding problem. Try to use JSON.stringify to ensure that characters like < and some others does not read decoded.
You can improve Regex too to something like this π
storyRef.replace(/<Origin.*Origin>\\n/gm, ' ');
This example will start in <Origin, get all content between until Origin>\n.
const p = JSON.stringify('0.7% lower on the prospect of fresh restrictions that would deal a blow to hopes of a swift economic recovery. <Origin Href=\"StoryRef\">urn:newsml:reuters.com:*:nL4N2EC04Z</Origin>\n The 2,000-plus cases reported on Sunday was a shocker ,said Nicholas Mapa, ING');
const storyRef = JSON.parse(p.replace(/(<Origin.*Origin>\\n)+/gm, ' '));
console.log(storyRef);
Related
If you double-click English text in Chrome, the whitespace-delimited word you clicked on is highlighted. This is not surprising. However, the other day I was clicking while reading some text in Japanese and noticed that some words were highlighted at word boundaries, even though Japanese doesn't have spaces. Here's some example text:
γ©γγ§ηγγγγ¨γγ¨θ¦ε½γγ€γγ¬γδ½γ§γθζγγγγγγγζγ§γγ£γΌγγ£γΌζ³£γγ¦γγδΊγ γγ―θ¨ζΆγγ¦γγγ
For example, if you click on θζγ, Chrome will correctly highlight it as a single word, even though it's not a single character class (this is a mix of kanji and hiragana). Not all the highlights are correct, but they don't seem random.
How does Chrome decide what to highlight here? I tried searching the Chrome source for "japanese word" but only found tests for an experimental module that doesn't seem active in my version of Chrome.
So it turns out v8 has a non-standard multi-language word segmenter and it handles Japanese.
function tokenizeJA(text) {
var it = Intl.v8BreakIterator(['ja-JP'], {type:'word'})
it.adoptText(text)
var words = []
var cur = 0, prev = 0
while (cur < text.length) {
prev = cur
cur = it.next()
words.push(text.substring(prev, cur))
}
return words
}
console.log(tokenizeJA('γ©γγ§ηγγγγ¨γγ¨θ¦ε½γγ€γγ¬γδ½γ§γθζγγγγγγγζγ§γγ£γΌγγ£γΌζ³£γγ¦γγδΊγ γγ―θ¨ζΆγγ¦γγγ'))
// ["γ©γ", "γ§", "ηγ", "γγ", "γ¨γγ¨", "θ¦ε½", "γ", "γ€", "γ", "γ¬", "γ", "δ½γ§γ", "θζγ", "γγγγ", "γγ", "ζ", "γ§", "γγ£γΌγγ£γΌ", "ζ³£", "γ", "γ¦", "γγδΊ", "γ γ", "γ―", "θ¨ζΆ", "γ", "γ¦", "γγ", "γ"]
I also made a jsfiddle that shows this.
The quality is not amazing but I'm surprised this is supported at all.
Based on links posted by JonathonW, the answer basically boils down to: "There's a big list of Japanese words and Chrome checks to see if you double-clicked in a word."
Specifically, v8 uses ICU to do a bunch of Unicode-related text processing things, including breaking text up into words. The ICU boundary-detection code includes a "Dictionary-Based BreakIterator" for languages that don't have spaces, including Japanese, Chinese, Thai, etc.
And for your specific example of "θζγ", you can find that word in the combined Chinese-Japanese dictionary shipped by ICU (line 255431). There are currently 315,671 total Chinese/Japanese words in the list. Presumably if you find a word that Chrome doesn't split properly, you could send ICU a patch to add that word.
It is still rudimentary (2022-11-27) but Google progresses very fast in the various fields of language parsing.
As of today's state of the code, Google Chrome broke |ηγ|γγ| and |ζ³£|γ|γ¦|γγδΊ|, both 'γγ' and 'γγδΊ' are odd lexically since both 'γγ' and 'γγ' (A) are usually used 'agglutinated' with the previous string 99,9% of the time (B) have very little meaning (frequency usage beyond the 10000th rank).
For Chinese and Japanese anyone can get better results with a vocabulary list of just 100,000 items (you add to the list as you read) that you organize from longest strings to shorter (single characters), for Chinese I set the length at 5 characters maximum, anything bigger is the name of an organization or such, for Japanese I set the maximum at 9 char length. Tonal languages have (65%) shorter words compared to non-tonal.
To parse a paragraph you launch a "do while" loop that starts from the first character and tries to find first the longest possible string in the vocabulary list, if that wasn't successful the search proceed towards the end of the list to shorter parts of words with less meaning, till it gets too simple letters or rare single-characters (you need to have all these single items, like, all 6,000 kanji/hanzi for daily reading).
You set a separator when you encounter punctuation or numbers and you skip to the next word.
It would be easier if I showed this at work but I don't know if people are interested and if I can post video links here.
Since this question does not contain a specific question on regex but more on it's design/approach, it might take a while to understand the requirements and their dependencies. I have done everything I can to make it as easy as possible with this fully working yet not elegant solution(deadlink).
I need to optimize text in a messaging platform that is being created/edited by others and may have to be sanitized with regex. All optimizations need to be done with one single regex, since these happen often and are quite expensive (or am I wrong on this?). Furthermore the regex needs to be language-agnostic (at least compatible with Javascript and Php). Last but not least, the optimized text must not contain (additional) Html as it is used in a text-only environment.
Requirements
Optimize lines
Remove single lines
Do not remove single lines that end with two|no spaces (thus allow editors to force a newline)
Do not remove empty lines (double line breaks)
Do not remove single lines that start with symbol|char|digit|entity+space (raw lists)
Condense multiple consecutive empty lines (double line breaks) to one double line break
Optimize spaces
Remove excess spaces
Do not remove spaces at the end of a sentence
Optimize comments
Remove single line comments
Do not remove trailing comments
Overall
Preserve Html and do not add Html
Intermediate solution
So far, my solution is to combine 4 regexes which 'match' my requirements and get replaced by a single space:
Matches single lines while leaving empty lines intact and preserving raw lists: \n(?!\n|[-_.ββ’β₯ββΊ>+%\/*~=] |[a-zA-Z_1-9+][\.|\)|\:|\*]) (the length is due to several list-style-types I want to support)
Matches excess empty lines: (\n+)(?=\n\n)
Matches excess spaces: +
Matches single line comments (while ignoring trailing comments): ^\n?\/\/ .+\n
To make the optimization rather inexpensive, I concatenate them with | to one single regex which I can use in Javascript (as well as Php).
r = new RegExp(" \n(?!\n|[-_.ββ’β₯ββΊ>+%\/*~=] |[a-zA-Z_1-9+][.):*] )|(\n+)(?=\n\n)| + |^\n?\/\/ .+\n", "gm");
i = document.getElementById("input").innerHTML;
p = " ";
o = i.replace(r, p);
document.getElementById("output").innerHTML = o;
#input, #output { width: 100%; height: 88vh; }
#input { display: none; } #output { border: none; }
<textarea id="input">
MAKE PARAGRAPHS
This is the first paragraph.
Some sentences end with newlines.
Some don't. We need to cope with that.
This is the second paragraph.
It contains some unnecessary spaces.
Even at the end of a line.
This is the third paragraph.
Some sentences end with question- and exclamation-marks.
I hope that is ok for you. Is it? That's great! Really.
KEEP LISTS
This is an unordered list, starting with a minus+space:
- This is the first item.
- This is the second item.
- This is the third item.
Here is an unordered list, starting with entity|symbol+space:
β’ This is the second item.
> This is the third item. // Works in php only
* This is the fifth item.
This is a (manually) ordered lists, starting with char|digit+entity+space:
1. This is the first item.
b) This is the second item.
3: This is the third item.
Here is a mathematical list, starting with operators:
+ Plus
- Minus
% Percentage
/ Division
* Multiply
~ Like
= Equal
These are (manually) ordered lists, which are not summed up because they do not end with a space:
1 This is the first item.
b This is the second item.
I like the third item.
First: This works.
Second: It works great.
Third: That is nice!
KEEP HTML
The input text may contain Html.
The output text must simply keep it for further processing.
The output must not add Html as it is processed in a text-only environment.
I know this sounds stupid, but it isn't.
REMOVE COMMENTS
Single/whole line comments are being removed.
// Sources
// Removing single lines: https://regex101.com/r/qU1eP8/5
// Removing comments: https://www.perlmonks.org/?node_id=996552
// Tests
// Dialog: https://api.sefzig.net/dialog/test/regex/
// Jsbin: https://jsbin.com/goromad/edit?output
// Regex101: https://regex101.com/r/Xz5atA/2
// Regexr: https://regexr.com/45svm
Thank you, regex β₯ // Problem solved
~Fin~
</textarea>
<textarea id="output"><!-- Press "Run" --></textarea>
My request
Since I am not a regex-expert and my approach feels cumbersome, I'd like to hear your suggestions. I know Regex is expensive and everything can be done better.
You might wonder about a few details I haven't mentioned here for the sake of clarity. You also might want to test my Regexes. This is why I have set up a sandbox, isolating the requirements (Regexes), containing an example text with all use-cases as well as a detailed description:
https://api.sefzig.net/dialog/test/regex/(deadlink)
In case you want to use the features of great tools out there, here you go:
Regexr: https://regexr.com/45svm
Regex101: https://regex101.com/r/Xz5atA/2
Jsbin: https://jsbin.com/goromad/edit?output
Thank you
for helping me straighten this important feature of my messaging platform! Please feel free to enhance my approach, suggest an alternative or use the results in your own project β₯
This is my first question on stack overflow. I have researched a lot. Please bear with me if I have done anything wrong and help me fix that.
I am working on a project in which i have to extract text data from a PDF.
I am able to extract text from the PDF, but extracted text sometimes contains lines which i would like to strip off from it.
Here's and example of unwanted lines -
ISBN 0-7225-3293-8. = CONTENTS = Part One Part Two Epilogue
Page 1 / 94
And, here's an example of good line (which i'd like to keep) -
Dusk was falling as the boy arrived with his herd at an abandoned church.
I wanted to sleep a little longer, he thought. He had had the same dream that night as a week ago
Different PDFs can give out different unwanted lines.
How can i detect them ?
Option 1 - Give the computer a rule: If you are able to narrow down what content it is that you would like to keep, the obvious criteria that sticks out to me is the exclusion of special characters, then you can filter your results based on this.
So let's say you agree that all "good lines" will be without special characters ('/', '-', and '=') for example, if a line DOES contain one of these items, you know you can remove it from the content you are keeping. This could be done in a for loop containing an if-then condition that looks something like this..
var lineArray = //code needed to make each line of the file an element of the array
For (cnt = 0; cnt < totalLines; cnt++)
{
var line = lineArray[cnt];
if (line.contains("/") || line.contains("-") || line.contains("="))
lineArray[cnt] = "";
}
At the end of this code you could simply get all the text within the array and it would no longer contain the unwanted lines. If there are unwanted lines however, that are virtually indistinguishable by characters, length, positioning etc. the previous approach begins to break down on some of the trickier lines.
This is because there is no rule you can give the computer to distinguish between the good and the bad without giving it a brain such as yours that recognizes parts of speech and sentence structure. In which case you might consider option 2, which is just that.
Option 2- Give the computer a brain: Given that the text you want to remove will more or less be incoherent documentation based on what you have shown us, an open source (or purchased) natural language processor may be what you are looking for.
I found a good beginner's intro at http://myreaders.info/10_Natural_Language_Processing.pdf with some information that might be of use to you. From the source,
"Linguistics is the science of language. Its study includes:
sounds (phonology),
word formation (morphology),
sentence structure (syntax),
meaning (semantics), and understanding (pragmatics) etc.
Syntactic Analysis : Here the analysis is of words in a sentence to know the grammatical structure of the sentence. The words are transformed into structures that show how the words relate to each others. Some word sequences may be rejected if they violate the rules of the language for how words may be combined. Example: An English syntactic analyzer would reject the sentence say : 'Boy the go the to store.' "
Using some sort of NLP, you can discover whether a given section of text contains a sentence or some incoherent rambling. This test could then be used as a filter in your program for what you would like to keep or remove.
Side note- As it appears your sample text is not just sentences but literature, sometimes characters will speak in sentence fragments as part of their nature given by the author. In this case, you could add a separate condition that if the text is contained within two quotations and has no special characters, you want to keep the text regardless.
In the end NLP may be more work than you require or that you want to do, in which case Option 1 is likely going to be your best bet. On the other hand, it may be just the thing you are looking for. Whatever the case or if you decide you need some combination of the two, best of luck! I hope this answer helps.
Iβm using cheeriojs to scrape content off a webpage, with the following HTML.
<p>
Although the PM's office could neither confirm nor deny this, the spokesperson, John Doe said the meeting took place on Sunday.
<br>
<br>
βThe outcome will be made public in due course,β John said in an SMS yesterday.
<br>
<br>
</p>
Iβm able to reach the content of interest, by class and id tags, as follows:
$('.top-stories .line.more').each(function(i, el){
//Do somethingβ¦
let content = $(this).next().html();
}
Once Iβve captured the content of interest, I βcleanβ it up using regular expressions, as below:
let cleanedContent = content.split(/<br>/).join(' \n ');
Inserting a newline where an empty tag (<br>) is matched. So far all is good, until I look at the cleaned content below:
Although the PM's office could neither confirm nor deny this, the spokesperson, Saima Shaanika said the meeting took place on Friday.
βThe outcome will be made public in due course,β
It appears that punctuation marks, and perhaps some other characters, are stored according to their unicode codes. I may be wrong on this, and would welcome some correction to this line of thought.
Assuming that they are stored as unicode codes, is there a module that I could pass the βcleanedContentβ variable, through to convert the unicodes to human readable punctuation marks/characters?
Should this not be possible, is there a better implementation of cheeriojs that would avoid this? I'm totally open to the notion that I'm not using cherriojs correctly, and would love some direction as to new approaches I could try instead.
One way I can think of, is writing a module containing several unicodes and their corresponding unicodes, then look for matches, and replace a matched code with the corresponding human readable character. I have some intuitive feeling that someone's already done this or something similar. I'd rather not try to reinvent the wheel.
Thanks in advance.
Cheerio uses htmlparser2 internally.
Because of this, you can use htmlparser2's decodeEntities option during the load of the HTML string, which allows you configure how HTML entities should be treated.
Example:
$ = cheerio.load('<ul id="fruits">...</ul>', {
decodeEntities: false
});
Relevant docs:
Cheerio
htmlparser2
I was wondering if there's a way to automatically control orphaned words in an HTML file, possibly by using CSS and/or Javascript (or something else, if anyone has an alternative suggestion).
By 'orphaned words', I mean singular words that appear on a new line at the end of a paragraph. For example:
"This paragraph ends with an undesirable orphaned
word."
Instead, it would be preferable to have the paragraph break as follows:
"This paragraph no longer ends with an undesirable
orphaned word."
While I know that I could manually correct this by placing an HTML non-breaking space ( ) between the final two words, I'm wondering if there's a way to automate the process, since manual adjustments like this can quickly become tedious for large blocks of text across multiple files.
Incidentally, the CSS2.1 properties orphans (and widows) only apply to entire lines of text, and even then only for the printing of HTML pages (not to mention the fact that these properties are largely unsupported by most major browsers).
Many professional page layout applications, such as Adobe InDesign, can automate the removal of orphans by automatically adding non-breaking spaces where orphans occur; is there any sort of equivalent solution for HTML?
You can avoid orphaned words by replacing the space between the last two words in a sentence with a non-breaking space ( ).
There are plugins out there that does this, for example jqWidon't or this jquery snippet.
There are also plugins for popular frameworks (such as typogrify for django and widon't for wordpress) that essentially does the same thing.
I know you wanted a javascript solution, but in case someone found this page a solution but for emails (where Javascript isn't an option), I decided to post my solution.
Use CSS white-space: nowrap. So what I do is surround the last two or three words (or wherever I want the "break" to be) in a span, add an inline CSS (remember, I deal with email, make a class as needed):
<td>
I don't <span style="white-space: nowrap;">want orphaned words.</span>
</td>
In a fluid/responsive layout, if you do it right, the last few words will break to a second line until there is room for those words to appear on one line.
Read more about about the white-space property on this link: http://www.w3schools.com/cssref/pr_text_white-space.asp
EDIT: 12/19/2015 - Since this isn't supported in Outlook, I've been adding a non-breaking space between the last two words in a sentence. It's less code, and supported everywhere.
EDIT: 2/20/2018 - I've discovered that the Outlook App (iOS and Android) doesn't support the entity, so I've had to combine both solutions: e.g.:
<td>
I don't <span style="white-space:nowrap;">want orphaned words.</span>
</td>
In short, no. This is something that has driven print designers crazy for years, but HTML does not provide this level of control.
If you absolutely positively want this, and understand the speed implications, you can try the suggestion here:
detecting line-breaks with jQuery?
That is the best solution I can imagine, but that does not make it a good solution.
I see there are 3rd party plugins suggested, but it's simpler to do it yourself. if all you want to do is replace the last space character with a non-breaking space, it's almost trivial:
const unorphanize = (str) => {
let iLast = str.lastIndexOf(' ');
let stArr = str.split('');
stArr[iLast] = ' ';
return stArr.join('')
}
I suppose this may miss some unique cases but it's worked for all my use cases. the caveat is that you can't just plug the output in where text would go, you have to set innerHTML = unorphanize(text) or otherwise parse it
If you want to handle it yourself, without jQuery, you can write a javascript snippet to replace the text, if you're willing to make a couple assumptions:
A sentence always ends with a period.
You always want to replace the whitespace before the last word with
Assuming you have this html (which is styled to break right before "end" in my browser...monkey with the width if needed):
<div id="articleText" style="width:360px;color:black; background-color:Yellow;">
This is some text with one word on its own line at the end.
<p />
This is some text with one word on its own line at the end.
</div>
You can create this javascript and put it at the end of your page:
<script type="text/javascript">
reformatArticleText();
function reformatArticleText()
{
var div = document.getElementById("articleText");
div.innerHTML = div.innerHTML.replace(/\S(\s*)\./g, " $1.");
}
</script>
The regex simply finds all instances (using the g flag) of a whitespace character (\S) followed by any number of non-whitespace characters (\s) followed by a period. It creates a back-reference to the non-white-space that you can use in the replace text.
You can use a similar regex to include other end punctuation marks.
If third-party JavaScript is an option, one can use typogr.js, a JavaScript "typogrify" implementation. This particular filter is called, unsurprisingly, Widont.
<script src="https://cdnjs.cloudflare.com/ajax/libs/typogr/0.6.7/typogr.min.js"></script>
<script>
document.body.innerHTML = typogr.widont(document.body.innerHTML);
</script>
</body>