I am working on a project in which I have to extract text data from a PDF.
I am able to extract text from the PDF, but the extracted text sometimes contains lines which I would like to strip out.
Here's an example of unwanted lines -
ISBN 0-7225-3293-8. = CONTENTS = Part One Part Two Epilogue
Page 1 / 94
And here's an example of good lines (which I'd like to keep) -
Dusk was falling as the boy arrived with his herd at an abandoned church.
I wanted to sleep a little longer, he thought. He had had the same dream that night as a week ago
Different PDFs can produce different unwanted lines.
How can I detect them?
Option 1 - Give the computer a rule: If you can narrow down what content you would like to keep, you can filter your results on that basis. The obvious criterion that sticks out to me is the exclusion of special characters.
So let's say you agree that all "good lines" will be free of special characters ('/', '-', and '=', for example). If a line DOES contain one of these, you know you can remove it from the content you are keeping. This could be done in a for loop containing an if condition that looks something like this:
// Assumes `text` holds the extracted PDF text; split it so each line is one array element
var lineArray = text.split("\n");
for (var cnt = 0; cnt < lineArray.length; cnt++)
{
var line = lineArray[cnt];
// Blank out any line that contains one of the special characters
if (line.includes("/") || line.includes("-") || line.includes("="))
lineArray[cnt] = "";
}
At the end of this code you could simply join all the text within the array, and it would no longer contain the unwanted lines. However, if some unwanted lines are virtually indistinguishable from good ones by characters, length, positioning, etc., this approach begins to break down.
This is because there is no rule you can give the computer to distinguish between the good and the bad without giving it a brain such as yours that recognizes parts of speech and sentence structure. In which case you might consider option 2, which is just that.
Option 2- Give the computer a brain: Given that the text you want to remove will more or less be incoherent documentation based on what you have shown us, an open source (or purchased) natural language processor may be what you are looking for.
I found a good beginner's intro at http://myreaders.info/10_Natural_Language_Processing.pdf with some information that might be of use to you. From the source,
"Linguistics is the science of language. Its study includes:
sounds (phonology),
word formation (morphology),
sentence structure (syntax),
meaning (semantics), and understanding (pragmatics) etc.
Syntactic Analysis : Here the analysis is of words in a sentence to know the grammatical structure of the sentence. The words are transformed into structures that show how the words relate to each others. Some word sequences may be rejected if they violate the rules of the language for how words may be combined. Example: An English syntactic analyzer would reject the sentence say : 'Boy the go the to store.' "
Using some sort of NLP, you can discover whether a given section of text contains a sentence or some incoherent rambling. This test could then be used as a filter in your program for what you would like to keep or remove.
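Short of a full NLP library, a rough approximation of this test can be sketched with surface heuristics. The function below is only an illustration, not real syntactic analysis: the name looksLikeProse and the thresholds (at least 4 words, 80% plain word tokens, 50% of tokens starting in lowercase) are assumptions chosen to fit the sample lines in the question, and they would need tuning for other PDFs.

```javascript
// Crude stand-in for a syntactic check: surface heuristics only, no real NLP.
function looksLikeProse(line) {
  const words = line.trim().split(/\s+/).filter(Boolean);
  // Assumption: lines with fewer than 4 tokens are too short to be a sentence.
  if (words.length < 4) return false;

  // Tokens made only of letters, with an optional apostrophe part
  // and an optional trailing punctuation mark.
  const wordLike = words.filter((w) =>
    /^[A-Za-z]+(?:'[A-Za-z]+)?[.,;:!?]?$/.test(w)
  );
  // Running prose is mostly lowercase; headings and metadata usually are not.
  const lowercase = words.filter((w) => /^[a-z]/.test(w));

  return (
    wordLike.length / words.length >= 0.8 &&
    lowercase.length / words.length >= 0.5
  );
}

// Keep only the lines that pass the heuristic.
function keepProseLines(text) {
  return text.split("\n").filter(looksLikeProse);
}
```

A filter like this will also drop short legitimate lines such as dialogue fragments, which is where a real parser would do better.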
Side note- As it appears your sample text is not just sentences but literature, sometimes characters will speak in sentence fragments as part of their nature given by the author. In this case, you could add a separate condition that if the text is contained within two quotations and has no special characters, you want to keep the text regardless.
In the end NLP may be more work than you require or that you want to do, in which case Option 1 is likely going to be your best bet. On the other hand, it may be just the thing you are looking for. Whatever the case or if you decide you need some combination of the two, best of luck! I hope this answer helps.
If you double-click English text in Chrome, the whitespace-delimited word you clicked on is highlighted. This is not surprising. However, the other day I was clicking while reading some text in Japanese and noticed that some words were highlighted at word boundaries, even though Japanese doesn't have spaces. Here's some example text:
どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。
For example, if you click on 薄暗い, Chrome will correctly highlight it as a single word, even though it's not a single character class (this is a mix of kanji and hiragana). Not all the highlights are correct, but they don't seem random.
How does Chrome decide what to highlight here? I tried searching the Chrome source for "japanese word" but only found tests for an experimental module that doesn't seem active in my version of Chrome.
So it turns out v8 has a non-standard multi-language word segmenter and it handles Japanese.
function tokenizeJA(text) {
var it = Intl.v8BreakIterator(['ja-JP'], {type:'word'})
it.adoptText(text)
var words = []
var cur = 0, prev = 0
while (cur < text.length) {
prev = cur
cur = it.next()
words.push(text.substring(prev, cur))
}
return words
}
console.log(tokenizeJA('どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。'))
// ["どこ", "で", "生れ", "たか", "とんと", "見当", "が", "つ", "か", "ぬ", "。", "何でも", "薄暗い", "じめじめ", "した", "所", "で", "ニャーニャー", "泣", "い", "て", "いた事", "だけ", "は", "記憶", "し", "て", "いる", "。"]
I also made a jsfiddle that shows this.
The quality is not amazing but I'm surprised this is supported at all.
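The standard Intl.Segmenter API covers the same use case in current versions of Chrome and Node.js without relying on the non-standard V8 extension. A minimal sketch, assuming an environment that ships Intl.Segmenter with ICU data; tokenizeWithSegmenter is a hypothetical name, and the exact word boundaries you get depend on the ICU version in use:

```javascript
// Word segmentation with the standard Intl.Segmenter API.
// Assumption: the runtime provides Intl.Segmenter (for example a recent Chrome or Node.js).
function tokenizeWithSegmenter(text, locale = "ja-JP") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  // Each item yielded by segment() has `segment`, `index` and `isWordLike` fields.
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

const sample = "どこで生れたかとんと見当がつかぬ。";
console.log(tokenizeWithSegmenter(sample));
```

Punctuation comes back as separate segments; filtering on isWordLike instead of mapping to segment would drop it.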
Based on links posted by JonathonW, the answer basically boils down to: "There's a big list of Japanese words and Chrome checks to see if you double-clicked in a word."
Specifically, v8 uses ICU to do a bunch of Unicode-related text processing things, including breaking text up into words. The ICU boundary-detection code includes a "Dictionary-Based BreakIterator" for languages that don't have spaces, including Japanese, Chinese, Thai, etc.
And for your specific example of "薄暗い", you can find that word in the combined Chinese-Japanese dictionary shipped by ICU (line 255431). There are currently 315,671 total Chinese/Japanese words in the list. Presumably if you find a word that Chrome doesn't split properly, you could send ICU a patch to add that word.
It is still rudimentary (as of 2022-11-27), but Google is progressing very fast in the various fields of language parsing.
In the current state of the code, Google Chrome splits |生れ|たか| and |泣|い|て|いた事|. Both 'たか' and 'いた事' are lexically odd, since 'たか' and 'いた' (A) are 'agglutinated' with the previous string 99.9% of the time and (B) carry very little meaning on their own (usage frequency beyond the 10,000th rank).
For Chinese and Japanese, anyone can get better results with a vocabulary list of just 100,000 items (you add to the list as you read), organized from the longest strings down to the shortest (single characters). For Chinese I set the maximum length at 5 characters, since anything longer is the name of an organization or similar; for Japanese I set the maximum at 9 characters. Tonal languages have (65%) shorter words compared to non-tonal ones.
To parse a paragraph, you run a "do while" loop that starts from the first character and first tries to find the longest possible string in the vocabulary list. If that fails, the search proceeds towards the end of the list, to shorter parts of words with less meaning, until it gets to simple letters or rare single characters (you need to have all these single items in the list, e.g. all 6,000 kanji/hanzi needed for daily reading).
You set a separator when you encounter punctuation or numbers, and you skip to the next word.
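The loop described above can be sketched roughly as follows. This is an illustration under assumptions, not the exact implementation: segmentLongestMatch is a hypothetical name, the vocabulary is a plain Set rather than a length-sorted list, and a character that is not in the vocabulary is simply emitted on its own.

```javascript
// Greedy longest-match segmentation against a vocabulary.
// Assumptions: `vocabulary` is a Set of known words, `maxLen` is the longest
// entry length to try (e.g. 5 for Chinese, 9 for Japanese as suggested above).
function segmentLongestMatch(text, vocabulary, maxLen) {
  const separator = /[\s\p{P}\p{N}]/u; // whitespace, punctuation, digits
  const chars = Array.from(text);      // split by code point, not UTF-16 unit
  const words = [];
  let i = 0;

  while (i < chars.length) {
    if (separator.test(chars[i])) {
      i += 1; // separators end the current word and are skipped
      continue;
    }
    let matched = chars[i]; // fall back to a single character
    // Try the longest candidate first, then progressively shorter ones.
    for (let len = Math.min(maxLen, chars.length - i); len > 1; len--) {
      const candidate = chars.slice(i, i + len).join("");
      if (vocabulary.has(candidate)) {
        matched = candidate;
        break;
      }
    }
    words.push(matched);
    i += Array.from(matched).length;
  }
  return words;
}

const vocab = new Set(["薄暗い", "薄", "暗い", "所"]);
console.log(segmentLongestMatch("薄暗い所。", vocab, 9)); // ["薄暗い", "所"]
```

Using a Set makes each lookup constant time, so the order of the list no longer matters; only the longest-first order of the candidates does.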
It would be easier if I showed this at work but I don't know if people are interested and if I can post video links here.
I'm trying to find a fast (milliseconds or seconds) solution for taking an inputted block of text and testing it against a large list (11 million) of specific words/phrases. In other words, I would like to see which words/phrases from the list occur in the inputted paragraph.
We use JavaScript and have SQL, MongoDB & DynamoDB as existing data stores that we can integrate this solution into.
I've searched for this problem but can only find solutions for checking whether given words exist in a text, not the other way around.
All ideas are welcome!
In cases like these you want to eliminate as much unnecessary data as possible. Assuming that order matters:
1. First things first, make sure you have a B-tree index built on your phrases database, clustered on the phrase. This will speed up range lookup times.
2. Let n = 2 (or 1, if you're into that).
3. Split the text block into phrases of length n and perform a query for phrases in the dictionary that begin with any of the phrase pairs ('My Phrase%'). This won't perform 4521 million string comparisons thanks to the index.
4. Remember the phrases that are an exact match.
5. Let n = n + 1.
6. Repeat from step 3 using the reduced dictionary, until the reduced dictionary is empty.
You can also make small optimizations here and there depending on the kind of matches you're looking for, such as, not matching across punctuation, only phrases of a certain word length, etc. In any case, I'd expect the time bottleneck here to be on disk access, rather than actual comparisons.
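The same idea can be tried in memory before involving a database. The sketch below swaps the B-tree index and LIKE queries for a Set of word-level prefixes, so it is a different mechanism with the same "grow the phrase while the dictionary still has candidates" logic. The names buildIndex and findPhrases are hypothetical, matching is case-insensitive by assumption, and whether 11 million phrases fit in memory would need to be measured.

```javascript
// Build two lookup sets: complete phrases, and every word-level prefix of them.
function buildIndex(phrases) {
  const exact = new Set();
  const prefixes = new Set();
  for (const phrase of phrases) {
    const words = phrase.toLowerCase().split(/\s+/).filter(Boolean);
    exact.add(words.join(" "));
    for (let n = 1; n <= words.length; n++) {
      prefixes.add(words.slice(0, n).join(" "));
    }
  }
  return { exact, prefixes };
}

// From every word position, extend the candidate one word at a time (n = n + 1)
// and stop as soon as no dictionary phrase starts with it.
function findPhrases(text, index) {
  const words = text.toLowerCase().match(/[\p{L}\p{N}']+/gu) || [];
  const found = new Set();
  for (let start = 0; start < words.length; start++) {
    let candidate = "";
    for (let end = start; end < words.length; end++) {
      candidate = candidate ? candidate + " " + words[end] : words[end];
      if (!index.prefixes.has(candidate)) break; // reduced dictionary is empty
      if (index.exact.has(candidate)) found.add(candidate);
    }
  }
  return [...found];
}

const phraseIndex = buildIndex(["new york", "new york city", "big apple", "york"]);
console.log(findPhrases("I love New York City, the Big Apple.", phraseIndex));
// ["new york", "new york city", "york", "big apple"]
```

The work per input text depends on the length of the text and the longest phrase, not on the size of the phrase list.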
Also, I'm pretty sure I based this algorithm off of an existing one but I don't remember its name so bonus points to anyone who can name it. I think it had something to do with data warehousing/mining and calculating frequencies and patterns?
Since this question is less about a specific regex and more about its design/approach, it might take a while to understand the requirements and their dependencies. I have done everything I can to make it as easy as possible with this fully working yet not elegant solution (dead link).
I need to optimize text in a messaging platform that is being created/edited by others and may have to be sanitized with regex. All optimizations need to be done with one single regex, since these happen often and are quite expensive (or am I wrong on this?). Furthermore the regex needs to be language-agnostic (at least compatible with Javascript and Php). Last but not least, the optimized text must not contain (additional) Html as it is used in a text-only environment.
Requirements
Optimize lines
Remove single lines
Do not remove single lines that end with two|no spaces (thus allow editors to force a newline)
Do not remove empty lines (double line breaks)
Do not remove single lines that start with symbol|char|digit|entity+space (raw lists)
Condense multiple consecutive empty lines (double line breaks) to one double line break
Optimize spaces
Remove excess spaces
Do not remove spaces at the end of a sentence
Optimize comments
Remove single line comments
Do not remove trailing comments
Overall
Preserve Html and do not add Html
Intermediate solution
So far, my solution is to combine 4 regexes which 'match' my requirements and get replaced by a single space:
Matches single lines while leaving empty lines intact and preserving raw lists: \n(?!\n|[-_.○•♥→›>+%\/*~=] |[a-zA-Z_1-9+][\.|\)|\:|\*]) (the length is due to several list-style-types I want to support)
Matches excess empty lines: (\n+)(?=\n\n)
Matches excess spaces: +
Matches single line comments (while ignoring trailing comments): ^\n?\/\/ .+\n
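To see how two of these patterns behave in isolation, here is a minimal sketch that applies the "excess empty lines" and "single line comments" patterns on their own, each replaced by a single space as described above. The function names and sample strings are made up for illustration, and the g and m flags are assumed to be the ones used for the combined regex.

```javascript
// The two patterns as listed above, applied one at a time.
const excessEmptyLines = /(\n+)(?=\n\n)/g;
const singleLineComments = /^\n?\/\/ .+\n/gm;

// Replaces the surplus newlines in front of the last two with one space.
function condenseEmptyLines(text) {
  return text.replace(excessEmptyLines, " ");
}

// Replaces whole-line comments with one space; trailing comments are untouched
// because the pattern is anchored to the start of a line.
function removeLineComments(text) {
  return text.replace(singleLineComments, " ");
}

console.log(JSON.stringify(condenseEmptyLines("a\n\n\n\nb")));      // "a \n\nb"
console.log(JSON.stringify(removeLineComments("x\n// note\ny")));   // "x\n y"
console.log(JSON.stringify(removeLineComments("keep // this\nz"))); // "keep // this\nz"
```

Note that the replacement space stays in the output, so a later "excess spaces" pass still has work to do.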
To make the optimization rather inexpensive, I concatenate them with | to one single regex which I can use in Javascript (as well as Php).
r = new RegExp(" \n(?!\n|[-_.○•♥→›>+%\/*~=] |[a-zA-Z_1-9+][.):*] )|(\n+)(?=\n\n)| + |^\n?\/\/ .+\n", "gm");
i = document.getElementById("input").innerHTML;
p = " ";
o = i.replace(r, p);
document.getElementById("output").innerHTML = o;
#input, #output { width: 100%; height: 88vh; }
#input { display: none; } #output { border: none; }
<textarea id="input">
MAKE PARAGRAPHS
This is the first paragraph.
Some sentences end with newlines.
Some don't. We need to cope with that.
This is the second paragraph.
It contains some unnecessary spaces.
Even at the end of a line.
This is the third paragraph.
Some sentences end with question- and exclamation-marks.
I hope that is ok for you. Is it? That's great! Really.
KEEP LISTS
This is an unordered list, starting with a minus+space:
- This is the first item.
- This is the second item.
- This is the third item.
Here is an unordered list, starting with entity|symbol+space:
• This is the second item.
> This is the third item. // Works in php only
* This is the fifth item.
This is a (manually) ordered lists, starting with char|digit+entity+space:
1. This is the first item.
b) This is the second item.
3: This is the third item.
Here is a mathematical list, starting with operators:
+ Plus
- Minus
% Percentage
/ Division
* Multiply
~ Like
= Equal
These are (manually) ordered lists, which are not summed up because they do not end with a space:
1 This is the first item.
b This is the second item.
I like the third item.
First: This works.
Second: It works great.
Third: That is nice!
KEEP HTML
The input text may contain Html.
The output text must simply keep it for further processing.
The output must not add Html as it is processed in a text-only environment.
I know this sounds stupid, but it isn't.
REMOVE COMMENTS
Single/whole line comments are being removed.
// Sources
// Removing single lines: https://regex101.com/r/qU1eP8/5
// Removing comments: https://www.perlmonks.org/?node_id=996552
// Tests
// Dialog: https://api.sefzig.net/dialog/test/regex/
// Jsbin: https://jsbin.com/goromad/edit?output
// Regex101: https://regex101.com/r/Xz5atA/2
// Regexr: https://regexr.com/45svm
Thank you, regex ♥ // Problem solved
~Fin~
</textarea>
<textarea id="output"><!-- Press "Run" --></textarea>
My request
Since I am not a regex-expert and my approach feels cumbersome, I'd like to hear your suggestions. I know Regex is expensive and everything can be done better.
You might wonder about a few details I haven't mentioned here for the sake of clarity. You also might want to test my Regexes. This is why I have set up a sandbox, isolating the requirements (Regexes), containing an example text with all use-cases as well as a detailed description:
https://api.sefzig.net/dialog/test/regex/(deadlink)
In case you want to use the features of great tools out there, here you go:
Regexr: https://regexr.com/45svm
Regex101: https://regex101.com/r/Xz5atA/2
Jsbin: https://jsbin.com/goromad/edit?output
Thank you
for helping me straighten this important feature of my messaging platform! Please feel free to enhance my approach, suggest an alternative or use the results in your own project ♥
This is my first question on stack overflow. I have researched a lot. Please bear with me if I have done anything wrong and help me fix that.
I was wondering if there's a way to automatically control orphaned words in an HTML file, possibly by using CSS and/or Javascript (or something else, if anyone has an alternative suggestion).
By 'orphaned words', I mean singular words that appear on a new line at the end of a paragraph. For example:
"This paragraph ends with an undesirable orphaned
word."
Instead, it would be preferable to have the paragraph break as follows:
"This paragraph no longer ends with an undesirable
orphaned word."
While I know that I could manually correct this by placing an HTML non-breaking space (&nbsp;) between the final two words, I'm wondering if there's a way to automate the process, since manual adjustments like this can quickly become tedious for large blocks of text across multiple files.
Incidentally, the CSS2.1 properties orphans (and widows) only apply to entire lines of text, and even then only for the printing of HTML pages (not to mention the fact that these properties are largely unsupported by most major browsers).
Many professional page layout applications, such as Adobe InDesign, can automate the removal of orphans by automatically adding non-breaking spaces where orphans occur; is there any sort of equivalent solution for HTML?
You can avoid orphaned words by replacing the space between the last two words in a sentence with a non-breaking space (&nbsp;).
There are plugins out there that do this, for example jqWidon't or this jquery snippet.
There are also plugins for popular frameworks (such as typogrify for django and widon't for wordpress) that essentially do the same thing.
I know you wanted a JavaScript solution, but in case someone finds this page while looking for a solution for emails (where JavaScript isn't an option), I decided to post my solution.
Use CSS white-space: nowrap. So what I do is surround the last two or three words (or wherever I want the "break" to be) in a span, add an inline CSS (remember, I deal with email, make a class as needed):
<td>
I don't <span style="white-space: nowrap;">want orphaned words.</span>
</td>
In a fluid/responsive layout, if you do it right, the last few words will break to a second line until there is room for those words to appear on one line.
Read more about the white-space property at this link: http://www.w3schools.com/cssref/pr_text_white-space.asp
EDIT: 12/19/2015 - Since this isn't supported in Outlook, I've been adding a non-breaking space between the last two words in a sentence. It's less code, and supported everywhere.
EDIT: 2/20/2018 - I've discovered that the Outlook App (iOS and Android) doesn't support the &nbsp; entity, so I've had to combine both solutions, e.g.:
<td>
I don't <span style="white-space:nowrap;">want&nbsp;orphaned&nbsp;words.</span>
</td>
In short, no. This is something that has driven print designers crazy for years, but HTML does not provide this level of control.
If you absolutely positively want this, and understand the speed implications, you can try the suggestion here:
detecting line-breaks with jQuery?
That is the best solution I can imagine, but that does not make it a good solution.
I see there are 3rd party plugins suggested, but it's simpler to do it yourself. If all you want to do is replace the last space character with a non-breaking space, it's almost trivial:
const unorphanize = (str) => {
let iLast = str.lastIndexOf(' ');
let stArr = str.split('');
stArr[iLast] = '&nbsp;';
return stArr.join('')
}
I suppose this may miss some unique cases, but it's worked for all my use cases. The caveat is that you can't just plug the output in where text would go; you have to set innerHTML = unorphanize(text) or otherwise parse it.
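If the result is going into textContent rather than innerHTML, one option is to insert the Unicode no-break space character (U+00A0) instead of an HTML entity. A minimal sketch along those lines; unorphanizeText is a hypothetical name, and trimming trailing whitespace first is an added assumption so that a trailing space is not the one that gets replaced:

```javascript
// Replace the last space with U+00A0 so the final two words stay together.
// The result is plain text, safe to assign to element.textContent.
function unorphanizeText(str) {
  const trimmed = str.trimEnd();
  const iLast = trimmed.lastIndexOf(" ");
  if (iLast === -1) return str; // single word: nothing to glue together
  return trimmed.slice(0, iLast) + "\u00A0" + trimmed.slice(iLast + 1);
}

console.log(unorphanizeText("This paragraph ends with an orphaned word."));
```

Usage would then be something like element.textContent = unorphanizeText(element.textContent) for an element that contains text only.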
If you want to handle it yourself, without jQuery, you can write a javascript snippet to replace the text, if you're willing to make a couple assumptions:
A sentence always ends with a period.
You always want to replace the whitespace before the last word with &nbsp;
Assuming you have this html (which is styled to break right before "end" in my browser...monkey with the width if needed):
<div id="articleText" style="width:360px;color:black; background-color:Yellow;">
This is some text with one word on its own line at the end.
<p />
This is some text with one word on its own line at the end.
</div>
You can create this javascript and put it at the end of your page:
<script type="text/javascript">
reformatArticleText();
function reformatArticleText()
{
var div = document.getElementById("articleText");
div.innerHTML = div.innerHTML.replace(/\s(\S*)\./g, "&nbsp;$1.");
}
</script>
The regex simply finds all instances (using the g flag) of a whitespace character (\s) followed by any number of non-whitespace characters (\S*) followed by a period. It creates a back-reference to the non-whitespace characters that you can use in the replacement text.
You can use a similar regex to include other end punctuation marks.
If third-party JavaScript is an option, one can use typogr.js, a JavaScript "typogrify" implementation. This particular filter is called, unsurprisingly, Widont.
<script src="https://cdnjs.cloudflare.com/ajax/libs/typogr/0.6.7/typogr.min.js"></script>
<script>
document.body.innerHTML = typogr.widont(document.body.innerHTML);
</script>
</body>
It seems + is not the right operator to handle the concatenation of strings in JavaScript. What are some alternatives to handle both the LTR and RTL cases?
The problem is, + is not the right operator to concatenate strings at all. Or maybe it is, but concatenating strings is an internationalization bug.
Instead of simply concatenating them, one should actually format them. So what you should actually do, is use placeholders:
var somePattern = "This language is written {0}.";
var someMessage = somePattern.format("LTR");
This way, the translator would be able to re-order the sentence, including word order. And I believe it solves your problem.
For formatting function, let me quote this excellent answer:
String.prototype.format = function() {
var args = arguments;
return this.replace(/\{(\d+)\}/g, function() {
return args[arguments[1]];
});
};
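For anyone who would rather not extend String.prototype, the same placeholder idea can be written as a standalone function. This is a sketch, with formatMessage as a hypothetical name and made-up example patterns; it shows how a translator can reorder the placeholders without any change to the calling code.

```javascript
// Replace {0}, {1}, ... with the corresponding argument.
// Placeholders without a matching argument are left untouched.
function formatMessage(pattern, ...args) {
  return pattern.replace(/\{(\d+)\}/g, (match, digits) => {
    const i = Number(digits);
    return i < args.length ? String(args[i]) : match;
  });
}

// Same arguments, different word order chosen by the translator.
console.log(formatMessage("{0} wrote {1}.", "Ana", "a book"));
// "Ana wrote a book."
console.log(formatMessage("{1} was written by {0}.", "Ana", "A book"));
// "A book was written by Ana."
```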
EDIT: Adding information about directionality marks.
Sometimes, when you have multiple placeholders, you may lose control of the string direction, i.e. {0}/{1} would still be shown as first/second instead of the desired second/first. To fix this, you would add a strong directionality mark to the pattern, i.e. {0}/{1} with &rlm; added. &rlm; is an HTML entity that resolves to Unicode code point U+200F, the right-to-left strong directionality mark.
Actually, assuming both strings are localized and you want the string on the right to be displayed logically after the string on the left, then + sometimes works fine. Strings in languages such as Arabic should be displayed RTL (right to left) on the screen, but the character ordering is still meant to be LTR (left to right) in memory. So the + operator is logically consistent to use for generating an 'ordered list' of terms in any language.
But there are also scenarios where + does not solve the problem correctly. There are scenarios where the correct solution is to follow the grammar of the containing language. For instance, are you really embedding an English word in an Arabic sentence? Or vice versa? Regardless, the solution here is to do string formatting, where the localized containing sentence has a placeholder for the foreign term, like {0}.
The third case is what if there is no grammatical relationship because it is just two separate sentences? In this case there is no correct ordering. E.g. if you have an English sentence displayed in front of an Arabic sentence. An English speaker will probably read the sentences LTR (left sentence first, then right). An Arabic speaker will probably read the sentences RTL. Either way it's unclear to everyone which order the author intended the sentences to be read in. :)