Is there a method to encrypt text so that the output is common English/Spanish text (or something similar), and to be able to decrypt it too?
I tried the Caesar cipher
http://en.wikipedia.org/wiki/Caesar_cipher
Plaintext: THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG
Ciphertext: QEB NRFZH YOLTK CLU GRJMP LSBO QEB IXWV ALD
but I'd like the output for example:
Plaintext: THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG
Ciphertext: RADIO LIBRARY MAKE TABLE TIME ON KITCHEN DAY OF
Here's a possible solution. There may be performance issues with having the English or Spanish dictionary in an array, but you may just need common words.
function wordSwap(str) {
  // Dictionary of common words; in practice this would be a full English or Spanish word list.
  var dictionary = ['a', 'the', 'brown', 'fox', 'over' /* , ... */];
  // Shuffle a copy of the dictionary (Fisher-Yates) to build the substitution table.
  // (For decryption, both sides would need to share the same shuffled table.)
  var swapDictionary = dictionary.slice();
  for (var i = swapDictionary.length - 1; i > 0; i--) {
    var j = Math.floor(Math.random() * (i + 1));
    var tmp = swapDictionary[i];
    swapDictionary[i] = swapDictionary[j];
    swapDictionary[j] = tmp;
  }
  var newStr = "";
  str.split(' ').forEach(function (s) {
    var idx = dictionary.indexOf(s);
    // Fall back to the original word if it is not in the dictionary.
    newStr += (idx >= 0 ? swapDictionary[idx] : s) + " ";
  });
  return newStr.trim();
}
Sure, this is possible with a specifically crafted one-time pad. XOR the plaintext and the target ciphertext and you get the key: key.length = max(pt.length, ct.length). This obviously works only for that one PT/CT pair.
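For instance, a quick Node.js sketch of the idea (zero-padding the shorter of the two strings is an assumption made here for illustration):
// Derive a one-time-pad key that maps a given plaintext to a chosen,
// innocent-looking ciphertext by XORing the two byte-for-byte.
function deriveOtpKey(plaintext, ciphertext) {
  const pt = Buffer.from(plaintext, 'utf8');
  const ct = Buffer.from(ciphertext, 'utf8');
  const len = Math.max(pt.length, ct.length);
  const key = Buffer.alloc(len);
  for (let i = 0; i < len; i++) {
    // Missing bytes are treated as 0, i.e. the shorter string is zero-padded.
    key[i] = (pt[i] || 0) ^ (ct[i] || 0);
  }
  return key;
}
// XOR the plaintext with this key to produce the ciphertext, and XOR the
// ciphertext with the same key to recover the plaintext.
const key = deriveOtpKey('THE QUICK BROWN FOX', 'RADIO LIBRARY MAKE');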
Jack's answer is pretty straightforward, and matches your Caesar cipher well, but it's not very secure. It's just a substitution cipher with a much bigger "alphabet". Like your Caesar cipher, that means it can be broken using frequency analysis. The words THE and AND are pretty common in English. EL and LA are extremely common in Spanish. So I look for "words" that show up very commonly in the ciphertext and assume that they map to common words in my target language. I continue making guesses based on frequency and context until I work out portions of the message (or even the whole message). If I know this is probably about poodles, and I see that SUNRISE shows up often in the message, maybe I assume that SUNRISE is a poodle and work from there.
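To make that concrete, here is a tiny sketch of the first step of such an attack, which is nothing more than a word count of the ciphertext:
// Count how often each cipher "word" appears; the most frequent ones are good
// candidates for very common words in the target language (THE, AND, EL, LA, ...).
function wordFrequencies(ciphertext) {
  const counts = {};
  for (const w of ciphertext.split(/\s+/)) {
    counts[w] = (counts[w] || 0) + 1;
  }
  return Object.entries(counts).sort((a, b) => b[1] - a[1]);
}
console.log(wordFrequencies('RADIO LIBRARY MAKE RADIO TABLE RADIO ON DAY RADIO OF'));
// [["RADIO", 4], ["LIBRARY", 1], ...]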
I like it for being simple, but I don't like it so much if I want security.
We could devise a format preserving encryption scheme, which is what you kind of want here, but I'm not familiar with one that is designed to work on such a large domain (it's an area you could investigate though, or ask on http://crypto.stackexchange.com, which would be a better place for this question). The advantage of format preserving encryption is that the resulting message should be the same size as the original message.
But here's another solution that we could use, which is kind of a base-N encoding, where N is the size of our dictionary.
Start with an ordered dictionary and your plaintext. Look up each word in your dictionary and note the index. Use those indexes to create a new message where the word size is based on the number of elements in your dictionary. For simplicity, you could round this up to 64 bits per term, but you could also make each term any arbitrary number of bits if you're willing to do more bit math and let data spill across byte boundaries. Encrypt that message however you like (e.g. AES).
Now we need to encode that back into words. For values less than N-1, we just select that word out of the dictionary. For numbers equal to N-1 or greater, you can use the last word in the dictionary as a marker and then add the next word to it. So say we had a 1000-word dictionary (0..999) from A to ZYRIAN. We could encode 999 as ZYRIAN A and 1000 as ZYRIAN AARDVARK. If we needed to encode larger numbers, we can chain: for example, ZYRIAN ZYRIAN A is 1998. Of course you'll get better output sizes if you again let data split across byte boundaries so that no value is greater than 2*N.
The key here is that we've broken the problem into two problems: a transcoder that allows us to convert between arbitrary words and numbers, and encryption, which we can do using any standard encryption scheme.
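A minimal sketch of just the transcoder half, using the marker-word chaining described above (the tiny dictionary here is a stand-in for a real 1000-word list):
// With a 1000-word dictionary (indices 0..999), 999 -> ZYRIAN A,
// 1000 -> ZYRIAN AARDVARK, and 1998 -> ZYRIAN ZYRIAN A.
const dictionary = ['a', 'aardvark', /* ...the rest of the ordered word list... */ 'zyrian'];
const N = dictionary.length;
function numberToWords(value) {
  const words = [];
  // Values of N-1 or more repeat the marker word (index N-1) and carry the remainder.
  while (value >= N - 1) {
    words.push(dictionary[N - 1]);
    value -= N - 1;
  }
  words.push(dictionary[value]);
  return words;
}
function wordsToNumber(words) {
  // The sum of the indices recovers the original value.
  return words.reduce((sum, w) => sum + dictionary.indexOf(w), 0);
}
The encryption step then operates purely on the numeric stream, so both sides only need to agree on the ordered word list and the key.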
I'm very new to programming, and I'm trying to solve an exercise where you encode a string (in this case, a single word) based on whether or not the constituent characters occur twice or more. Characters occurring only once encode to, say, "■", and characters occurring twice or more encode to, say, "X".
Example: input = "hippodrome" :: output = "■■XXX■■X■■"
I managed to solve it in a very convoluted way using nested loops and a key:value object storing character:occurrences, but I am trying to refactor the solution to be more efficient using a dynamically created RegExp. I think I'm not understanding regex notation, though.
function encodeDupes(word) {
  let encoded = "";
  for (let char of word) {
    let regex = new RegExp(char + "{2,}", "ig"); // create a regex to see if "char" occurs 2 or more times
    regex.test(word) ? encoded += "X" : encoded += "■"; // check this char against rest of word, push appropriately
  }
  return encoded;
}
It works with a simpler gate like char < "m" ? do X : do Y, and I thought I understood this answer here ({n,} = at least n occurrences), but I'm new enough that I'm still not sure whether it's my regex or my logic.
Thank you!
I'm very new to programming, ..., I am trying to refactor the solution to be more efficient using a dynamically created RegExp...
That's a bit of a catch-22, because regular expressions trade efficiency for convenience. In order for the regular expression "engine" to run, a grammar must be established, and a lexer, parser, and evaluator transform the string-based input expressions into program output. It's (sometimes) convenient to implement a particular program using regular expressions, but it's almost impossible to beat a fundamental algorithm that isn't slowed down by the regular expression engine.
I managed to solve it in a very convoluted way using nested loops and a key:value object storing character:occurrences ...
Convoluted indeed, but sadly not uncommon to see even "expert" programmers do such things. An efficient algorithm emerges when we realise we don't need to count each letter. Instead, we only need to know whether a letter occurs more than once. Using two Set objects, once and more, we can determine the answer without needing to allocate counter memory per letter! And sets are lightning fast, thanks to O(1) constant-time lookup -
function encodeDupes(word) {
  const once = new Set
  const more = new Set
  for (const c of word)
    if (more.has(c))
      continue
    else if (once.has(c))
      (once.delete(c), more.add(c))
    else
      once.add(c)
  return Array
    .from(word, c => more.has(c) ? "X" : "■")
    .join("")
}
console.log(encodeDupes("hippodrome"))
Output
■■XXX■■X■■
Usually, RegExps are used to match entire words or phrases.
Whenever {2,} is used, it searches for two or more consecutive occurrences of the preceding character. Here's an example:
n{2,}
nn # match
anna # match
nan # does not match
The following RegExp isn't perfect, but it should suffice; replace n with the character of your choice:
(.*n{2,}.*)|(.*n.*n.*)+
(.*n{2,}.*) —— for consecutive ‘n’s
| —— or
(.*n.*n.*)+ —— ‘n’s with anything in between
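For example, dropped into your encodeDupes function, the "anything in between" form might look like this (a sketch that assumes the input contains only plain letters, not regex metacharacters):
function encodeDupes(word) {
  let encoded = "";
  for (const char of word) {
    // Two occurrences of the character anywhere in the word, not necessarily adjacent.
    const regex = new RegExp(char + ".*" + char, "i");
    encoded += regex.test(word) ? "X" : "■";
  }
  return encoded;
}
console.log(encodeDupes("hippodrome")); // "■■XXX■■X■■"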
Let me know how it goes.
If you double-click English text in Chrome, the whitespace-delimited word you clicked on is highlighted. This is not surprising. However, the other day I was clicking while reading some text in Japanese and noticed that some words were highlighted at word boundaries, even though Japanese doesn't have spaces. Here's some example text:
どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。
For example, if you click on 薄暗い, Chrome will correctly highlight it as a single word, even though it's not a single character class (this is a mix of kanji and hiragana). Not all the highlights are correct, but they don't seem random.
How does Chrome decide what to highlight here? I tried searching the Chrome source for "japanese word" but only found tests for an experimental module that doesn't seem active in my version of Chrome.
So it turns out v8 has a non-standard multi-language word segmenter and it handles Japanese.
function tokenizeJA(text) {
  var it = Intl.v8BreakIterator(['ja-JP'], { type: 'word' })
  it.adoptText(text)
  var words = []
  var cur = 0, prev = 0
  while (cur < text.length) {
    prev = cur
    cur = it.next()
    words.push(text.substring(prev, cur))
  }
  return words
}
console.log(tokenizeJA('どこで生れたかとんと見当がつかぬ。何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。'))
// ["どこ", "で", "生れ", "たか", "とんと", "見当", "が", "つ", "か", "ぬ", "。", "何でも", "薄暗い", "じめじめ", "した", "所", "で", "ニャーニャー", "泣", "い", "て", "いた事", "だけ", "は", "記憶", "し", "て", "いる", "。"]
I also made a jsfiddle that shows this.
The quality is not amazing but I'm surprised this is supported at all.
Based on links posted by JonathonW, the answer basically boils down to: "There's a big list of Japanese words and Chrome checks to see if you double-clicked in a word."
Specifically, v8 uses ICU to do a bunch of Unicode-related text processing things, including breaking text up into words. The ICU boundary-detection code includes a "Dictionary-Based BreakIterator" for languages that don't have spaces, including Japanese, Chinese, Thai, etc.
And for your specific example of "薄暗い", you can find that word in the combined Chinese-Japanese dictionary shipped by ICU (line 255431). There are currently 315,671 total Chinese/Japanese words in the list. Presumably if you find a word that Chrome doesn't split properly, you could send ICU a patch to add that word.
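As a side note, the same ICU word segmentation is now reachable from JavaScript through the standard Intl.Segmenter API in browsers that support it, without the non-standard v8BreakIterator:
const segmenter = new Intl.Segmenter('ja-JP', { granularity: 'word' });
const text = '何でも薄暗いじめじめした所でニャーニャー泣いていた事だけは記憶している。';
// Each segment object carries the matched substring and whether it is "word-like".
console.log([...segmenter.segment(text)].map(s => s.segment));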
It is still rudimentary (as of 2022-11-27), but Google is progressing very fast in the various fields of language parsing.
As of today's state of the code, Google Chrome breaks this text into |生れ|たか| and |泣|い|て|いた事|. Both 'たか' and 'いた事' are lexically odd, since 'たか' and 'いた' (a) are used agglutinated with the previous string 99.9% of the time and (b) have very little meaning on their own (usage frequency beyond the 10,000th rank).
For Chinese and Japanese, anyone can get better results with a vocabulary list of just 100,000 items (you add to the list as you read) organized from the longest strings down to the shortest (single characters). For Chinese I set the maximum length at 5 characters; anything longer is the name of an organization or the like. For Japanese I set the maximum at 9 characters. Tonal languages have (about 65%) shorter words compared to non-tonal ones.
To parse a paragraph, you run a "do while" loop that starts from the first character and first tries to find the longest possible string in the vocabulary list; if that isn't successful, the search proceeds towards the end of the list, to shorter parts of words with less meaning, until it gets down to simple letters or rare single characters (you need to have all of these single items in the list, e.g. all 6,000 kanji/hanzi needed for daily reading).
You set a separator when you encounter punctuation or numbers, and you skip to the next word.
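A minimal sketch of that greedy longest-match loop (the vocabulary list and the separator characters here are simplified placeholders):
// Try the longest candidate in the vocabulary first, then progressively shorter
// ones, falling back to a single character. maxLen would be 5 for Chinese or 9
// for Japanese, as described above.
function segment(text, vocabulary, maxLen) {
  const vocab = new Set(vocabulary);
  const words = [];
  let i = 0;
  while (i < text.length) {
    // Punctuation and digits act as separators: emit them and skip to the next word.
    if (/[\d。、．，！？]/.test(text[i])) {
      words.push(text[i]);
      i += 1;
      continue;
    }
    let len = Math.min(maxLen, text.length - i);
    while (len > 1 && !vocab.has(text.slice(i, i + len))) {
      len -= 1;
    }
    words.push(text.slice(i, i + len));
    i += len;
  }
  return words;
}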
It would be easier if I showed this at work but I don't know if people are interested and if I can post video links here.
I want the user of my node.js application to write down ideas, which then get stored in a database.
So far so good, but I don't want redundant entries in that table, so I decided to check for similarity using this package:
https://www.npmjs.com/package/string-similarity-js
Do you know a way in which I can compare two strings by meaning? Like getting a high similarity score for "using public transport" vs. "driving by train", which scores very poorly with the package above.
To compare two strings by meaning, the strings first need to be converted to tensors, and then the distance or similarity between the tensors can be evaluated. Many algorithms can be used to convert strings to tensors, all related to the domain of interest. But the Universal Sentence Encoder is a broad, general-purpose sentence encoder that projects all words into a single embedding space. Cosine similarity can then be used to see how close some words are in meaning.
Example
Though king and kind are close in Hamming distance (they differ by only one character), they are very different in meaning. Whereas queen and king, though they seem unrelated (their spellings are quite different), are close in meaning. Therefore the distance (in meaning) between king and queen should be smaller than between king and kind, as demonstrated in the following snippet.
<script src="https://cdn.jsdelivr.net/npm/#tensorflow/tfjs"></script>
<script src="https://cdn.jsdelivr.net/npm/#tensorflow-models/universal-sentence-encoder"></script>
<script>
(async() => {
const model = await use.load();
const embeddings = (await model.embed(['queen', 'king', 'kind'])).unstack()
tf.losses.cosineDistance(embeddings[0], embeddings[1], 0).print() // 0.39812755584716797
tf.losses.cosineDistance(embeddings[1], embeddings[2], 0).print() // 0.5585797429084778
})()
</script>
Comparing the meaning of two strings is still an area of ongoing research. If you really want to solve the problem (or to get really good performance out of your language model) you should consider getting a PhD.
For an out-of-the-box solution at the time: I found this GitHub repo that implements Google's BERT model and uses it to get the embeddings of two sentences. In theory, two sentences share the same meaning if their embeddings are similar.
https://github.com/UKPLab/sentence-transformers
# the following is simplified from their README.md
import scipy.spatial
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('bert-base-nli-mean-tokens')
# Corpus with example sentences
S1 = ['A man is eating a food.']
S2 = ['A man is eating pasta.']
s1_embedding = embedder.encode(S1)
s2_embedding = embedder.encode(S2)
# encode() returns one embedding per sentence, so the arrays go to cdist directly
dist = scipy.spatial.distance.cdist(s1_embedding, s2_embedding, "cosine")[0]
Example output (copied from their README.md)
Query: A man is eating pasta.
Top 5 most similar sentences in corpus:
A man is eating a piece of bread. (Score: 0.8518)
A man is eating a food. (Score: 0.8020)
A monkey is playing drums. (Score: 0.4167)
A man is riding a horse. (Score: 0.2621)
A man is riding a white horse on an enclosed ground. (Score: 0.2379)
I'm trying to find a fast (milliseconds or seconds) solution for taking an inputted block of text and a large list (11 million) of specific words/phrases to test against. So I would like to see which words/phrases are in the inputted paragraph.
We use Javascript and have SQL, MongoDB & DynamoDB as existing data stores that we can integrate this solution into.
I've done some searching on this problem but can only find solutions for checking whether given words exist in a text, not the other way around.
All ideas are welcome!
In cases like these you want to eliminate as much unnecessary data as possible. Assuming that order matters:
1. First things first, make sure you have a B-tree index built on your phrases database, clustered on the phrase. This will speed up range lookup times.
2. Let n = 2 (or 1, if you're into that).
3. Split the text block into phrases of length n and perform a query for phrases in the dictionary that begin with any of the phrase pairs ('My Phrase%'). This won't perform 4521 million string comparisons, thanks to the index.
4. Remember the phrases that are an exact match.
5. Let n = n + 1.
6. Repeat from step 3 using the reduced dictionary, until the reduced dictionary is empty.
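A rough sketch of that loop in JavaScript, assuming a phrases(phrase) table with the B-tree index from step 1 and a hypothetical async query(sql, params) helper:
// For simplicity this re-queries the indexed table on each pass instead of
// carrying an explicitly reduced dictionary forward.
async function findPhrases(text) {
  const words = text.split(/\s+/);
  const matches = new Set();
  for (let n = 2; n <= words.length; n++) {
    // Build every n-word window from the text block.
    const grams = [];
    for (let i = 0; i + n <= words.length; i++) {
      grams.push(words.slice(i, i + n).join(' '));
    }
    // Range lookup: phrases that begin with any window ('My Phrase%' can use the index).
    const rows = await query(
      'SELECT phrase FROM phrases WHERE ' + grams.map(() => 'phrase LIKE ?').join(' OR '),
      grams.map(g => g + '%')
    );
    if (rows.length === 0) break; // nothing in the dictionary starts with these windows
    // Remember only the exact matches.
    rows.forEach(r => { if (grams.includes(r.phrase)) matches.add(r.phrase); });
  }
  return [...matches];
}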
You can also make small optimizations here and there depending on the kind of matches you're looking for, such as, not matching across punctuation, only phrases of a certain word length, etc. In any case, I'd expect the time bottleneck here to be on disk access, rather than actual comparisons.
Also, I'm pretty sure I based this algorithm off of an existing one but I don't remember its name so bonus points to anyone who can name it. I think it had something to do with data warehousing/mining and calculating frequencies and patterns?
I am working on a project in which I have to extract text data from a PDF.
I am able to extract text from the PDF, but the extracted text sometimes contains lines which I would like to strip out.
Here's an example of unwanted lines:
ISBN 0-7225-3293-8. = CONTENTS = Part One Part Two Epilogue
Page 1 / 94
And here's an example of a good line (which I'd like to keep):
Dusk was falling as the boy arrived with his herd at an abandoned church.
I wanted to sleep a little longer, he thought. He had had the same dream that night as a week ago
Different PDFs can produce different unwanted lines.
How can I detect them?
Option 1 - Give the computer a rule: If you are able to narrow down what content it is that you would like to keep, the obvious criterion that sticks out to me is the exclusion of special characters; you can then filter your results based on this.
So let's say you agree that all "good lines" will be without special characters ('/', '-', and '=') for example; if a line DOES contain one of these characters, you know you can remove it from the content you are keeping. This could be done in a for loop containing an if-then condition that looks something like this:
var lineArray = []; // code needed to make each line of the file an element of the array

for (var cnt = 0; cnt < lineArray.length; cnt++) {
  var line = lineArray[cnt];
  if (line.includes("/") || line.includes("-") || line.includes("=")) {
    lineArray[cnt] = "";
  }
}
At the end of this code you could simply get all the text remaining in the array, and it would no longer contain the unwanted lines. If, however, there are unwanted lines that are virtually indistinguishable by characters, length, positioning, etc., this approach begins to break down on some of the trickier lines.
This is because there is no rule you can give the computer to distinguish between the good and the bad without giving it a brain like yours that recognizes parts of speech and sentence structure. In which case you might consider option 2, which is just that.
Option 2 - Give the computer a brain: Given that the text you want to remove will more or less be incoherent documentation, based on what you have shown us, an open-source (or purchased) natural language processor may be what you are looking for.
I found a good beginner's intro at http://myreaders.info/10_Natural_Language_Processing.pdf with some information that might be of use to you. From the source,
"Linguistics is the science of language. Its study includes:
sounds (phonology),
word formation (morphology),
sentence structure (syntax),
meaning (semantics), and understanding (pragmatics) etc.
Syntactic Analysis : Here the analysis is of words in a sentence to know the grammatical structure of the sentence. The words are transformed into structures that show how the words relate to each others. Some word sequences may be rejected if they violate the rules of the language for how words may be combined. Example: An English syntactic analyzer would reject the sentence say : 'Boy the go the to store.' "
Using some sort of NLP, you can discover whether a given section of text contains a sentence or some incoherent rambling. This test could then be used as a filter in your program for what you would like to keep or remove.
Side note - As it appears your sample text is not just sentences but literature, sometimes characters will speak in sentence fragments as part of the nature given to them by the author. In this case, you could add a separate condition: if the text is contained within two quotation marks and has no special characters, you want to keep the text regardless.
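A tiny sketch of that extra condition (my guess at the rule, combined with the special-character check from Option 1):
// Keep quoted dialogue even when it is a sentence fragment, as long as it
// contains none of the special characters used to flag unwanted lines.
function isQuotedDialogue(line) {
  var trimmed = line.trim();
  var quoted = /^["“].*["”]$/.test(trimmed);
  var hasSpecial = /[\/=-]/.test(trimmed);
  return quoted && !hasSpecial;
}
console.log(isQuotedDialogue('"I wanted to sleep a little longer."')); // true
console.log(isQuotedDialogue('Page 1 / 94')); // false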
In the end NLP may be more work than you require or that you want to do, in which case Option 1 is likely going to be your best bet. On the other hand, it may be just the thing you are looking for. Whatever the case or if you decide you need some combination of the two, best of luck! I hope this answer helps.