Minifying javascript via unicode - javascript

On dwitter.net I often see dweets that are encoded in an interesting way to reduce the JS character count,
for example https://www.dwitter.net/d/22372 (or https://www.dwitter.net/d/11506):
eval(unescape(escape`𮀮𩡯𫡴🐧𜡥𫐠𨐧𛁸𛡦𪑬𫁔𩑸𭀨𙱜𭐲𝠲𜀠𙰬𜰬𜠵𚐊𭀿𜀺𩀽𮀮𩱥𭁉𫑡𩱥𡁡𭁡𚀰𛀰𛁶🐳𝠬𭠩𛡤𨑴𨐊𩡯𬠨𨰮𭱩𩁴𪁼👷👩🐹𜰶𞱩𛐭𞰩𩐽𪐥𭠪𝠬𩁛𪐪𝀫𜱝🠵𜁼𯁸𛡦𪑬𫁒𩑣𭀨𦀽𩐫𩐯𜠪𤰨𭀭𪐯𭰩𚱷𛁩𛰳𛑥𚡃𚁴𛑘𛰹𞐩𚱥𚰵𜀬𞐬𪐼𜐿𭰺𞐩`.replace(/u../g,'')))
I understand how to decode this and read the JavaScript; that part is pretty trivial:
unescape(escape`𮀮𩡯𫡴🐧𜡥𫐠𨐧𛁸𛡦𪑬𫁔𩑸𭀨𙱜𭐲𝠲𜀠𙰬𜰬𜠵𚐊𭀿𜀺𩀽𮀮𩱥𭁉𫑡𩱥𡁡𭁡𚀰𛀰𛁶🐳𝠬𭠩𛡤𨑴𨐊𩡯𬠨𨰮𭱩𩁴𪁼👷👩🐹𜰶𞱩𛐭𞰩𩐽𪐥𭠪𝠬𩁛𪐪𝀫𜱝🠵𜁼𯁸𛡦𪑬𫁒𩑣𭀨𦀽𩐫𩐯𜠪𤰨𭀭𪐯𭰩𚱷𛁩𛰳𛑥𚡃𚁴𛑘𛰹𞐩𚱥𚰵𜀬𞐬𪐼𜐿𭰺𞐩`.replace(/u../g,''))
returns
x.font='2em a',x.fillText('\u2620 ',3,25)
t?0:d=x.getImageData(0,0,v=36,v).data
for(c.width|=w=i=936;i--;)e=i%v*6,d[i*4+3]>50||x.fillRect(X=e+e/2*S(t-i/w)+w,i/3-e*C(t-X/99)+e+50,9,i<1?w:9)
What I don't understand is how to encode JS like this.
I noticed there is an intermediate step in this process.
Running:
escape`𮀮𩡯𫡴🐧𜡥𫐠𨐧𛁸𛡦𪑬𫁔𩑸𭀨𙱜𭐲𝠲𜀠𙰬𜰬𜠵𚐊𭀿𜀺𩀽𮀮𩱥𭁉𫑡𩱥𡁡𭁡𚀰𛀰𛁶🐳𝠬𭠩𛡤𨑴𨐊𩡯𬠨𨰮𭱩𩁴𪁼👷👩🐹𜰶𞱩𛐭𞰩𩐽𪐥𭠪𝠬𩁛𪐪𝀫𜱝🠵𜁼𯁸𛡦𪑬𫁒𩑣𭀨𦀽𩐫𩐯𜠪𤰨𭀭𪐯𭰩𚱷𛁩𛰳𛑥𚡃𚁴𛑘𛰹𞐩𚱥𚰵𜀬𞐬𪐼𜐿𭰺𞐩`
returns
%uD878%uDC2E%uD866%uDC6F%uD86E%uDC74%uD83D%uDC27%uD832%uDC65%uD86D%uDC20%uD861%uDC27%uD82C%uDC78%uD82E%uDC66%uD869%uDC6C%uD86C%uDC54%uD865%uDC78%uD874%uDC28%uD827%uDC5C%uD875%uDC32%uD836%uDC32%uD830%uDC20%uD827%uDC2C%uD833%uDC2C%uD832%uDC35%uD829%uDC0A%uD874%uDC3F%uD830%uDC3A%uD864%uDC3D%uD878%uDC2E%uD867%uDC65%uD874%uDC49%uD86D%uDC61%uD867%uDC65%uD844%uDC61%uD874%uDC61%uD828%uDC30%uD82C%uDC30%uD82C%uDC76%uD83D%uDC33%uD836%uDC2C%uD876%uDC29%uD82E%uDC64%uD861%uDC74%uD861%uDC0A%uD866%uDC6F%uD872%uDC28%uD863%uDC2E%uD877%uDC69%uD864%uDC74%uD868%uDC7C%uD83D%uDC77%uD83D%uDC69%uD83D%uDC39%uD833%uDC36%uD83B%uDC69%uD82D%uDC2D%uD83B%uDC29%uD865%uDC3D%uD869%uDC25%uD876%uDC2A%uD836%uDC2C%uD864%uDC5B%uD869%uDC2A%uD834%uDC2B%uD833%uDC5D%uD83E%uDC35%uD830%uDC7C%uD87C%uDC78%uD82E%uDC66%uD869%uDC6C%uD86C%uDC52%uD865%uDC63%uD874%uDC28%uD858%uDC3D%uD865%uDC2B%uD865%uDC2F%uD832%uDC2A%uD853%uDC28%uD874%uDC2D%uD869%uDC2F%uD877%uDC29%uD82B%uDC77%uD82C%uDC69%uD82F%uDC33%uD82D%uDC65%uD82A%uDC43%uD828%uDC74%uD82D%uDC58%uD82F%uDC39%uD839%uDC29%uD82B%uDC65%uD82B%uDC35%uD830%uDC2C%uD839%uDC2C%uD869%uDC3C%uD831%uDC3F%uD877%uDC3A%uD839%uDC29
which then has every "u" plus the two characters after it stripped by .replace(/u../g,''). Getting from minified JavaScript to this string isn't easy for me.
Simply running encodeURIComponent() or escape() gets you part of the way there, but not all the way.
So how do I convert the string of my JavaScript into a string containing %uD followed by the character code for each character?

I am also on dwitter.
The code compressor actually began with a dweet (https://www.dwitter.net/d/23092).
It was made so people could fit more code into their demos by going right up to 194 chars instead of being limited to 140.
Note that this reduces the number of characters, not the byte size.
There is also an uncompressor at https://www.dwitter.net/d/14246
The simplified code for this is a simple unpack function:
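function unpack(strange_blocky_code) {
  // Find where the packed payload starts (case-insensitive).
  const index = strange_blocky_code.toLowerCase().search(/eval\(unescape\(escape`/)
  if (index >= 0) {
    const start = strange_blocky_code.slice(0, index)
    const end = strange_blocky_code.slice(index)
    // Drop the leading "eval" so the expression yields the decoded source instead of running it.
    const result = eval(end.slice(4))
    if (result) return start + result // returns readable (but trivial) code
  }
}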
The simplified compressing code is:
function compress(readable_code) {
  const value = [...readable_code.trim()]
  let code = ''
  for (let character of value) {
    const char = character.charCodeAt(0)
    // Characters above 255 cannot be packed directly, so write them as \uXXXX escapes.
    if (char > 255) character = escape(character).replace(/%u/g, "\\u")
    code += character
  }
  // Pad to an even length, then turn each pair of characters into one surrogate pair.
  const compressed =
    String.fromCharCode(...[...code.length % 2 ? code + ";" : code]
      .map((item, index) =>
        item.charCodeAt() | (index % 2 ? 0xDF00 : 0xDB00)
      )
    )
  return `eval(unescape(escape\`${compressed}\`.replace(/u../g,'')))`
}
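To see why the trick works, here is a minimal round-trip sketch. The names packPairs and unpackPairs are made up for illustration and are not part of the dweet; it assumes every character of the source has a char code of 255 or less. It uses 0xD800/0xDC00 as the surrogate bases, which is what the string in the question uses (the compressor above uses 0xDB00/0xDF00; both decode the same way).

```javascript
// Hypothetical helper: pack two Latin-1 characters into one surrogate pair,
// so that two source characters count as a single code point.
function packPairs(source) {
  for (const ch of source) {
    if (ch.charCodeAt(0) > 255) throw new RangeError('only char codes <= 255 can be packed');
  }
  // Pad with ";" so the characters always come in pairs.
  const padded = source.length % 2 ? source + ';' : source;
  let packed = '';
  for (let i = 0; i < padded.length; i += 2) {
    packed += String.fromCharCode(
      0xD800 | padded.charCodeAt(i),     // high surrogate carries the first character
      0xDC00 | padded.charCodeAt(i + 1)  // low surrogate carries the second character
    );
  }
  return packed;
}

// Hypothetical helper: escape() yields %uD8xx%uDCyy, stripping "uD8"/"uDC"
// leaves %xx%yy, and unescape() turns that back into the two characters.
function unpackPairs(packed) {
  return unescape(escape(packed).replace(/u../g, ''));
}

const packed = packPairs('x.font');
console.log([...packed].length);  // 3 code points for 6 source characters
console.log(escape(packed));      // starts with %uD878%uDC2E, as in the question
console.log(unpackPairs(packed)); // x.font
```

Wrapping the packed string as eval(unescape(escape`...`.replace(/u../g,''))) then gives a dweet in the same form as the one in the question.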
If you're looking for editors, these are two that I like to use:
https://greyhope.uk/Dweet-Runner/index.html made by GreyHope
https://dweetabase.3d2k.com/ made by Frank Force
I hope this helps at all.

Related

Leading and trailing zeros in numbers

I am working on a project where I need to format incoming numbers in the following way:
###.###
However I noticed some results I didn't expect.
The following works in the sense that I don't get an error:
console.log(07);
// or in my case:
console.log(007);
Of course, it will not retain the '00' in the value itself, since that value is effectively 7.
The same goes for the following:
console.log(7.0);
// or in my case:
console.log(7.000);
JavaScript understands what I am doing, but in the end the actual value will be 7, which can be proven with the following:
const leadingValue = 007;
const trailingValue = 7.00;
console.log(leadingValue, trailingValue); // both are exactly 7
But what I find curious is the following: the moment I combine these two, I get a syntax error:
// this throws a SyntaxError:
console.log(007.000);
1) Can someone explain why this isn't working?
I'm trying to find a way to store numbers/floats with their exact precision without using strings.
2) Is there any way in JS/NodeJS or even TypeScript to do this without using strings?
What I currently plan to do is receive the input, scan it for the format, store that format as a separate property, and then parse the incoming value, since parseInt('007.000') does work. When the user wants the value back, I return it to the user... as a string, unfortunately.
1) 007.000 is a syntax error because 007 is an octal integer literal, to which you're then appending a floating point part. (Try console.log(010). This prints 8.)
2) Here's how you can achieve your formatting using Intl.NumberFormat...
var myformat = new Intl.NumberFormat('en-US', {
minimumIntegerDigits: 3,
minimumFractionDigits: 3
});
console.log(myformat.format(7)); // prints 007.000
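The octal explanation can be checked directly. This is a minimal sketch, assuming non-strict (sloppy) code, which is what new Function compiles its body as; the variable names are only for illustration.

```javascript
// Legacy octal literals are only allowed in sloppy mode, which new Function provides.
const legacyOctal = new Function('return 010')();     // read as octal
const leadingZeros = new Function('return 007')();    // octal 7 is still 7
const trailingZeros = new Function('return 7.000')(); // plain decimal literal

// Combining both forms fails while parsing, so the error is caught around compilation.
let combinedError = null;
try {
  new Function('return 007.000');
} catch (error) {
  combinedError = error;
}

console.log(legacyOctal, leadingZeros, trailingZeros); // 8 7 7
console.log(combinedError instanceof SyntaxError);     // true
```

The parser reads 007 as a complete octal integer and then finds .000 as a second number right after it, which is not a valid expression.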
Hi
You can use an approach based on the string functions .split, .padStart and .padEnd.
See MDN:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/split
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/padStart
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/padEnd
Here you have an example:
const x = 12.1;
function formatNumber(unformattedNumber) {
  const integerPad = 3;
  const fractionPad = 3;
  // Split into integer and fraction parts; whole numbers have no fraction part.
  const [integerPart, fractionPart = ''] = unformattedNumber.toString().split('.');
  const integerPadded = integerPart.padStart(integerPad, '0');
  const fractionPadded = fractionPart.padEnd(fractionPad, '0');
  return integerPadded + '.' + fractionPadded;
}
console.log(formatNumber(x)) // prints 012.100

Does JavaScript support array/list comprehensions like Python?

I'm practicing/studying both JavaScript and Python. I'm wondering if JavaScript has an equivalent to this type of coding.
I'm basically trying to get an array of the individual integers in the string, for practice purposes. I'm more proficient in Python than JavaScript.
Python:
string = '1234-5'
forbidden = '-'
print([int(i) for i in str(string) if i not in forbidden])
Does JavaScript have something similar that would let me do the above?
Update: Array comprehensions were removed from the standard. Quoting MDN:
The array comprehensions syntax is non-standard and removed starting with Firefox 58. For future-facing usages, consider using Array.prototype.map, Array.prototype.filter, arrow functions, and spread syntax.
See this answer for an example with Array.prototype.map:
let emails = people.map(({ email }) => email);
Original answer:
Yes, JavaScript will support array comprehensions in the upcoming EcmaScript version 7.
Here's an example.
var str = "1234-5";
var ignore = "-";
console.log([for (i of str) if (!ignore.includes(i)) i]);
Given the question's Python code
print([int(i) for i in str(string) if i not in forbidden])
this is the most direct translation to JavaScript (ES2015):
const string = '1234-5';
const forbidden = '-';
console.log([...string].filter(c => !forbidden.includes(c)).map(c => parseInt(c)));
// result: [ 1, 2, 3, 4, 5 ]
Here is a comparison of the Python and JavaScript code elements being used:
(Python -> Javascript):
print -> console.log
unpacking string to list -> spread operator
list comprehension 'if' -> Array.filter
list comprehension 'for' -> Array.map
substr in str? -> string.includes
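If you want the filter and the map in a single pass, closer to how a comprehension reads, Array.prototype.flatMap (ES2019) can do both. A small sketch using the same variable names as the question:

```javascript
const string = '1234-5';
const forbidden = '-';

// Returning [] drops the element (the comprehension's "if"),
// returning [value] keeps the transformed element (the comprehension's expression).
const digits = [...string].flatMap((c) => (forbidden.includes(c) ? [] : [Number(c)]));

console.log(digits); // [ 1, 2, 3, 4, 5 ]
```

Whether this reads better than filter followed by map is a matter of taste; it mainly saves one intermediate array.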
Reading the code, I assume forbidden can have more than 1 character. I'm also assuming the output should be "12345"
var string = "12=34-5";
var forbidden = "=-";
console.log(string.split("").filter(function(str){
return forbidden.indexOf(str) < 0;
}).join(""))
If the output is "1" "2" "3" "4" "5" on separate lines
var string = "12=34-5";
var forbidden = "=-";
string.split("").forEach(function(str){
if (forbidden.indexOf(str) < 0) {
console.log(str);
}
});
Not directly, but it's not hard to replicate.
var string = "1234-5";
var forbidden = "-";
string.split("").filter(function(str){
if(forbidden.indexOf(str) < 0) {
return str;
}
}).forEach(function(letter) { console.log(letter);});
I guess more directly:
for (var i = 0; i < string.length; i++) {
  if (forbidden.indexOf(string[i]) < 0) {
    console.log(string[i]);
  }
}
But there's no built-in way to filter in your for loop.
You could easily achieve this behavior using an applicative functor.
Array.prototype.ap = function(xs) {
return this.reduce((acc, f) => acc.concat(xs.map(f)), [])
}
const result = [x => x +1].ap([2])
console.log(result)
JavaScript no longer supports array comprehensions.
I too was looking for the JavaScript equivalent. Mozilla Developer's Network indicates that this functionality is no longer supported.
The preferred syntax is referenced in the aforementioned link.
For "completeness"-sake, here's a shorter regexp version. The g flag is needed so that every forbidden character is removed, not just the first one.
var str = "1234-5";
var ignore = "-=";
console.log(str.replace(new RegExp(ignore.split("").join("|"), "g"), "").split(""));
EDIT: To make sure that RegExp does not "choke" on special characters, ignore can be implemented as a regexp literal instead of a string:
var str = "1234-5";
var ignore = /[\+=-]/g;
console.log(str.replace(ignore, "").split(""));
It does have a poor man's version:
const string = '1234-5'
const forbidden = '-'
// Python: print([int(i) for i in str(string) if i not in forbidden])
const result = string.split('').filter(char => char !== forbidden);
console.log(result)
In JS you can only iterate over single elements in array, so no extraction of multiple entries at a time like in Python.
For this particular case you should use a RegExp to filter the string though.
You could have a look at CoffeeScript.
CoffeeScript adds missing features to JavaScript and allows you to write cleaner, more readable code. https://coffeescript.org/#coffeescript-2
You write a .coffee file and the CoffeeScript compiler compiles it into a JavaScript file. Because the translation into JavaScript happens at compile time, the script should not run any slower.
So your code would look like the following in CoffeeScript:
string = '1234-5'
forbidden = '-'
alert(JSON.stringify(+i for i in string when i isnt forbidden))
Honestly, this is even easier to read than the Python counterpart. And it compiles quickly to the following JavaScript:
var forbidden, i, string;
string = '1234-5';
forbidden = '-';
alert(JSON.stringify((function() {
var j, len, results;
results = [];
for (j = 0, len = string.length; j < len; j++) {
i = string[j];
if (i !== forbidden) {
results.push(+i);
}
}
return results;
})()));
You don’t even need to install anything. On their website you can play around with it, and it will show you the translated JavaScript code.
Javascript doesn't need list comprehensions because the map and filter functions work better in the language compared to Python.
In Python:
[int(i) for i in '1234-5' if i != '-']
# is equivalent to the ugly
list(map(lambda _: int(_),filter(lambda _: _!='-','1234-5')))
# so we use list comprehensions
In Javascript, to me this is fine once you're familiar with the syntax:
[...'1234-5'].filter(_=> _!='-').map(_=> parseInt(_))

How do I convert String to Number according to locale (opposite of .toLocaleString)?

If I do:
var number = 3500;
alert(number.toLocaleString("hi-IN"));
I will get ३,५०० in Hindi.
But how can I convert it back to 3500?
I want something like:
var str='३,५००';
alert(str.toLocaleNumber("en-US"));
so that it gives 3500.
Is this possible with JavaScript or jQuery?
I think you are looking for something like:
https://github.com/jquery/globalize
The link above will take you to the project page on GitHub. This is a JS library contributed by Microsoft.
You should give it a try and use the format method of that plugin. If you want to read more about the plugin, here is the link:
http://weblogs.asp.net/scottgu/jquery-globalization-plugin-from-microsoft
I hope this is what you are looking for and will resolve your problem soon. If it doesn't work, let me know.
Recently I've been struggling with the same problem of converting a number formatted as a string in any locale back into a number.
I was inspired by the solution implemented in the PrimeNG InputNumber component. It uses Intl.NumberFormat.prototype.format() (which I recommend) to format the value to a locale string, and then creates a set of RegExp expressions based on simple samples so it can cut particular parts out of the formatted string.
This solution can be simplified by using Intl.NumberFormat.prototype.formatToParts(). This method returns information about the grouping/decimal/currency and all the other separators used to format your value in a particular locale, so you can easily strip them from the previously formatted string. It seems to be the easiest solution that will cover all cases, but you must know which locale the value was formatted in.
Why didn't PrimeNG go this way? I think it's because Intl.NumberFormat.prototype.formatToParts() is not supported in IE11, or perhaps there is something else I didn't notice.
A complete code example using this solution can be found here.
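As a rough idea of what the formatToParts() approach can look like, here is a minimal sketch. The function name parseLocaleNumber is made up for illustration. It assumes a runtime with full Intl locale data, a single-character group, decimal and minus sign, and that you know the locale the string was formatted in. The example asks for Devanagari digits explicitly with the -u-nu-deva extension, since plain hi-IN may be formatted with Latin digits depending on the runtime.

```javascript
// Hypothetical helper: parse a number that was formatted with Intl.NumberFormat(locale).
function parseLocaleNumber(text, locale) {
  // Format a sample number to learn which separators this locale uses.
  const parts = new Intl.NumberFormat(locale).formatToParts(-12345.6);
  const valueOf = (type) => {
    const part = parts.find((p) => p.type === type);
    return part ? part.value : '';
  };
  const group = valueOf('group');
  const decimal = valueOf('decimal');
  const minus = valueOf('minusSign');

  // Format 9876543210 without grouping; once reversed, position i holds the locale's digit i.
  const digitFormat = new Intl.NumberFormat(locale, { useGrouping: false });
  const localeDigits = [...digitFormat.format(9876543210)].reverse();
  const digitValue = new Map(localeDigits.map((digit, index) => [digit, String(index)]));

  let normalized = '';
  for (const ch of text.trim()) {
    if (digitValue.has(ch)) normalized += digitValue.get(ch);
    else if (ch >= '0' && ch <= '9') normalized += ch; // also accept plain ASCII digits
    else if (ch === decimal) normalized += '.';
    else if (ch === minus || ch === '-') normalized += '-';
    else if (ch === group) continue;                    // grouping separators carry no value
    else return NaN;                                    // anything else is not understood
  }
  return normalized === '' ? NaN : Number(normalized);
}

console.log(parseLocaleNumber('3.500,25', 'de-DE'));        // 3500.25
console.log(parseLocaleNumber('३,५००', 'hi-IN-u-nu-deva')); // 3500
```

Currency symbols, percent signs and exponents are deliberately not handled here; they would need their own part types.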
Unfortunately you will have to tackle the localisation manually. Inspired by this answer, I created a function that manually replaces the Hindi numerals:
function parseHindi(str) {
  // Devanagari digits ० to ९ are U+0966 to U+096F, so subtracting 2406 (0x966) gives 0 to 9.
  // Grouping commas are removed first so that Number() can parse the result.
  return Number(str.replace(/,/g, '').replace(/[०१२३४५६७८९]/g, function (d) {
    return d.charCodeAt(0) - 2406;
  }));
}
alert(parseHindi("३५००"));
Fiddle here: http://jsfiddle.net/yyxgxav4/
You can try this out
function ConvertDigits(input, source, target) {
var systems = {
arabic: 48, english: 48, tamil: 3046, kannada: 3302, telugu: 3174, hindi: 2406,
malayalam: 3430, oriya: 2918, gurmukhi: 2662, nagari: 2534, gujarati: 2790,
},
output = [], offset = 0, zero = 0, nine = 0, char = 0;
source = source.toLowerCase();
target = target.toLowerCase();
if (!(source in systems && target in systems) || input == null || typeof input == "undefined" || typeof input == "object") {
return input;
}
input = input.toString();
offset = systems[target] - systems[source];
zero = systems[source];
nine = systems[source] + 9;
for (var i = 0 ; i < input.length; i++) {
var char = input.charCodeAt(i);
if (char >= zero && char <= nine) {
output.push(String.fromCharCode(char + offset));
} else {
output.push(input[i]);
}
}
return output.join("");
}
var res = ConvertDigits('१२३४५६७८९', 'hindi', 'english');
I got it from here
If you need a jquery thing then please try this link
Use the Globalize library.
Install it
npm install globalize cldr-data --save
then
var cldr = require("cldr-data");
var Globalize = require("globalize");
Globalize.load(cldr("supplemental/likelySubtags"));
Globalize.load(cldr("supplemental/numberingSystems"));
Globalize.load(cldr("supplemental/currencyData"));
//replace 'hi' with appropriate language tag
Globalize.load(cldr("main/hi/numbers"));
Globalize.load(cldr("main/hi/currencies"));
//You may replace the above locale-specific loads with the following line,
// which will load every type of CLDR language data for every available locale
// and may consume several hundred megs of memory!
//Use with caution.
//Globalize.load(cldr.all());
//Set the locale
//We use the extension u-nu-native to indicate that Devanagari and
// not Latin numerals should be used.
// '-u' means extension
// '-nu' means numbering system
// '-native' means use native script
//Without -u-nu-native this example will not work
//See
// https://en.wikipedia.org/wiki/IETF_language_tag#Extension_U_.28Unicode_Locale.29
// for more details on the U language code extension
var hindiGlobalizer = Globalize('hi-IN-u-nu-native');
var parseHindiNumber = hindiGlobalizer.numberParser();
var formatHindiNumber = hindiGlobalizer.numberFormatter();
var formatRupeeCurrency = hindiGlobalizer.currencyFormatter("INR");
console.log(parseHindiNumber('३,५००')); //3500
console.log(formatHindiNumber(3500)); //३,५००
console.log(formatRupeeCurrency(3500)); //₹३,५००.००
https://github.com/codebling/globalize-example
A common scenario for this problem is displaying a float to the user and then wanting it back as a numerical value.
In that case, JavaScript has the number in the first place and loses it when formatting it for display. A simple workaround for the parsing is to store the real float value along with the formatted value:
var number = 3500;
div.innerHTML = number.toLocaleString("hi-IN");
div.dataset.value = number;
Then get it back by parsing the data attribute:
var number = parseFloat(div.dataset.value);
This is a Columbus's egg style answer. It works provided the problem is an egg.
var number = 3500;
var toLocaleString = number.toLocaleString("hi-IN")
var formatted = toLocaleString.replace(/,/g,'') // remove every grouping comma, not just the first
var converted = parseInt(formatted, 10)

deObfuscating in Python using transformed JS function

I needed to convert the following function to python to deobfuscate a text extracted while web scraping:
function obfuscateText(coded, key) {
// Email obfuscator script 2.1 by Tim Williams, University of Arizona
// Random encryption key feature by Andrew Moulden, Site Engineering Ltd
// This code is freeware provided these four comment lines remain intact
// A wizard to generate this code is at http://www.jottings.com/obfuscator/
shift = coded.length
link = ""
for (i = 0; i < coded.length; i++) {
if (key.indexOf(coded.charAt(i)) == -1) {
ltr = coded.charAt(i)
link += (ltr)
}
else {
ltr = (key.indexOf(coded.charAt(i)) - shift + key.length) % key.length
link += (key.charAt(ltr))
}
}
document.write("<a href='mailto:" + link + "'>" + link + "</a>")
}
here is my converted python equivalent:
def obfuscateText(coded,key):
    shift = len(coded)
    link = ""
    for i in range(0,len(coded)):
        inkey=key.index(coded[i]) if coded[i] in key else None
        if ( not inkey):
            ltr = coded[i]
            link += ltr
        else:
            ltr = (key.index(coded[i]) - shift + len(key)) % len(key)
            link += key[ltr]
    return link
print obfuscateText("uw#287u##Guw#287Xw8Iwu!#W7L#", "WXYVZabUcdTefgShiRjklQmnoPpqOrstNuvMwxyLz01K23J456I789H.#G!#$F%&E'*+D-/=C?^B_`{A|}~")
actionattraction$comcastWnet
but I am getting slightly incorrect output: instead of actionattraction#comcast.net I get the above. Also, the code above often gives random characters for the same HTML page.
The target HTML page has an obfuscateText function in JS along with the coded and key arguments. I extract the function call into obsfunc and execute it on the fly:
email=eval(obsfunc)
which stores the email in the variable above. It works most of the time but fails some of the time. I strongly suspect that the problem is with the arguments supplied to the Python function: they may need escaping or conversion, since they contain special characters. I tried passing raw arguments and different conversions such as repr(), but the problem persisted.
Some examples for actionattraction#comcast.net, wrongly and correctly computed by the same Python function (the first line of each pair is the resulting email):
#ation#ttr#ationVaoma#st!nct
obfuscateText("KMd%Y#Kdd8KMd%Y#IMY!MKcdJ#*d", "utvsrwqxpyonzm0l1k2ji3h4g5fe6d7c8b9aZ.Y#X!WV#U$T%S&RQ'P*O+NM-L/K=J?IH^G_F`ED{C|B}A~")
}ction}ttr}ction#comc}st.net
obfuscateText("}ARGML}RRP}ARGMLjAMKA}QRiLCR", "}|{`_^?=/-+*'&%$#!#.9876543210zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA~")
actionattraction#comcast.net
obfuscateText("DEWLRQDWWUDEWLRQoERPEDVWnQHW", "%&$#!#.'9876*54321+0zyxw-vutsr/qponm=lkjih?gfed^cbaZY_XWVUT`SRQPO{NMLKJ|IHGFE}DCBA~")
I've rewritten the deobfuscator:
def deobfuscate_text(coded, key):
    offset = (len(key) - len(coded)) % len(key)
    shifted_key = key[offset:] + key[:offset]
    lookup = dict(zip(key, shifted_key))
    return "".join(lookup.get(ch, ch) for ch in coded)
and tested it as
tests = [
    ("KMd%Y#Kdd8KMd%Y#IMY!MKcdJ#*d", "utvsrwqxpyonzm0l1k2ji3h4g5fe6d7c8b9aZ.Y#X!WV#U$T%S&RQ'P*O+NM-L/K=J?IH^G_F`ED{C|B}A~"),
    ("}ARGML}RRP}ARGMLjAMKA}QRiLCR", "}|{`_^?=/-+*'&%$#!#.9876543210zyxwvutsrqponmlkjihgfedcbaZYXWVUTSRQPONMLKJIHGFEDCBA~"),
    ("DEWLRQDWWUDEWLRQoERPEDVWnQHW", "%&$#!#.'9876*54321+0zyxw-vutsr/qponm=lkjih?gfed^cbaZY_XWVUT`SRQPO{NMLKJ|IHGFE}DCBA~"),
    ("ZUhq4uh#e4Om.04O", "ksYSozqUyFOx9uKvQa2P4lEBhMRGC8g6jZXiDwV5eJcAp7rIHL31bnTWmN0dft")
]
for coded,key in tests:
    print(deobfuscate_text(coded, key))
which gives
actionattraction#comcast.net
actionattraction#comcast.net
actionattraction#comcast.net
anybody#home.com
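For cross-checking the Python output against the original page logic outside a browser, the same lookup-table idea can be sketched in JavaScript without document.write. The name deobfuscateText is made up for illustration, and the sketch assumes the key has already been HTML-unescaped. It keeps the first occurrence of each key character, which is what indexOf does in the original script.

```javascript
// Hypothetical DOM-free version of the page's obfuscateText, using a lookup table.
function deobfuscateText(coded, key) {
  const chars = [...key];
  const size = chars.length;
  // Shifting left by coded.length is the same as shifting right by this offset.
  const offset = (size - (coded.length % size)) % size;
  const lookup = new Map();
  chars.forEach((ch, index) => {
    // Keep the first occurrence only, matching key.indexOf() in the original script.
    if (!lookup.has(ch)) lookup.set(ch, chars[(index + offset) % size]);
  });
  // Characters that are not in the key are passed through unchanged.
  return [...coded].map((ch) => (lookup.has(ch) ? lookup.get(ch) : ch)).join('');
}

console.log(deobfuscateText(
  'ZUhq4uh#e4Om.04O',
  'ksYSozqUyFOx9uKvQa2P4lEBhMRGC8g6jZXiDwV5eJcAp7rIHL31bnTWmN0dft'
)); // anybody#home.com
```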
Note that all three key strings contain &amp;; replacing it with & fixes the problem. Presumably at some point the JavaScript was mistakenly HTML-escaped; Python has a module which will unescape HTML special characters like so:
# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)
# Python 3.x (3.4 and later; HTMLParser.unescape was removed in 3.9):
import html
unescaped = html.unescape(my_string)
First of all, index doesn't return None; it raises an exception. In your case, W appears instead of a dot because the index returned is 0, so "not inkey" (which is also wrong) mistakenly concludes that the character is not present in the key.
Second, the presence of & suggests that you may indeed have to find and decode HTML entities.
Finally, I'd recommend rewriting it like this:
def deobfuscate(coded, key):
    len0 = len(coded)
    len1 = len(key)
    link = ''
    for ch in coded:
        try:
            ch = key[(key.index(ch) - len0 + len1) % len1]
        except ValueError:
            pass  # character is not in the key, keep it unchanged
        link += ch
    return link

jQuery / JavaScript Parsing strings the proper way

Recently, I've been attempting to emulate a small language in jQuery and JavaScript, yet I've come across what I believe is an issue. I think that I may be parsing everything completely wrong.
In the code:
#name Testing
#inputs
#outputs
#persist
#trigger
print("Test")
The current way I am separating and parsing the string is by splitting all of the code into lines, and then reading through this lines array using searches and splits. For example, I would find the name using something like:
if(typeof lines[line] === 'undefined')
{
}
else
{
if(lines[line].search('#name') == 0)
{
name = lines[line].split(' ')[1];
}
}
But I think that I may be largely wrong on how I am handling parsing.
While reading through examples on how other people are handling parsing of code blocks like this, it appeared that people parsed the entire block, instead of splitting it into lines as I do. I suppose the question of the matter is, what is the proper and conventional way of parsing things like this, and how do you suggest I use it to parse something such as this?
In simple cases like this, regular expressions are your tool of choice:
matches = code.match(/#name\s+(\w+)/)
name = matches[1]
To parse "real" programming languages, regexps are not powerful enough; you'll need a parser, either hand-written or automatically generated with a tool like PEG.
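For the small header format in the question, a regexp per line may already be enough. This is a minimal sketch; the function name parseScript and the shape of the returned object are assumptions for illustration. It treats every line starting with # as a directive and everything else as the body.

```javascript
// Hypothetical helper: split a script into "#directive value" headers and body lines.
function parseScript(code) {
  const directives = {};
  const body = [];
  for (const line of code.split(/\r?\n/)) {
    // "#name Testing" -> name: "Testing"; "#inputs" -> inputs: ""
    const match = line.match(/^#(\w+)(?:\s+(.*))?$/);
    if (match) {
      directives[match[1]] = (match[2] || '').trim();
    } else if (line.trim() !== '') {
      body.push(line);
    }
  }
  return { directives, body };
}

const sample = [
  '#name Testing',
  '#inputs',
  '#outputs',
  '#persist',
  '#trigger',
  'print("Test")',
].join('\n');

const parsed = parseScript(sample);
console.log(parsed.directives.name); // Testing
console.log(parsed.body);            // [ 'print("Test")' ]
```

The body lines are left untouched here; parsing statements such as print("Test") is where a real parser starts to pay off.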
A general approach to parsing that I often like to take is the following:
Loop through the complete block of text, character by character.
If you find a character that signals the start of a unit, call a specialized subfunction to parse the next characters.
Within each subfunction, call additional subfunctions if you find certain characters.
Return from every subfunction when a character is found that signals that the unit has ended.
Here is a small example:
var text = "#func(arg1,arg2)"
function parse(text) {
var i, max_i, ch, funcRes;
for (i = 0, max_i = text.length; i < max_i; i++) {
ch = text.charAt(i);
if (ch === "#") {
funcRes = parseFunction(text, i + 1);
i = funcRes.index;
}
}
console.log(funcRes);
}
function parseFunction(text, i) {
var max_i, ch, name, argsRes;
name = [];
for (max_i = text.length; i < max_i; i++) {
ch = text.charAt(i);
if (ch === "(") {
argsRes = parseArguments(text, i + 1);
return {
name: name.join(""),
args: argsRes.arr,
index: argsRes.index
};
}
name.push(ch);
}
}
function parseArguments(text, i) {
var max_i, ch, args, arg;
arg = [];
args = [];
for (max_i = text.length; i < max_i; i++) {
ch = text.charAt(i);
if (ch === ",") {
args.push(arg.join(""));
arg = [];
continue;
} else if (ch === ")") {
args.push(arg.join(""));
return {
arr: args,
index: i
};
}
arg.push(ch);
}
}
This example just parses function expressions that follow the syntax "#functionName(argumentName1, argumentName2, ...)". The general idea is to visit every character exactly once without needing to save state like "hasSeenAtCharacter" or "hasSeenOpeningParentheses", which can get pretty messy when you parse large structures.
Please note that this is a very simplified example and it lacks all error handling and the like, but I hope the general idea can be seen. Note also that I'm not saying you should use this approach all the time. It's a very general approach that can be used in many scenarios. That doesn't mean it can't be combined with regular expressions, for instance, if that makes more sense for some part of your text than parsing each individual character.
And one last remark: you can save yourself the trouble if you put the specialized parsing function inside the main parsing function, so that all functions have access to the same variable i.
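That last remark could look roughly like this. It is a minimal sketch with made-up names (parseCall, readUntil), assuming the same "#functionName(arg1,arg2)" syntax: the helper functions live inside the main function and share one position variable i, so no index has to be passed in or returned.

```javascript
// Hypothetical variant of the parser where nested functions share the position i.
function parseCall(text) {
  let i = 0; // shared by all nested functions below

  // Collect characters until one of the stop characters (or the end) is reached.
  function readUntil(stops) {
    let out = '';
    while (i < text.length && !stops.includes(text[i])) out += text[i++];
    return out;
  }

  function parseArguments() {
    const args = [];
    while (i < text.length) {
      args.push(readUntil(',)').trim());
      const ch = text[i++]; // consume the ',' or ')'
      if (ch === ')') return args;
    }
    throw new SyntaxError("missing ')'");
  }

  function parseFunction() {
    const name = readUntil('(');
    if (text[i] !== '(') throw new SyntaxError("expected '('");
    i++; // skip '('
    return { name, args: parseArguments() };
  }

  if (text[i] !== '#') throw new SyntaxError("expected '#'");
  i++; // skip '#'
  return parseFunction();
}

console.log(parseCall('#func(arg1,arg2)')); // { name: 'func', args: [ 'arg1', 'arg2' ] }
```

Every character is still visited exactly once; the closure simply replaces the index field that the earlier version had to return from each subfunction.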
