How can I get a massive text file into a JavaScript array? - javascript

I'm using a list of words with positive and negative sentiment from AFINN to do some text analysis.
Problem is, the list comes in a .txt file in the following format (word on the left, pos vs neg index at right):
casualty -2
catastrophe -3
catastrophic -4
cautious -1
celebrate 3
celebrated 3
celebrates 3
celebrating 3
To work with it, I need it in the following format:
var array = [{word:"casualty",score:-2},{word:"catastrophe",score:-3},{word:"catastrophic",score:-4}, etc etc]
I'd actually prefer to do this once with a shell script, rather than in the browser. Which is why I'm thinking Node.js could come in handy here. But I'm not very familiar with Node.
Direct link to the zip containing the raw text files.

In case you don't really care about how to read text into a javascript array, and you just need AFINN in JSON, I just found a version here.
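For the conversion itself, a minimal Node.js sketch might look like the following. It assumes each line holds a word (or phrase), then whitespace, then an integer score, as in the sample above. The names `parseAfinn` and `convertFile`, and the file paths in the usage note, are hypothetical.

```javascript
const fs = require("fs");

// Hypothetical helper: turns AFINN-style text ("word<whitespace>score" per line)
// into an array of { word, score } objects. Lines that do not match are skipped.
function parseAfinn(text) {
  const entries = [];
  for (const line of text.split(/\r?\n/)) {
    // Lazy first group so that multi-word phrases stay together;
    // the score is the last integer on the line.
    const match = line.trim().match(/^(.+?)\s+(-?\d+)$/);
    if (match) {
      entries.push({ word: match[1], score: Number(match[2]) });
    }
  }
  return entries;
}

// Hypothetical one-off conversion: read the .txt file, write a .json file.
function convertFile(inputPath, outputPath) {
  const entries = parseAfinn(fs.readFileSync(inputPath, "utf8"));
  fs.writeFileSync(outputPath, JSON.stringify(entries));
  return entries.length;
}
```

Run once from a script, for example `convertFile("afinn.txt", "afinn.json")` (both paths are placeholders), and the resulting JSON can be loaded in the browser as a plain array.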

Related

Converting between Bases, and from a String to any Base N in JavaScript (Radix conversion)

First post on here!
I've done a couple hours of research, and I can't seem to find any actual answers to this, though it may be my understanding that's wrong.
I want to convert a string, let's say "Hello 123", into any Base N, let's say N = 32 for simplicity.
My Attempt
Using Javascript's built-in methods (Found through other websites, and):
function stringToBase(string, base) {
  return parseInt(string, 10).toString(base);
}
So, this encodes the string to base 10 (decimal) and then into the base I want. The caveat is that it only works for bases from 2 to 36, which is good, but not really the range I'm looking for.
More
I'm aware that I can use the JS BigInt, but I'm looking to convert with bases as high as 65536 that uses an arbitrary character set that does not stop when encountering ASCII or (yes I'm aware it's completely useless, I'm just having some fun and I'm very persistent). Most solutions I've seen use an alphabet string or array (e.g. "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz+-").
I've seen a couple of threads that say that encoding to a radix which is not divisible by 2 won't work. Is that true, given that Base85 and Base91 exist?
I know that the methods atob() and btoa() exist, but this is only for Radix/Base 64.
Some links:
I had a look at this github page: https://github.com/gliese1337/base-to-base/blob/main/src/index.ts , but it's in typescript and I'm not even sure what's going on.
This one is in JS: https://github.com/adanilo/base128codec/blob/master/b128image.js . It makes a bit more sense than the last one, but the fact there is a whole github page just for Base 128 sort of implies that they're all unique and may not be easily converted.
This is the aim of the last and final base: https://github.com/qntm/base65536 . The output of "Hello World!" for instance, is "驈ꍬ啯𒁗ꍲ噤".
(I can code java much better than JS, so if there is a java solution, please let me know as well)
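The alphabet-based approach mentioned above can be sketched with BigInt: treat the UTF-8 bytes of the string as one large number, then repeatedly divide by the size of the alphabet. Positional conversion like this works for any base of 2 or more, whether or not it is a power of 2. The names `encodeBaseN` and `decodeBaseN` are hypothetical, and this is not how base65536 or Base85 actually lay out their output. It also drops leading NUL bytes, which is acceptable for ordinary text.

```javascript
// Hypothetical sketch: the alphabet is any string of distinct symbols;
// its length (counted in code points) is the base.
function encodeBaseN(text, alphabet) {
  const symbols = Array.from(alphabet); // split by code point, not UTF-16 unit
  const base = BigInt(symbols.length);
  if (base < 2n) throw new RangeError("alphabet needs at least 2 symbols");

  // Interpret the UTF-8 bytes as one big base-256 number.
  let value = 0n;
  for (const byte of new TextEncoder().encode(text)) {
    value = value * 256n + BigInt(byte);
  }

  // Peel off digits in the target base, least significant first.
  const digits = [];
  while (value > 0n) {
    digits.push(symbols[Number(value % base)]);
    value /= base;
  }
  return digits.reverse().join("") || symbols[0];
}

function decodeBaseN(encoded, alphabet) {
  const symbols = Array.from(alphabet);
  const base = BigInt(symbols.length);
  const indexOf = new Map(symbols.map((symbol, i) => [symbol, BigInt(i)]));

  let value = 0n;
  for (const symbol of Array.from(encoded)) {
    if (!indexOf.has(symbol)) throw new RangeError("symbol not in alphabet: " + symbol);
    value = value * base + indexOf.get(symbol);
  }

  // Convert the number back into bytes, most significant first.
  const bytes = [];
  while (value > 0n) {
    bytes.unshift(Number(value % 256n));
    value /= 256n;
  }
  return new TextDecoder().decode(Uint8Array.from(bytes));
}
```

Because the alphabet is split by code point, it can contain characters outside the Basic Multilingual Plane, so an alphabet of 65536 distinct symbols would work the same way as one of 32.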

For char golfing in javascript, encode with an encoding, decode with another

I'm doing char golfing these days in different languages. I was skeptical at first because it's totally disconnected from 'real world' practices, but I ended up loving it for its educational value: I learned a LOT about my languages in the process.
And let's admit it, it's fun.
I'm currently trying to learn tricks in JS and here's the latest one I found:
Say, you have this script:
for(i=5;i--;)print(i*i) (23 chars)
The script is made of ASCII chars, each of them is basically a pair of hex digits.
For example 'f' is 66 and 'o' is 6f.
So if you group the information from these two chars you get 666f, which is the UTF-16 code of one char: 景
My script has an odd number of chars so let's add a space somewhere to make it even:
for(i=5;i--;) print(i*i) (24 chars)
and now by applying the previous idea to the whole script we get:
景爨椽㔻椭ⴻ⤠灲楮琨椪椩 (12 chars)
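As a rough illustration of the packing step described above (not a golfed version), here is a sketch that combines each pair of ASCII chars into one UTF-16 code unit and splits them apart again. The helper names `packPairs` and `unpackPairs` are hypothetical, and the sketch assumes every input char has a code below 0x80.

```javascript
// Pack two ASCII chars into one UTF-16 code unit: first char in the high byte,
// second char in the low byte (so "fo" becomes 0x666f).
function packPairs(script) {
  if (script.length % 2 !== 0) {
    throw new RangeError("pad the script to an even length first");
  }
  let packed = "";
  for (let i = 0; i < script.length; i += 2) {
    packed += String.fromCharCode((script.charCodeAt(i) << 8) | script.charCodeAt(i + 1));
  }
  return packed;
}

// Reverse the packing: split each code unit back into its two bytes.
function unpackPairs(packed) {
  let script = "";
  for (let i = 0; i < packed.length; i++) {
    const code = packed.charCodeAt(i);
    script += String.fromCharCode(code >> 8, code & 0xff);
  }
  return script;
}
```

Calling `packPairs` on the 24-char script above yields a 12-char string, and `unpackPairs` restores the original.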
So now my question is: how can I reconstruct the script back from the 12 chars with as few chars as possible?
I came up with that:
eval(unescape(escape`景爨椽㔻椭ⴻ⤠灲楮琨椪椩`.replace(/%u(..)/g,'%$1%')))
but it adds a constant cost of 50 chars to the process, which makes this method useless if your script has fewer than 100 chars.
It's great for long scripts (e.g. 600 chars becomes 350 chars), but in golfing problems the script is rarely long; usually it's less than 100 chars.
I'm not an encoding specialist at all, which is why I came here: I'm pretty sure there's a shorter method.
A constant cost of 30 chars would already be amazing, because it would drop the threshold from 100 to 60 chars.
Note that I used UTF-16 here, but it could be another encoding; as long as it shortens the script, I'm happy with it.
My Version of JS is: Node 12.13.0
The standard way to switch between string encodings in Node.js is to use the Buffer API:
Buffer.from(…, "utf16le").toString("ascii")
To golf this a bit, you can take advantage of some legacy options and defaults:
''+new Buffer(…,"ucs2")
(The .toString() without arguments actually does use UTF-8 but it doesn't matter for ASCII data)
Since Node only supports UTF-16LE and not UTF-16BE, your string won't work as-is; you'll need to swap the bytes, which yields different characters:
global.print = console.log;
eval(''+new Buffer("潦⡲㵩㬵⵩㬭\u2029牰湩⡴⩩⥩","ucs2"))
(online demo)

Convert number to fraction using symbols like ¼ ½ ¾ (in google spreadsheets)

I would like to convert a number to a fraction in Google spreadsheets (I'm on the latest Firefox, Windows 7). After some searching I managed to find this formula which works well for me (It converts cell A1 into a fraction):
=IF(INT(A1)=A1,A1,IF(INT(A1)>0 ; INT(A1) &" " ; "")&(INDEX(SPLIT((TEXT( MOD(A1;1);"000000000000000E000")); "E");1;1))/GCD((INDEX(SPLIT((TEXT( MOD(A1;1); "000000000000000E000"));"E");1;1));(10^(-1*INDEX(SPLIT((TEXT( MOD(A1;1);"000000000000000E000")); "E");1;2))))&"/"& (10^(-1*INDEX(SPLIT((TEXT( MOD(A1;1);"000000000000000E000")); "E");1;2)))/GCD((INDEX(SPLIT((TEXT( MOD(A1;1); "000000000000000E000"));"E");1;1));(10^(-1*INDEX(SPLIT((TEXT( MOD(A1;1);"000000000000000E000")); "E");1;2)))))
What I would like to do is use symbols like ¼ ½ and ¾ (so that for example the formula converts 1.5 to "1 ½"). These are symbols I copied and pasted from Microsoft Word. They paste into Google spreadsheets fine but as text, so I'm guessing I need to add Concatenate into this formula, but I don't know how.
My value in cell A1 will always be a multiple of 0.25 (e.g. 0.5 or 2 or 3.75 etc), so I will only need the symbols ¼ ½ and ¾.
If anyone knows how to do this, preferably with a formula or otherwise a script, I would be very grateful indeed.
Since you only have multiples of 0.25, the following formula will work for you. If a value is not a multiple of 0.25, the result will be wrong.
=IF(INT(A1)=A1,A1,IF(INT(A1)>0 ; INT(A1) &" " ; "")&(if(A1-INT(A1)>0.5; "¾"; if(A1-INT(A1)>0.25; "½"; "¼"))))
Various decimal-to-fraction converters have been implemented in Javascript, for example this one.
That function could be imported and used as a custom spreadsheet function. It is a simple extension to replace fractions with the equivalent ASCII or Unicode characters.
A custom function will be far more readable than implementing a complex spreadsheet function.
=dec2frac(A1)
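As a rough sketch of what such a custom function could look like for the quarter-step case in the question: the body below is a hypothetical implementation, not the converter linked above. It assumes a non-negative input that is a multiple of 0.25, and it returns text rather than a number even for whole values.

```javascript
// Hypothetical custom function: maps the fractional part to a Unicode
// fraction symbol. Only quarters are supported.
function dec2frac(value) {
  const symbols = { 0.25: "¼", 0.5: "½", 0.75: "¾" };
  const whole = Math.floor(value);
  const fraction = value - whole; // exact for multiples of 0.25

  if (fraction === 0) {
    return String(whole);
  }
  const symbol = symbols[fraction];
  if (symbol === undefined) {
    throw new Error("dec2frac only supports multiples of 0.25");
  }
  // Omit the whole part for values below 1, e.g. 0.5 becomes "½".
  return whole > 0 ? whole + " " + symbol : symbol;
}
```

With this in the spreadsheet's script editor, a cell containing 1.5 would be rendered as "1 ½" by `=dec2frac(A1)`.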

Getting the first dollar amount using JavaScript with Razor and .Net

I asked a similar question on how to do this on the server side (SQL), however it makes more sense to accomplish this on the client side, based on the app architecture.
I've got an MVC3 app with Razor on the .NET Framework, where I have model data available that I would like to parse, returning the first dollar value from a given string using JavaScript / regex.
For example, each of the following lines represents a sample data set:
Used knife set for sale $200.00 or best offer.
$4,500 Persian rug for sale.
Today only, $100 rebate.
Five items for sale: $20 Motorola phone car charger, $150 PS2, $50.00 3 foot high shelf.
I've seen a few issues already including the # in JS and a few other pitfalls I would like to try to avoid.
Thanks.
var m = line.match(/\$[0-9,]+\.?\d*/);
if (m)
return m[0];
should give you a hint. This regex returns a string consisting of a dollar sign, some digits or commas, and optionally a dot followed by a few more digits. You might want to restrict it further (only 2 decimals, not starting with zero, etc.).
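One way to tighten the pattern along those lines is sketched below. It is a variation on the regex above, not the original answer's code. The function name `firstDollarAmount` is hypothetical. The sketch assumes commas only appear as thousands separators and that cents, if present, are exactly two digits.

```javascript
// Returns the first dollar amount in the line, or null if there is none.
function firstDollarAmount(line) {
  // "$", digits, optional ",ddd" groups, optional ".dd" cents.
  const match = line.match(/\$\d+(?:,\d{3})*(?:\.\d{2})?/);
  return match ? match[0] : null;
}
```

Unlike the looser pattern, this does not swallow a trailing comma or sentence-ending period after the amount.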

identify song information from mp3 id3v2 tags

I already have the MP3 binary data; I just want to know how I can extract info from it. v1 is easy: take the last 128 characters and you are done. But v2 has variable length. The documentation says that the tag size will be in the header, but I was unable to find it in any song I tested.
Anyway, I simply want to extract the album and artist info, just these two, with JavaScript. For simplicity's sake, let's say I have the first 2000 bytes of a Taylor Swift song in a variable (below is the actual binary data of a song):
ID3!vTYER2010TIT2
Last KissMCDI¬E+96+4484+918B+E800+12F4B+1A636+1EC24+23A8E+2905F+2F7DD+33868+3914B+3D931+44555+4A27BTRCK13TCON(2)CountryPRIVPeakValue¡PRIVAverageLevel{ TPE2
Taylor SwiftPRIV)WM/MediaClassSecondaryIDPRIV'WM/MediaClassPrimaryID¼}`Ñ#ãâK¡H¤*(DPRIVWM/ProviderAMGPRIVWM/WMContentIDÇ1t>êDëþëPRIV"WM/WMCollectionID ¨F}âH"Y#7 ÈPRIV'WM/WMCollectionGroupID ¨F}âH"Y#7 ÈTPUBBig MachinePRIVWM/UniqueFileIdentifierAMGa_id=R 2026672;AMGp_id=P 816977;AMGt_id=T 22057912TALB
Speak NowTPE1
Taylor SwiftTLEN369120ÿûà#üK
Now I can easily locate the album and artist name (last two lines), and I can also find where the data begins with JS pretty easily: just locate TALB and TPE1. Simple. But how in the world do I know where the data ends? They may or may not be adjacent to each other in other songs. They may or may not be uppercase. How do all the other libraries figure out where the data ends?
Also, there is no 'size' at the beginning as the documentation suggests.
EDIT: Can anyone help me out, please? I really need this.
The binary sample you show is missing some data. An ID3 version 2.4 tag frame header is 10 bytes in length and consists of the following fields:
ID -- 4 bytes (e.g. TIT2)
Size -- 4 bytes (is sync-safe in versions >= 2.4)
Flags -- 2 bytes
The size field tells you how many bytes of data are in that particular frame. Similarly the actual tag header is 10 bytes as well:
ID -- 3 bytes (always ID3)
Version -- 2 bytes (major version and revision. e.g. 0x04 0x00 indicates a 2.4.0 tag version)
Flags -- 1 byte
Size -- 4 bytes (is sync-safe in versions >= 2.3)
See: http://id3.org/id3v2.4.0-structure
Once your script has the binary data, you can parse these size fields to determine the size of the complete tag as well as the size of each frame. Once you get to this point you're going to run into sync-safe integers.
See: Why are there Synchsafe Integer?
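The header layout above can be turned into a small parser. The following is a minimal sketch, not a complete ID3 reader. It assumes the input is a `Uint8Array` (or Node `Buffer`) starting at the tag, handles only v2.3/v2.4 style 10-byte frame headers, and ignores extended headers and unsynchronisation. It decodes only text frames stored as ISO-8859-1 or UTF-8. The function names are hypothetical.

```javascript
// Sync-safe integer: only the low 7 bits of each of the 4 bytes carry data.
function syncsafeToInt(bytes, offset) {
  return (
    ((bytes[offset] & 0x7f) << 21) |
    ((bytes[offset + 1] & 0x7f) << 14) |
    ((bytes[offset + 2] & 0x7f) << 7) |
    (bytes[offset + 3] & 0x7f)
  );
}

// Plain big-endian 32-bit integer (frame sizes before version 2.4).
function plainToInt(bytes, offset) {
  return (
    bytes[offset] * 0x1000000 +
    ((bytes[offset + 1] << 16) | (bytes[offset + 2] << 8) | bytes[offset + 3])
  );
}

function asciiAt(bytes, offset, length) {
  return String.fromCharCode(...bytes.subarray(offset, offset + length));
}

// Returns an object mapping text frame IDs (e.g. TALB, TPE1) to their values.
function readTextFrames(bytes) {
  if (asciiAt(bytes, 0, 3) !== "ID3") return {};
  const major = bytes[3];
  const end = Math.min(10 + syncsafeToInt(bytes, 6), bytes.length);
  const frames = {};
  let pos = 10; // first frame follows the 10-byte tag header

  while (pos + 10 <= end) {
    const id = asciiAt(bytes, pos, 4);
    if (!/^[A-Z0-9]{4}$/.test(id)) break; // padding or unrecognised data
    const size = major >= 4 ? syncsafeToInt(bytes, pos + 4) : plainToInt(bytes, pos + 4);
    const data = bytes.subarray(pos + 10, pos + 10 + size);

    if (id[0] === "T" && id !== "TXXX" && size > 0) {
      const encoding = data[0]; // first byte of a text frame names the encoding
      const payload = data.subarray(1);
      let text = null;
      if (encoding === 0) text = String.fromCharCode(...payload); // ISO-8859-1
      else if (encoding === 3) text = new TextDecoder("utf-8").decode(payload);
      if (text !== null) frames[id] = text.replace(/\0+$/, "");
    }
    pos += 10 + size; // the size field says where this frame's data ends
  }
  return frames;
}
```

The key point for the question is the last line of the loop: the end of each frame's data comes from its size field, so there is no need to search for the next frame ID.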
Try this library, looks like it does what you need.
