MurmurHash3_32 Java returns negative numbers - javascript

I am trying to replicate the file hashing of MobileSheetsPro, an Android app, where there is a hashcodes.txt which includes a hash for each file, as well as the path, last modified date and filesize. We'll just focus on the hashing part.
So, for the random song I uploaded here (if you want to try it for yourself), I read the file into a buffer and hash it with the murmurhash-native npm package, like so:
const fs = require("fs");
const { promisify } = require("util");
const { murmurHash } = require("murmurhash-native");

const readFileAsync = promisify(fs.readFile);

async function hashcodeObjFromFilePath(filepath) {
  // read the file into a Buffer and hash it
  const buf = await readFileAsync(filepath);
  const h = murmurHash(buf);
  console.log(h);
}
This prints out a hash of 4275668817 when using the default seed of 0 and 3020822739 when using the seed 0xc58f1a7b as a second argument.
The problem: the app seems to calculate it differently. The developer wrote the following, but I don't see that exact function in the code he linked:
Check this out: github link
Those are the classes I've utilized. I call
Hashing.goodFast32Hash(HASH_KEY) where HASH_KEY is equal to
0xC58F1A7B.
EDIT: I've got more info from the dev:
I call Files.hash(file, Hashing.goodFast32Hash(HASH_KEY)); Using the
return value from that, I call "asInt()" on the HashCode object that
is returned. So it's a signed integer value (negative values are just
fine). And yes, HASH_KEY is the seed value passed to the function.
Since I'm not good at Java, I still have no idea how to replicate this in Node.js...
That's all the info I have, folks.
Anyone see where I'm going wrong?

Found it! The asInt() function in the Java lib returns a signed int32 in little-endian byte order.
The following is probably not the easiest way, but this code
const h = murmurHash(buf, "buffer", 0xc58f1a7b);
// console.log(parseInt(h, 16).toString(2));

// build a reference buffer from the value stored in the hashcodes file
const fromHashFile = Buffer.alloc(4);
fromHashFile.writeInt32BE(-1274144557);

console.log(fromHashFile);
console.log(h);
console.log(h.readInt32BE());
console.log("hash from hashcodes file: -1274144557");
prints the following to the console:
<Buffer b4 0e 18 d3>
<Buffer b4 0e 18 d3>
-1274144557
hash from hashcodes file: -1274144557
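
For what it's worth, there seems to be a shorter route to the same value: murmurhash-native's default numeric output is the unsigned 32-bit hash (3020822739 above), and coercing it to a signed 32-bit integer yields exactly what Java's asInt() reports. A minimal sketch, assuming the numeric output shown earlier and reusing buf from the question:

const { murmurHash } = require("murmurhash-native");

// murmurHash returns an unsigned 32-bit number by default;
// `| 0` reinterprets it as a signed int32, matching Guava's asInt().
const unsigned = murmurHash(buf, 0xc58f1a7b); // 3020822739 for the test file
const signed = unsigned | 0; // -1274144557
console.log(signed);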

Related

reading avro compressed data with snappy generates an error of "snappy codec decompression error"

I have an application (KafkaConnect) that is generating avro files into S3.
These files are compressed with the avro codec "snappy".
I'm trying to read them with JavaScript (I'm not a very strong JavaScript developer, as you will be able to guess).
I tried to use avro-js or avsc as libraries to help me with this, since they are referenced in most of the online examples I found for doing this.
The most complete and useful example I found was here.
Anyway, it seems most examples I found use snappy version 6, which seems to be a bit different from version 7 (the latest).
One of the main things I noticed is that it now provides two uncompress methods: a sync one and one that returns a promise, but none that can receive a callback function.
Anyway, I think this is not an issue because I could work around it regardless, but my best attempt to read these files would be something like this (with avsc):
const avsc = require('avsc');
const snappy = require('snappy');

const codecs = {
  snappy: function (buf, cb) {
    // Avro appends checksums to compressed blocks, which we skip here.
    const buffer = snappy.uncompressSync(buf.slice(0, buf.length - 4));
    return cb(buffer);
  }
};

avsc.createFileDecoder('person-10.snappy.avro', { codecs })
  .on('metadata', function (writerType) {
    console.log(writerType.name);
  })
  .on('data', function (obj) {
    console.log('on data ');
    console.log(obj);
  })
  .on('end', function () {
    console.log('end');
  });
Anyway, the processing of the metadata works without issues (I can access the full schema information), but the data always fails with
Uncaught Error: snappy codec decompression error
I'm looking for someone who has for some reason worked with avro and snappy in their latest versions and managed to make this work.
Because I'm really struggling to understand this, I created a fork of the official avsc repo and tried to introduce my examples there to see how this works, but if it's more useful I could try to create a simpler reproducible scenario.
The documentation of the package I was using has been updated, and now the problem is fixed:
https://github.com/mtth/avsc/wiki/API#class-blockdecoderopts
Mainly, I was just wrong about how to call the callback function and how to hand the buffer to snappy.
This is the correct way (as documented):
const avro = require('avsc');
const crc32 = require('buffer-crc32');
const snappy = require('snappy');

const blockDecoder = new avro.streams.BlockDecoder({
  codecs: {
    snappy: (buf, cb) => {
      // Avro appends checksums to compressed blocks.
      const len = buf.length;
      const checksum = buf.slice(len - 4, len);
      snappy.uncompress(buf.slice(0, len - 4))
        .then((inflated) => {
          if (!checksum.equals(crc32(inflated))) {
            // We make sure that the checksum matches.
            throw new Error('invalid checksum');
          }
          cb(null, inflated);
        })
        .catch(cb);
    }
  }
});
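
To actually run this against a file, you can pipe a read stream into the decoder. A minimal usage sketch, reusing the file name from the question:

const fs = require('fs');

fs.createReadStream('person-10.snappy.avro')
  .pipe(blockDecoder)
  .on('metadata', (writerType) => console.log(writerType.name))
  .on('data', (record) => console.log(record));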

Equivalent JavaScript code to generate SHA256 hash

I have this code in Java that generates a SHA256 hash:
Hashing.sha256().hashString(value,Charsets.UTF_16LE).toString()
I'm trying to do the same in JavaScript/Node, so that hashing the same value returns the same result.
I tried using crypto-js, but without success (it returns a hash string, but a different one from that generated by the Java code).
I tried this, for example:
import * as sha256 from 'crypto-js/sha256';
import * as encutf16 from 'crypto-js/enc-utf16';
...
let utf16le = encutf16.parse(key);
let utf16Sha256 = sha256(utf16le);
let utf16Sha256String = utf16Sha256.toString();
Can you try something like this:
const CryptoJS = require('crypto-js');

function sha256Utf16le(word) {
  const utf16le = CryptoJS.enc.Utf16LE.parse(word);
  const utf16Sha256 = CryptoJS.SHA256(utf16le);
  return utf16Sha256.toString(CryptoJS.enc.Hex);
}
Otherwise, if you can give a sample of the input and expected output for the Java code, it will be easier.
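
For what it's worth, a minimal sketch using Node's built-in crypto module instead of crypto-js, assuming the Java side hashes the UTF-16LE bytes of the string as in the question:

const crypto = require('crypto');

// Hashing.sha256().hashString(value, Charsets.UTF_16LE) hashes the UTF-16LE
// encoding of the string, so feed the same bytes to Node's hash.
const value = 'some input'; // hypothetical sample value
const hex = crypto.createHash('sha256')
  .update(Buffer.from(value, 'utf16le'))
  .digest('hex');
console.log(hex);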

Random comma inserted at character 8192 in python "json" result called from node.js

I'm a JS developer just learning python. This is my first time trying to use node (v6.7.0) and python (v2.7.1) together. I'm using restify with python-runner as a bridge to my python virtualenv. My python script uses a RAKE NLP keyword-extraction package.
I can't figure out for the life of me why the return data in server.js has a random comma inserted at character 8192 and at rough multiples thereof. There's no pattern except the location: sometimes it's in the middle of an object key string, other times in the value, other times after the comma separating the object pairs. This completely breaks the JSON.parse() on the return data. Example outputs are below. When I run the script from a python shell, this doesn't happen.
I seriously can't figure out why this is happening, any experienced devs have any ideas?
Sample output in browser
[..., {...ate': 1.0, 'intended recipient': 4.,0, 'correc...}, ...]
Sample output in python shell
[..., {...ate': 1.0, 'intended recipient': 4.0, 'correc...}, ...]
DISREGARD ANY DISCREPANCIES REGARDING OBJECT CONVERSION AND HANDLING IN THE FILES BELOW. THE CODE HAS BEEN SIMPLIFIED TO SHOWCASE THE ISSUE
server.js
var restify = require('restify');
var py = require('python-runner');

var server = restify.createServer({...});

server.get('/keyword-extraction', function (req, res, next) {
  py.execScript(__dirname + '/keyword-extraction.py', {
    bin: '.py/bin/python'
  })
    .then(function (data) {
      var fData = JSON.parse(data); // <---- ERROR
      res.json(fData);
    })
    .catch(function (err) {...});
  return next();
});

server.listen(8001, 'localhost', function () {...});
keyword-extraction.py
import csv
import json
import RAKE

f = open('emails.csv', 'rb')
f.readline()  # skip line containing col names
outputData = []
try:
    reader = csv.reader(f)
    for row in reader:
        email = {}
        emailBody = row[7]
        Rake = RAKE.Rake('SmartStoplist.txt')
        rakeOutput = Rake.run(emailBody)
        for tuple in rakeOutput:
            email[tuple[0]] = tuple[1]
        outputData.append(email)
finally:
    f.close()
print(json.dumps(outputData))
This looks suspiciously like a bug related to the size of some buffer, since 8192 is a power of two.
The main thing here is to isolate exactly where the failure is occurring. If I were debugging this, I would:
1. Take a closer look at the output from json.dumps, by printing several characters on either side of position 8191, ideally the integer character codes (unicode, ASCII, or whatever).
2. If that looks OK, try capturing the output from the python script as a file and reading it directly in the node server (i.e. don't run a python script).
3. If that works, create a python script that takes that file and outputs it without manipulation, and have your node server execute that python script instead of the one it is using now.
That should help you figure out where the problem is occurring. From comments, I suspect that this is essentially a bug that you cannot control, unless you can increase the python buffer size enough to guarantee your data will never blow the buffer. 8K is pretty small, so that might be a realistic solution.
If that is inadequate, then you might consider processing the data on the node server, to remove every character at n * 8192, if you can consistently rely on that. Good luck.
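
If you do end up stripping characters, a minimal sketch of that workaround, assuming the corruption really is a single extra character after every 8192 real characters (untested against python-runner's actual output):

// drop one spurious character after every 8192-character chunk
function stripBufferArtifacts(str, chunkSize) {
  chunkSize = chunkSize || 8192;
  var out = '';
  // each real chunk is followed by one junk character, so step chunkSize + 1
  for (var i = 0; i < str.length; i += chunkSize + 1) {
    out += str.slice(i, i + chunkSize);
  }
  return out;
}

var fData = JSON.parse(stripBufferArtifacts(data));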

How to message child process in Firefox add-on like Chrome native messaging

I am trying to emulate Chrome's native messaging feature using Firefox's add-on SDK. Specifically, I'm using the child_process module along with the emit method to communicate with a python child process.
I am able to successfully send messages to the child process, but I am having trouble getting messages sent back to the add-on. Chrome's native messaging feature uses stdin/stdout. The first 4 bytes of every message in both directions represents the size in bytes of the following message so the receiver knows how much to read. Here's what I have so far:
Add-on to Child Process
// encode the message as UTF-8 bytes, then re-decode as latin1 so each byte
// maps 1:1 to a character code when emitted to stdin
var utf8 = new TextEncoder("utf-8").encode(message);
var latin = new TextDecoder("latin1").decode(utf8);
// the first 4 bytes carry the message length as a 32-bit integer
emit(childProcess.stdin, "data", new TextDecoder("latin1").decode(new Uint32Array([utf8.length])));
emit(childProcess.stdin, "data", latin);
emit(childProcess.stdin, "end");
Child Process (Python) from Add-on
text_length_bytes = sys.stdin.read(4)
text_length = struct.unpack('i', text_length_bytes)[0]
text = sys.stdin.read(text_length).decode('utf-8')
Child Process to Add-on
sys.stdout.write(struct.pack('I', len(message)))
sys.stdout.write(message)
sys.stdout.flush()
Add-on from Child Process
This is where I'm struggling. I have it working when the length is less than 255. For instance, if the length is 55, this works:
childProcess.stdout.on('data', (data) => { // data is '7' (55 UTF-8 encoded)
  var utf8Encoded = new TextEncoder("utf-8").encode(data);
  console.log(utf8Encoded[0]); // 55
});
But, like I said, it does not work for all numbers. I'm sure I have to do something with TypedArrays, but I'm struggling to put everything together.
The problem here is that Firefox tries to read stdout as a UTF-8 stream by default. Since single-byte UTF-8 characters only cover byte values 0-127, you get corrupted characters for byte values such as 255. The solution is to tell Firefox to read in binary encoding, which means you'll have to manually parse the actual message content later on:
var childProcess = spawn("mybin", [ '-a' ], { encoding: null });
Your listener would then work like
var decoder = new TextDecoder("utf-8");

var readIncoming = (data) => {
  // read the first four bytes, which indicate the size of the following message
  // (slice copies into a fresh, aligned 4-byte buffer so Uint32Array can read it)
  var size = (new Uint32Array(data.slice(0, 4).buffer))[0];
  //TODO: handle size > data.byteLength - 4
  // read the message: `size` bytes starting right after the length prefix
  var message = decoder.decode(data.subarray(4, 4 + size));
  //TODO: do stuff with message
  // Read the next message if there are more bytes.
  if (data.byteLength > 4 + size)
    readIncoming(data.subarray(4 + size));
};

childProcess.stdout.on('data', (data) => {
  // convert the data string to a byte array
  // The bytes got converted by char code, see https://dxr.mozilla.org/mozilla-central/source/addon-sdk/source/lib/sdk/system/child_process/subprocess.js#357
  var bytes = Uint8Array.from(data, (c) => c.charCodeAt(0));
  readIncoming(bytes);
});
Maybe this is similar to this problem:
Chrome native messaging doesn't accept messages of certain sizes (Windows)
Windows-only: Make sure that the program's I/O mode is set to O_BINARY. By default, the I/O mode is O_TEXT, which corrupts the message format as line breaks (\n = 0A) are replaced with Windows-style line endings (\r\n = 0D 0A). The I/O mode can be set using __setmode.

node.js / Write buffer to file

I have a file with a size of 108 bytes.
I want to add some text (a buffer) to this file, let's say "Hello world".
So I wrote the following:
fs.open("./tryit.txt", 'w+', function (err, fd1) {
var buffer = new Buffer("hello world");
fs.write(fd1, buffer, 0, 11, 109, function (err, bytesWrite, buffer) {
})
})
in order to write to the file starting at position 109.
I see that it writes it, but before the "hello world", all the text of the file was replaced by NUL characters.
How can I do this? Append is not an option, because in some cases I want to write to the middle of the file.
What you want is random access IO (read or write at a specific point in a file).
It's not provided in the default API but you may use an additional package like https://www.npmjs.org/package/random-access-file
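A rough sketch of what that could look like; note the exact API here is an assumption from memory, not verified against the package docs:

var raf = require('random-access-file');

// hypothetical usage: write 11 bytes at offset 108 without truncating the file
var file = raf('./tryit.txt');
file.write(108, Buffer.from('hello world'), function (err) {
  if (err) throw err;
});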
From docs:
'w+' - Open file for reading and writing. The file is created (if it does not exist) or truncated (if it exists)
"truncated" means that file becomes empty once opened.
You need a different mode, r+ for instance. a also might work, but not on Linux, according to docs.
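
A minimal sketch of the fix, keeping the question's fs approach but opening with 'r+' so the existing contents survive (position 108 is the end of a 108-byte file):

const fs = require('fs');

fs.open('./tryit.txt', 'r+', function (err, fd) {
  if (err) throw err;
  var buffer = Buffer.from('hello world');
  // write all 11 bytes of the buffer at file position 108
  fs.write(fd, buffer, 0, buffer.length, 108, function (err, bytesWritten) {
    if (err) throw err;
    fs.close(fd, function () {});
  });
});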
