I am using pdfkit to dynamically generate PDF documents within a node-webkit application. The PDFs contain people's comments coming from a remote source via an HTTP request.
It works really well; however, I have now noticed that comments in Japanese, Chinese, Arabic, etc. don't render correctly, and I have no way of knowing in advance what language a comment will be in, since I am gathering them from around the world.
I understand that I need to use a font that includes the proper characters, as explained here. I found the "Google Noto" open font family, which covers everything, but the problem is that there is no single TTF file with all languages, and there can't be, since a font file is limited to 65,535 glyphs.
I am trying to find a solution that lets me render text in (almost) any language within a PDF using pdfkit, without having to write a sophisticated language recognition tool, which I feel would be overkill.
Any thoughts and suggestions will be much appreciated.
UPDATE: Use font-manager, by the author of pdfkit, to substitute the font. Also, you may want to try PhantomJS, though I haven't done that myself. See the detailed response by levi below if you have the same problem. Hope it helps.
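Here is a minimal sketch of that font-manager substitution idea, assuming the npm font-manager module's substituteFontSync API and pdfkit's ability to load a font from a file path ('Helvetica' is just a stand-in default name):

// Hypothetical sketch: pick an installed font able to display each comment.
const fontManager = require('font-manager');
const PDFDocument = require('pdfkit');

const doc = new PDFDocument();

function writeComment(doc, comment) {
  // Ask the OS font database for a font that can render this text,
  // falling back from a default PostScript name.
  const descriptor = fontManager.substituteFontSync('Helvetica', comment);
  doc.font(descriptor.path).text(comment);
}

writeComment(doc, 'こんにちは世界'); // Japanese sample text
doc.end();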
Here is one idea. Download the fonts for the most popular languages, add them to a list, and sort it by most popular. For each comment, get the Unicode values of n random characters within the string. For each character, if its code is > 127 (outside the ASCII range), the comment may not be English. Using opentype.js, parse the font files one by one; for each font, check the cmap table for glyphs covering all the sampled character codes. If they all exist, choose that font and cache a mapping from Unicode code to font. Otherwise, try the next font.
Upon further consideration, it seems TTF files provide information on the Unicode ranges they support via the ulUnicodeRange fields of the OS/2 table. So perhaps you could build a mapping between each font and the Unicode ranges it supports, and use this to select the correct font, instead of parsing each font at run time.
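Here is a rough sketch of that idea with opentype.js in Node (the Noto file paths are placeholders, and in practice you would precompute or cache the font choice per Unicode range as suggested above):

const opentype = require('opentype.js');

// Fonts sorted by "most popular" first; the paths are hypothetical.
const fontPaths = [
  'fonts/NotoSans-Regular.ttf',       // Latin and general use
  'fonts/NotoSansJP-Regular.otf',     // Japanese
  'fonts/NotoSansArabic-Regular.ttf'  // Arabic
];
const fonts = fontPaths.map(path => ({ path, font: opentype.loadSync(path) }));

// Pick the first font whose cmap has a glyph for every sampled character.
function pickFont(text, sampleSize = 5) {
  const chars = Array.from(text);
  const sample = [];
  for (let i = 0; i < Math.min(sampleSize, chars.length); i++) {
    sample.push(chars[Math.floor(Math.random() * chars.length)]);
  }
  const match = fonts.find(({ font }) =>
    sample.every(ch => font.charToGlyphIndex(ch) > 0) // 0 means .notdef, i.e. missing glyph
  );
  return (match || fonts[0]).path; // fall back to the first (most popular) font
}

// Usage with pdfkit: doc.font(pickFont(comment)).text(comment);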
I'm working on a project which accepts user input of a name and subsequently navigates to a website to scrape data related to that name. Everything is going well, except when users input non-ASCII characters such as accented or non-Western characters. I'm looking for the simplest way to store those characters in a string without having JavaScript convert them to a "�".
I've done some research on the issue and found similar questions to mine, but they all seem to address removing accents from characters with accent folding, rather than simply storing those characters for later use.
I am using the readline-sync Node module to simplify the process of requesting user input. If that is part of the problem, please let me know! Here is the entirety of the code from my test algorithm:
const rlSync = require('readline-sync');
const name = rlSync.question('Enter player name (Case Sensitive): ');
console.log(name);
This is all of the code from the test algorithm where the issue arises, so I know the source is not elsewhere. The primary test case I have been using up to this point has been any name with the letter "ë", although that is not the only problematic character. When I type "Hëllo" in the input prompt, the program outputs "H�llo".
Thank you all so much for any help you can provide! <3
UPDATE based on everyone's responses and a bunch of research: I think y'all are right about the console settings being an issue, rather than the code. Does anyone have a suggestion as to a good alternative CLI that uses UTF-8, or a means of updating the settings in the Windows command prompt to do so?
My Windows version is 10.0.18362.267. I have tried setting the language to "Beta: use UTF-8" via the administrative language settings, but this seems to present another issue: Instead of printing "H�llo", the cmd printed "Hllo".
(If this is beyond the scope of this forum I totally understand... just hoping to get as much help as I can!) :-)
I re-read your question... I don't recall the node.js bit being there before, but....
Your issue is not in your program. It is the settings in your terminal. You need to change your terminal's settings to use UTF-8 and a font capable of displaying those characters, or switch to a terminal that can.
If your terminal only understands ASCII or is set to the wrong encoding, it shows the replacement character because it can't display those characters.
Node.js uses UTF-8 by default, so internally all should be well.
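A quick way to convince yourself that the string is intact in memory and only the display is broken (a hypothetical probe, not part of the original program):

const name = 'Hëllo';
console.log(name.length);                      // 5 - the ë is a single character
console.log(name.codePointAt(1).toString(16)); // 'eb', i.e. U+00EB
console.log(Buffer.from(name, 'utf8'));        // <Buffer 48 c3 ab 6c 6c 6f>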
Note:
I checked up on readline-sync to be sure it's not the problem, and what I read seems to support this hypothesis.
https://github.com/anseki/readline-sync/issues/58
ECMAScript (Node.JS) already supports Unicode, by default. If your environment (not readlineSync) does not support those characters (e.g. you use Windows), the console.log method in your code can not print those when the answer contains those characters.
Old answer:
If you're seeing that symbol in place of characters, it is almost certainly a font issue rather than a JavaScript issue. Try using a font that supports these characters. How you do this depends on what you're viewing the output with (i.e. terminal, browser, etc.). If that doesn't work, you may also need to specify UTF-8, which again depends on what you're viewing the output with.
This seems to be an issue with the text-encoding settings on your server. If the data is stored in a database, it may not be stored as UTF-8. If it happens in Node when reading from a file and printing to the console, make sure you specify UTF-8 when reading the file. If it happens, as in your case, when reading console input in the Node CLI, then it is your console's text encoding that doesn't support multibyte characters.
So this is a settings issue: make sure everything is UTF-8 (or even UTF-16), since accented characters are multibyte and need more than a single byte of storage.
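For the file-reading case, here is a minimal sketch of specifying UTF-8 in Node (comments.txt is a hypothetical file name; without the encoding argument you get a raw Buffer instead of a decoded string):

const fs = require('fs');

const text = fs.readFileSync('comments.txt', 'utf8'); // decoded string
const raw  = fs.readFileSync('comments.txt');         // Buffer of raw bytes
console.log(text);
console.log(raw.toString('utf8')); // decode explicitly when needed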
I am an amateur programmer and I want to make a multilingual website. My questions are (for my purposes, let the English website be website no. 1 and the Polish one website no. 2):
Should it be en.example.com and pl.example.com or maybe example.com/en and example.com/pl?
Should I make the full website in one language and then translate it?
How to translate it? Using XML or what? Should website 1 and website 2 be different html files or is there a way to translate a html file and then show the translation using XML or something?
If you need any code or something, tell me. Thank you in advance :)
1) I don't think it makes much difference. The most important thing is to ensure that Google can crawl both languages, so don't rely on JavaScript to switch between languages; have everything done server-side so both languages can be crawled and ranked in Google.
2) You can do one translation and then the other; you just have to ensure that the layout expands to accommodate more or less text without breaking. Maybe use lorem ipsum whilst designing.
3) I would put the translations into a database and then call the particular translation depending on whether it is EN or PL in the domain name. Ensure that the webpage and database use UTF-8 encoding, otherwise you will find that you get 'funny' characters being displayed.
My advice is to start using a framework.
For instance, if you use CakePHP, then you write
__('My name is')
and in translate file
msgid "My name is"
msgstr "Nazywam się"
Then you can easily translate to any other language, and it's pretty easy to implement.
Also, if you do not want to use a framework, you can check this link to see an example of how it works:
http://tympanus.net/codrops/2009/12/30/easy-php-site-translation/
While this question probably is not a good SO question due to its broad nature, it might be relevant to many users.
My approach would be templates.
Your suggestion of having two html files is bad for the obvious reason of duplication: say you need to change something on your site, you would always need to change two html files. Bad.
Having one html file and then parsing it and translating it sounds like a massive headache.
Some templating framework could help you massively. I have been using Smarty, but that's a personal choice and there are many options here.
The idea is you make a template file for your html and instead of actual content you use labels. Then in your php code you include the correct language depending on cookies, user settings or session data.
Storing labels is another issue here. Storing them in a database is a good option; however, remember you do not wish to make hundreds of queries against a database to fetch each label. What you can do is store them in a database, have it generate a language file (an array of label -> translation pairs) for faster access, and regenerate these files whenever you add or update labels.
Or you can skip the database altogether and just store them in files; however, as these grow they might not be as easy to maintain.
I think the easiest mistake for an "amateur programmer" to make in this area is to allow the two (or more) language versions of the site to diverge. (I've seen so-called professionals make that mistake too...) You need to design it so everything that's common between the two versions is shared, so that when the time comes to make changes, you only need to make the changes in one place. The way you do this will depend on your choice of tools, and I'm not going to start advising on that, because it depends on many factors, for example the extent to which your content is database-oriented.
Closely related to this are questions of who is doing the translation, how the technical developers and the translators work together, and how to keep track of which content needs to be re-translated when it changes. These are essentially process questions rather than coding questions, so not a good fit for SO.
Don't expect that a site for one language can be translated without any technical impact; you will find you have made all sorts of assumptions about the length of strings, the order of fields, about character coding and fonts, and about cultural things like postcodes, that turn out to be invalid when you try to convert the site to a different language.
You could make two different language files and use PHP constants to define the text you want to translate, for example:
lang_pl.php:
define("_TEST", "polish translation");
lang_en.php:
define("_TEST", "English translation");
Now you can let the user choose between the Polish and English translations, and based on that choice you include the corresponding language file.
So if you have a text field, you set its value to _TEST (between PHP tags),
and it will show the translation for the chosen language.
The place I worked at did it like this: they didn't have too much text on their site, so they kept it in a database. As your tags include php, I assume you know how to use databases. They had a MySQL table called languages with a language id (in your case 1 for en and 2 for pl) and the texts in columns, so the column names were main_heading, intro_text, about_us, and so on. When a user arrives, they select the language and a PHP request fetches the page in that language. This is easy because your content is static and can be translated before the site goes online; if your content is dynamic (users add content), you may need machine translation, because you cannot expect your users to write their entries in all languages.
I have raw data in text file format with a lot of repetitive tokens (~25%). I would like to know if there's an algorithm that will help:
(A) store data in compact form
(B) yet, allow at run time to re-constitute the original file.
Any ideas?
More details:
the raw data is consumed in a pure html+javascript app, for instant search using regex.
data is made of tokens containing (case-sensitive) alpha characters, plus a few punctuation symbols.
tokens are separated by spaces and new lines.
Most promising algorithm so far: the succinct data structures discussed in the links below, but reconstituting the original looks difficult.
http://stevehanov.ca/blog/index.php?id=120
http://ejohn.org/blog/dictionary-lookups-in-javascript/
http://ejohn.org/blog/revised-javascript-dictionary-search/
PS: server-side gzip is being employed right now, but it's only a transport-layer optimization and doesn't help maximize the use of offline storage, for example. Given the massive 25% repetitiveness, it should be possible to store the data in a more compact way, shouldn't it?
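To illustrate what simple token-level dictionary coding could look like (just a sketch with a made-up sample string, not a drop-in solution):

function compress(text) {
  const tokens = text.split(/(\s+)/);   // keep the separators so nothing is lost
  const dict = [];
  const index = new Map();
  const encoded = tokens.map(tok => {
    if (!index.has(tok)) {
      index.set(tok, dict.length);
      dict.push(tok);
    }
    return index.get(tok);              // repeated tokens become small integers
  });
  return { dict, encoded };
}

function decompress({ dict, encoded }) {
  return encoded.map(i => dict[i]).join('');
}

const sample = 'the cat sat on the mat the cat';
const packed = compress(sample);
console.log(decompress(packed) === sample); // true - the original is reconstituted exactly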
Given that the actual use is pretty unclear, I have no idea whether this is helpful or not, but for smallest total size (html + javascript + data) some people came up with the idea of storing text data in a greyscale .png file, one byte per pixel. A small loader script can then draw the .png to a canvas, read it pixel by pixel and reassemble the original data this way. This gives you deflate compression without having to implement it in JavaScript. See e.g. here for more detailed information.
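A hedged sketch of the loader side of that trick, assuming the data was packed one byte per greyscale pixel into a hypothetical data.png served from the same origin (the real payload length would also need to be stored somewhere, e.g. in the first few pixels):

function loadPngData(url, callback) {
  const img = new Image();
  img.onload = () => {
    const canvas = document.createElement('canvas');
    canvas.width = img.width;
    canvas.height = img.height;
    const ctx = canvas.getContext('2d');
    ctx.drawImage(img, 0, 0);
    const pixels = ctx.getImageData(0, 0, img.width, img.height).data; // RGBA bytes
    let text = '';
    for (let i = 0; i < pixels.length; i += 4) {
      text += String.fromCharCode(pixels[i]); // greyscale: the red channel holds the byte
    }
    callback(text);
  };
  img.src = url;
}

loadPngData('data.png', text => console.log(text.length));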
Please, do not use a technique like that unless you have pretty esoteric requirements, e.g. for a size-constrained programming competition. Your coworkers will thank you :-)
Generally speaking, it's a bad idea to try to implement compression in JavaScript. Compression is the exact type of work that JS is the worst at: CPU-intensive calculations.
Remember that JS is single-threaded¹, so for the entire time spent decompressing data, you block the browser UI. In contrast, HTTP gzipped content is decompressed by the browser asynchronously.
Given that you have to reconstruct the entire dataset (so as to test every record against a regex), I doubt the Succinct Trie will work for you. To be honest, I doubt you'll get much better compression than the native gzipping.
¹ Web Workers notwithstanding.
I have a grammar for a domain specific language, and I need to create a javascript code editor for that language. Are there any tools that would allow me to generate
a) a javascript incremental parser
b) a javascript auto-complete / auto-suggest engine?
Thanks!
An example of implementing content assist (auto-complete) using the Chevrotain JavaScript parsing DSL:
https://github.com/SAP/chevrotain/tree/master/examples/parser/content_assist
Chevrotain was designed specifically to build parsers used (as part of) language services tools in Editors/IDEs.
Some of the relevant features are:
Automatic Error Recovery / Fault tolerance because editors and IDEs need to be able to handle 'mostly valid' inputs.
Every grammar rule may be used as the starting rule, since an editor/IDE may want to implement incremental parsing for performance reasons.
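To give a feel for it, here is a hedged sketch using Chevrotain's computeContentAssist API (the toy SELECT grammar and token names are made up for illustration; see the linked example for the real thing):

const { createToken, Lexer, CstParser } = require('chevrotain');

// Toy token set for a tiny SELECT-style language.
const Select = createToken({ name: 'Select', pattern: /SELECT/ });
const From = createToken({ name: 'From', pattern: /FROM/ });
const Identifier = createToken({ name: 'Identifier', pattern: /[a-zA-Z]\w*/ });
const WhiteSpace = createToken({ name: 'WhiteSpace', pattern: /\s+/, group: Lexer.SKIPPED });

const allTokens = [WhiteSpace, Select, From, Identifier];
const lexer = new Lexer(allTokens);

class SelectParser extends CstParser {
  constructor() {
    super(allTokens);
    this.RULE('selectStatement', () => {
      this.CONSUME(Select);
      this.CONSUME(Identifier);
      this.CONSUME(From);
      this.CONSUME2(Identifier);
    });
    this.performSelfAnalysis();
  }
}

const parser = new SelectParser();

// Given the text typed so far, ask the parser which token types could come next.
function suggestNext(textSoFar) {
  const tokens = lexer.tokenize(textSoFar).tokens;
  const suggestions = parser.computeContentAssist('selectStatement', tokens);
  return suggestions.map(s => s.nextTokenType.name);
}

console.log(suggestNext('SELECT foo ')); // should suggest the From token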
You may want jison, a JS parser generator. In terms of auto-complete / auto-suggest, most of the stuff out there that I know of is based on word completion rather than code completion. But once you have a parser running, I don't think that part is too difficult.
This is difficult. I'm doing the same sort of thing myself.
One approach is:
You need a parser which will give you an array of the currently possible ASTs for the text up until the token before the current cursor position.
From there you can see that the next token can be one of a number of types (usually just one), and do the completion based on the partial text.
If I ever get my incremental parser working, I'll send a link.
Good luck, and let me know if you find a package which does this.
Chris.
I'm trying to scrape some Japanese websites for a personal project. Sites with text in UTF-8 work perfectly fine, as you'd expect, but I can't get any text out of sites specifying other international encodings, specifically EUC-JP. Node also seems to be interpreting the text and performing modifications rather than passing it on raw: I've tried setting the response to be interpreted as both ASCII and binary, and then set my terminal application to EUC-JP, but after doing a console.log(), neither results in the actual text.
I've had a scan through the Node documentation, and it seems to only support two main text encodings (apart from binary and base64).
I'm using the inbuilt http client, and specifying the encoding through the response.setEncoding method, e.g. response.setEncoding('utf8');
How are other people working with international text in Node (especially with regard to situations where the original data is not in UTF-8?) Are binary buffers the only way?
While I've done a bit of research, I'm not hugely knowledgeable when it comes to character encoding, so simple answers would be appreciated. Thanks!
There is a module that adds iconv bindings to node.js. If you grab the response as a binary Buffer, you can use Iconv.convert to convert it from EUC-JP to UTF-8 (take a look at the README for an example).
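A minimal sketch of that approach with the node iconv bindings (the URL is a placeholder and error handling is omitted):

const http = require('http');
const Iconv = require('iconv').Iconv;

http.get('http://www.example.jp/', (res) => {
  const chunks = [];
  res.on('data', (chunk) => chunks.push(chunk)); // keep raw Buffers; do not call setEncoding
  res.on('end', () => {
    const eucjp = Buffer.concat(chunks);
    const iconv = new Iconv('EUC-JP', 'UTF-8');
    console.log(iconv.convert(eucjp).toString('utf8'));
  });
});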