How to extract text from PDF? - javascript

I'm creating a React application with Node.js, and it needs to extract some text from a PDF that the user uploads.
I have already tried pdf-parse, pdf2json, pdf.js and react-pdf-js. The file is selected by the user, but all of those libraries use a path to access the file. What should I do?
PS1: I'm using an input type='file' button to get the file.
The code must work in both Node.js and the web browser.

You didn't post any code, so my answer assumes the scenario you described.
Take a look at this example, which shows how to use pdf.js:
http://git.macropus.org/2011/11/pdftotext/example/
And this is the code on GitHub:
https://github.com/hubgit/hubgit.github.com/tree/master/2011/11/pdftotext
You will probably have to make some changes to fit your requirements.
Enjoy.

I'm answering my own question. First I create a regular HTML input.
<input type='file'/>
I'm using React, so I use the onChange attribute instead of an id.
When the user selects a file, a handler function is called, and I use the following code to get the file:
const file = event.target.files[0];
The file object does not have a path, which is what PDF.js would otherwise use to load the file.
So I use a FileReader to read the file into an ArrayBuffer (its raw bytes):
const fileReader = new FileReader();
Then we assign a function to fileReader.onload. The function can be found here:
fileReader.onload = function() {...}
Finally we do this:
fileReader.readAsArrayBuffer(file);
Important PS: pdf.pdfInfo must be replaced with pdf in newer PDF.js versions.
Thanks for helping.
Extra PS: To use pdfjsLib as PDFJS in React, I did this in the index.html file:
window.PDFJS = pdfjsLib
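Putting the steps above together, a minimal sketch might look like the following. It assumes a recent PDF.js build exposed as pdfjsLib, where getDocument({ data }) returns a loading task with a promise property. The helper names extractPdfText and handleFileChange are made up for illustration, and this is not the onload function the answer links to.

```javascript
// Sketch only: `pdfjsLib` is passed in so the same function can be used with
// the browser global or with a Node.js build of PDF.js.
async function extractPdfText(pdfjsLib, data) {
  // `data` holds the PDF bytes (ArrayBuffer or Uint8Array); no file path is needed.
  const pdf = await pdfjsLib.getDocument({ data }).promise;
  const pages = [];
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    const page = await pdf.getPage(pageNumber);
    const content = await page.getTextContent();
    // Each text item carries its string in `str`.
    pages.push(content.items.map((item) => item.str).join(' '));
  }
  return pages.join('\n');
}

// Hypothetical React onChange handler for <input type='file'/> (browser only).
function handleFileChange(event, pdfjsLib) {
  const file = event.target.files[0];
  const fileReader = new FileReader();
  fileReader.onload = async function () {
    const bytes = new Uint8Array(fileReader.result);
    console.log(await extractPdfText(pdfjsLib, bytes));
  };
  fileReader.readAsArrayBuffer(file);
}
```

In Node.js there is no FileReader, but the same extractPdfText function could probably be reused by reading the bytes with fs.readFile and wrapping the resulting Buffer in new Uint8Array(...).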

Related

Turn CSV file into array using Vanilla Javascript without input element

I am trying to read a CSV file and turn it into an array. The file is local and could be in the same folder as the script file. Since I am writing JSX for Photoshop, I can't use any other library. There are a lot of tutorials out there that use an input element, which is not what I need. The path of the file can be hard-coded. What I am trying to do is read the CSV and take out some data. Please advise!
Let me explain it clearly!
I am writing a JSX script for Photoshop, which has no browser elements such as an input tag. It must be pure JavaScript with no library such as jQuery. In everything I found on Google, people use the browser's input tag to let the user select the CSV file. I just want the file path to be hard-coded: it is a fixed path and filename. I can't find any tutorial for reading a CSV file and turning it into an array with vanilla JavaScript.
You can use the File class. How this works is explained in the ExtendScript toolkit docs which are installed on your computer alongside Creative Cloud. An online version can also be found here. (The scripting guide references this under the File object on page 110, referring to a section about JavaScript on different platforms on page 32, which then refers to the ExtendScript docs.)
Example:
const file = new File("/c/Users/user/Desktop/text.csv");
file.encoding = 'UTF-8';
file.open("r");
const contents = file.read();
file.close();
alert(contents);
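To get from the string in contents to an array, a small parser is needed. The following is a minimal sketch written in old-style JavaScript (var, plain loops) on the assumption that ExtendScript does not offer newer language features. The function name parseCsv is made up for illustration. It handles quoted fields and doubled quotes, but it has not been tested inside Photoshop.

```javascript
// Parses CSV text into an array of rows, each row being an array of strings.
function parseCsv(text) {
  var rows = [];
  var row = [];
  var field = "";
  var inQuotes = false;
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    if (inQuotes) {
      if (ch === '"') {
        if (text.charAt(i + 1) === '"') {
          field += '"'; // a doubled quote inside a quoted field is a literal quote
          i++;
        } else {
          inQuotes = false;
        }
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      row.push(field);
      field = "";
    } else if (ch === "\n" || ch === "\r") {
      if (ch === "\r" && text.charAt(i + 1) === "\n") i++; // treat CRLF as one break
      row.push(field);
      field = "";
      rows.push(row);
      row = [];
    } else {
      field += ch;
    }
  }
  // Flush the last row when the text does not end with a line break.
  if (field !== "" || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}
```

With the example above, parseCsv(contents) would return the rows, so rows[0][0] is the first cell of the first line.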

How can I Read a Font File That a User Uploads?

I am working on a web-based project that is written entirely in JavaScript. In it, I want users to be able to format their text exactly as they want. Naturally, I would like them to be able to upload their own fonts. However, I'm not sure how to read the file they upload. I can take in the file, but I can't use their .ttf/.woff/.woff2 etc. as an actual font file. I've used the FileReader API to read in a .ttf as a data URL, which puts it into base64. However, I'm not sure how to turn it back into a file.
I've found this code from another post made on here, but it doesn't exactly do what I need it to do:
//read the file
const reader = new FileReader();
reader.addEventListener('load', (event) => {
<usersSelectedFile>.src = event.target.result;
});
var fontFile = reader.readAsDataURL(file);
With this, I get the file in base64. I know how to use font-face, but I've tried passing this fontFile in a font face style sheet but I got nothing from it.
My ultimate question is: How can I read a file in base64 as if it were a normal file? How should I reference it in font-face?
ALSO: I want to mention that I am trying to have this be stored in localStorage, as I wouldn't want any user-made changes to be global.
This is an interesting problem that I would like to know the definitive solution to. The best I could find by searching around was this solution, which uses an API for this specific problem. I have not tested it yet, but it seems reliable.
As for the localStorage issue, have you tried converting the file to a JSON string using JSON.stringify(), and then using localStorage.setItem('name',DATAHERE)? I am not sure if this works with files, but this is what I use for arrays and non-string information when saving to localStorage.
Sorry for not having anything concrete for you. I'm looking forward to working this out further if none of my recommendations helped you.
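One possible approach, sketched below under some assumptions, is to put the data URL straight into the src of an @font-face rule. A data URL is already a string, so it can be passed to localStorage.setItem without JSON.stringify. The function names, the storage key prefix and the family name are hypothetical, and the browser's localStorage quota (typically a few megabytes) may be too small for large fonts.

```javascript
// Builds an @font-face rule whose src is a data URL such as the one produced
// by FileReader.readAsDataURL. Assumes `family` is a plain name without quotes.
function buildFontFaceRule(family, dataUrl) {
  return '@font-face { font-family: "' + family + '"; src: url("' + dataUrl + '"); }';
}

// Browser-only usage; `file` is assumed to come from an <input type="file">.
function installUploadedFont(file, family) {
  const reader = new FileReader();
  reader.addEventListener('load', () => {
    const dataUrl = reader.result; // the result arrives here, in the load event
    localStorage.setItem('font:' + family, dataUrl); // hypothetical key name
    const style = document.createElement('style');
    style.textContent = buildFontFaceRule(family, dataUrl);
    document.head.appendChild(style);
  });
  reader.readAsDataURL(file); // returns undefined, so do not assign its return value
}
```

After the rule is installed, any element styled with font-family set to the chosen family name should pick up the uploaded font.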

Creating an Image File using node.js buffer from input text

I have a requirement in which I need to convert an input text to a png/jpeg file and then convert to base64 string and send as an input to an API.
I cannot use node.js fs module since I cannot physically create files.
So I was trying to use the node.js Buffer module to achieve the same.
But the issue I'm facing is that I cannot add an extension to it (I don't know if there is any such option).
Is there any other way of doing so?
Below is the code I tried...
function textToFileBase64(str) {
  var buf = Buffer.from(str, 'utf-8');
  return buf.toString('base64');
}
The only problem with the above code is that it creates a file without extension and even if I need the file as abc.png, it says the file is damaged when I open it.
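The "damaged" message can be explained with a small check. Base64-encoding the text only re-encodes the text's own bytes, so decoding it gives the text back rather than PNG data. The sketch below assumes Node.js; the helper isPng is made up for illustration. Actually drawing the text into an image would need an image-rendering library, which is not shown here.

```javascript
// The first eight bytes of every PNG file.
const PNG_SIGNATURE = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);

// Returns true when the buffer starts with the PNG signature.
function isPng(buffer) {
  return buffer.length >= 8 && buffer.subarray(0, 8).equals(PNG_SIGNATURE);
}

// Same logic as the function in the question.
function textToFileBase64(str) {
  return Buffer.from(str, 'utf-8').toString('base64');
}

// Decoding the base64 gives back the original text bytes, not image data.
const decoded = Buffer.from(textToFileBase64('hello'), 'base64');
console.log(decoded.toString('utf-8')); // hello
console.log(isPng(decoded)); // false
```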

How to send only the text from a text file

What I need to do is:
1. Let the user choose a txt file from their disk
2. Get the text from it into, let's say, a variable
3. Send it (the variable value) via AJAX
For the first point, I want to know if I should use a normal file input (as if I wanted to send the file via POST): <input type="file">
For the second point, I need to know how to get the name of the file the user selected and then read the text from it. Also, I'm not good with JavaScript, so I don't really know how long a string can be (the file will have about 15k lines on average).
For the third point, I don't need anything else, as long as I can have the data stored in a variable or an array.
Thanks in advance.
P.S. I guess JavaScript is not a fast language, but (depending on the editor that produced it) the file is sometimes laid out so that all the data I need is in the first 5 or 6 lines. Is it possible to read only the first few lines of the file?
It is possible to get what you want using the File API, as @dandavis and other commenters have mentioned (and linked), but there are some things to consider about that solution, namely browser support. The File API is currently a W3C working draft, and even W3C-recommended features aren't always fully supported by all browsers.
What solution is "best" for you really boils down to what browser/versions you want to support. If it were my own personal project or for a "modern" site/audience, I would use the File API. But if this is for something that requires maximum browser support (for older browsers), I would not currently recommend using the File API.
Having said all that, here is a suggested solution that does NOT involve the File API.
1. Supply a file input in a form so the user can specify the file. The user has to select the file (JavaScript cannot do this).
2. Use form.submit() or set the target attribute to submit the form. There is an iframe trick for submitting a form without refreshing the page.
3. Use your server-side language of choice to respond with the file info (name, contents, etc.). For example, in PHP you'd access the posted file with $_FILES.
4. Then use JavaScript to parse the response. Normally you'd send it as a JSON-encoded response. After that you can do whatever you want with the file info in JavaScript.
With Chrome and Firefox you can read the contents of a text file like this:
HTML:
<input type="file" id="in-file" />
JavaScript with jQuery:
var fileInput = $('#in-file');
fileInput.change(function(e) {
  var reader = new FileReader();
  reader.onload = function(e) {
    console.log(reader.result);
  };
  reader.readAsText(fileInput[0].files[0]);
});
Internet Explorer 9 and earlier do not support the FileReader object; IE 10 and later do.
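For the P.S. about reading only the first few lines, one option is to slice the file before reading it, so only the first chunk is loaded. The sketch below is an illustration under assumptions: the function names are made up, and the 4096-byte chunk size is arbitrary. The last line of the chunk may be cut off, and a multi-byte character could be split at the chunk boundary.

```javascript
// Returns at most `count` lines from the start of `text`.
function firstLines(text, count) {
  return text.split(/\r\n|\r|\n/).slice(0, count);
}

// Browser-only usage: read just the first chunk of the file instead of all of it.
function readFirstLines(file, count, onDone) {
  var reader = new FileReader();
  reader.onload = function () {
    onDone(firstLines(reader.result, count));
  };
  // Blob.slice(start, end) gives a view of the first bytes without loading the rest.
  reader.readAsText(file.slice(0, 4096));
}
```

With the input from the example above, readFirstLines(fileInput[0].files[0], 6, callback) would pass an array of up to six lines to the callback.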

Generate a Word document in JavaScript with Docx.js?

I am trying to use docx.js to generate a Word document but I can't seem to get it to work.
I copied the raw code into the Google Chrome console after amending line 247 to fix a "'textAlign' undefined error"
if (inNode.style && inNode.style.textAlign){..}
Which makes the function convertContent available. The result of which is an Object e.g.
JSON.stringify( convertContent($('<p>Word!</p>')[0]) )
Results in -
"{"string":
"<w:body>
<w:p>
<w:r>
<w:t xml:space=\"preserve\">Word!</w:t>
</w:r>
</w:p>
</w:body>"
,"charSpaceCount":5
,"charCount":5,
"pCount":1}"
I copied
<w:body>
<w:p>
<w:r>
<w:t xml:space="preserve">Word!</w:t>
</w:r>
</w:p>
</w:body>
into Notepad++ and saved it as a file with an extension of 'docx', but when I open it in MS Word it says 'cannot be opened because there is a problem with the contents'.
Am I missing some attribute or XML tags or something?
You can generate a Docx Document from a template using docxtemplater (library I have created).
It can replace tags by their values (like a template engine), and also replace images in a paid version.
Here is a demo of the templating engine: https://docxtemplater.com/demo/
This code can't work on JSFiddle because of the Ajax calls to local files (everything that is in the blank folder), unless you enter all the files in ByteArray format and use the jsFiddle echo API: http://doc.jsfiddle.net/use/echo.html
I know this is an older question and you already have an answer, but I struggled getting this to work for a day, so I thought I'd share my results.
Like you, I had to fix the textAlign bug by changing the line to this:
if (inNode.style && inNode.style.textAlign)
Also, it didn't handle HTML comments. So, I had to add the following line above the check for a "#text" node in the for loop:
if (inNodeChild.nodeName === '#comment') continue;
Creating the docx was tricky, since there is absolutely no documentation for this library yet. Looking through the code, I saw that it expects the HTML to be in a File object. For my purposes, I wanted to use the HTML I had rendered, not an HTML file the user has to select and upload. So I had to trick it by making my own object with the same property it was looking for and passing that in. To save the result on the client, I use FileSaver.js, which requires a blob. I included this function that converts base64 into a blob. So my code to implement it is this:
var result = docx({ DOM: $('#myDiv')[0] });
var blob = b64toBlob(result.base64, "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
saveAs(blob, "test.docx");
In the end, this works for simple Word documents, but it isn't nearly sophisticated enough for anything more. I couldn't get any of my styles to render, and I didn't even attempt to get images working. I've since abandoned this approach and am now researching DocxgenJS or some server-side solution.
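The b64toBlob helper used above is not shown in the answer. One possible version is sketched below. It is an assumption about what such a helper could look like, not necessarily the function the answer used, and it relies on atob and Blob being available (browsers and recent Node.js versions).

```javascript
// Converts a base64 string into a Blob with the given MIME type.
function b64toBlob(b64Data, contentType) {
  const binary = atob(b64Data); // decoded string, one character per byte
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Blob([bytes], { type: contentType || '' });
}
```

The returned Blob can then be passed to saveAs from FileSaver.js, as in the snippet above.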
You may find this link useful,
http://evidenceprime.github.io/html-docx-js/
An online demo here:
http://evidenceprime.github.io/html-docx-js/test/sample.html
You are doing the correct thing codewise, but your file is not a valid docx file. If you look through the docx() function in docx.js, you will see that a docx file is actually a zip containing several xml files.
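This is why the XML saved from Notepad++ is rejected: Word expects a ZIP package (with parts such as [Content_Types].xml and word/document.xml), not a bare XML fragment. A quick way to see the difference is to look at the first bytes of the file. The sketch below is only an illustration, and the function name looksLikeZip is made up.

```javascript
// Non-empty ZIP archives (and therefore .docx files) start with "PK\x03\x04".
function looksLikeZip(bytes) {
  return bytes.length >= 4 &&
    bytes[0] === 0x50 && bytes[1] === 0x4b &&
    bytes[2] === 0x03 && bytes[3] === 0x04;
}

// The XML fragment from the question, encoded as bytes.
const xmlBytes = new TextEncoder().encode('<w:body><w:p></w:p></w:body>');
console.log(looksLikeZip(xmlBytes)); // false

// A byte sequence that begins like a ZIP local file header.
console.log(looksLikeZip(new Uint8Array([0x50, 0x4b, 0x03, 0x04, 0x14]))); // true
```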
I am using Open Xml SDK for JavaScript.
http://ericwhite.com/blog/open-xml-sdk-for-javascript/
Basically, on the web server, I keep an empty docx file as a template.
When the user clicks "new docx file" in the browser, I retrieve the empty docx file as a template, convert it to Base64 and return it as the Ajax response.
In the client script, you convert the Base64 string to a byte array and use openxmlsdk.js to load the byte array as a JavaScript OpenXmlPackage object.
Once you have the package loaded, you can use the regular OpenXmlPart to build a real document (inserting images, creating tables and rows).
The last step is to stream it out to the end user as a document. This part is security related. In my code, I send it back to the web server, where it is saved temporarily, and then prepare an HTTP response that prompts the end user to download it.
Check the URL above, there are useful samples of doing this in JavaScript.
