I have some PDF templates that contain placeholders for things like a name, company, etc. They are in the format
<<'NAME'>> or <<'COMPANY'>>
Currently the process at my company is to replace all of these placeholders by hand when we get the information. I am trying to automate this by reading the information from a CSV file and doing a find and replace on the placeholders. However, the only files I have for the templates are InDesign files and PDFs. I looked at the InDesign files, and as far as I can tell they are executables and impossible to read in.
I was hoping someone knew of a way to read in a PDF file to do a regex on it to replace the placeholder text.
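The substitution step itself can be sketched as below, assuming the template text is already available as a plain string (for example exported from InDesign or extracted from the PDF by some other tool; that extraction is not shown, and rewriting text inside a PDF directly is generally much harder). The names `fill_placeholders` and `fill_from_csv` and the CSV column names are illustrative assumptions, not part of any existing tool.

```python
import csv
import io
import re

# Matches placeholders of the form <<'NAME'>> and captures the key.
PLACEHOLDER = re.compile(r"<<'([A-Z_]+)'>>")


def fill_placeholders(template_text, row):
    """Replace each <<'KEY'>> with row['KEY']; unknown keys are left untouched."""
    def substitute(match):
        key = match.group(1)
        return row.get(key, match.group(0))
    return PLACEHOLDER.sub(substitute, template_text)


def fill_from_csv(template_text, csv_text):
    """Return one filled copy of the template per CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [fill_placeholders(template_text, row) for row in reader]


if __name__ == "__main__":
    # Hypothetical template text and CSV data, for illustration only.
    template = "Dear <<'NAME'>>, welcome to <<'COMPANY'>>."
    data = "NAME,COMPANY\nAda,Example Ltd\n"
    for letter in fill_from_csv(template, data):
        print(letter)
```

Leaving unknown placeholders in place, rather than deleting them, makes it easy to spot a missing CSV column in the output.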
Related
I'm looking for a tool or script that could take all instances of a person's first name and last name in a set of PDF-based application forms and replace them with a randomly generated four-letter string.
Each PDF would have a different name (completed by a different person), but the names could be found in the field, or at least are in the same place each time. This would serve to anonymise (blind) job applications. There will be multiple instances of each name throughout the file, as references are included. The applicant names are also in the filenames.
I'll be looking to do this for hundreds of files. I could potentially have a file with the text to find and replace in columns but not sure how to achieve this.
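A minimal sketch of the name-to-code part, assuming the text of each form (or its filename) is available as a string; writing the result back into a PDF is not shown. The function names `build_mapping` and `anonymise` are made up for this example.

```python
import random
import re
import string


def make_code(rng, used):
    """Draw a four-letter uppercase code that has not been handed out yet."""
    while True:
        code = "".join(rng.choice(string.ascii_uppercase) for _ in range(4))
        if code not in used:
            used.add(code)
            return code


def build_mapping(names, seed=None):
    """Map each distinct name to its own random four-letter code."""
    rng = random.Random(seed)
    used = set()
    return {name: make_code(rng, used) for name in dict.fromkeys(names)}


def anonymise(text, mapping):
    """Replace whole-word, case-insensitive occurrences of every name in one pass."""
    if not mapping:
        return text
    # Longest names first so a longer name wins over a shorter prefix of it.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b", re.IGNORECASE
    )
    lookup = {name.lower(): code for name, code in mapping.items()}
    return pattern.sub(lambda m: lookup[m.group(1).lower()], text)


if __name__ == "__main__":
    mapping = build_mapping(["Sarah", "Jones"])
    print(anonymise("Reference for Sarah Jones: SARAH is reliable.", mapping))
    print(anonymise("sarah_jones_application.pdf", {"sarah_jones": "ABCD"}))
```

Doing all replacements in a single regular-expression pass avoids a generated code being replaced again by a later name. The same mapping could be saved to a spreadsheet so the blinding can be reversed afterwards.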
I've seen this post on how to replace text in multiple files ("How to program a text search and replace in PDF files"), but it is less helpful for me because I have to replace different words in each file.
Any pointers to posts, tools or scripts that may help would be very useful. Many thanks.
-Sarah
I am not sure if the title is an appropriate description of what I intend to do. However, below is the URL from which I want to parse the CSV file in Python (the CSV handle is visible in the top right corner of the interactive table).
https://www.mcxindia.com/market-data/bhavcopy
I have parsed files before using Requests and lxml, but in those cases the address (or location) of the CSV file was rather straightforward. In this case, I am not able to ascertain the actual URL of the file. Although rudimentary, my assessment is that it is embedded in JavaScript code. My question is whether I can indeed parse files such as this and, if yes, how to do it using Requests and lxml.
This is public data, and a very inefficient alternative is to download the data daily and then parse the locally stored CSV file, but that is not automation. Any suggestion on how I can automate this task would be very valuable.
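When a download link is built by JavaScript, one common approach is to find the underlying request in the browser's developer tools (network tab) and replay it from a script. The sketch below only shows the shape of that: the real request URL, method, and headers for this site are not known here, so `url` is left as a parameter, and the column names in the sample are invented. It uses `urllib` from the standard library rather than Requests to stay self-contained.

```python
import csv
import io
import urllib.request


def fetch_text(url, timeout=30):
    """Download a URL and decode the response body as text."""
    # Some sites reject requests without a browser-like User-Agent header.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_csv_text(csv_text):
    """Turn CSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


if __name__ == "__main__":
    # Hypothetical sample; replace with fetch_text(<URL found in dev tools>).
    sample = "Symbol,Close\nGOLD,100\n"
    for row in parse_csv_text(sample):
        print(row["Symbol"], row["Close"])
```

Keeping the download and the parsing separate means the parsing can be tested against a locally saved file before the network part is worked out.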
I am in the process of moving a static HTML site onto WordPress.
I am trying to figure out a way to pull specific HTML content from the files (title tags, description tags, <h1> tags, etc.). I have around 120 local files, and doing it all by hand would be a long process.
However, if I could get this data into a CSV I can quickly move this site.
Does anyone have any advice or experience with this type of process? Any help would be greatly appreciated.
The question is about extracting certain HTML elements, out of a given HTML file. There are multiple ways to do this. Let me point out some of them below.
1) Use a script with a library to do this. For Java, use jsoup.
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.parser.Parser;
    // Parse with the XML parser so the markup is kept exactly as written
    String br = "<html><source>foo bar bar</source></html>";
    Document doc = Jsoup.parse(br, "", Parser.xmlParser());
    for (Element sentence : doc.getElementsByTag("source")) {
        System.out.println(sentence.text());
    }
This prints the text of every element with the tag source. You can do the same in other languages, such as Python (use BeautifulSoup) and Node.js.
2) You can write a script to read HTML files as text files and do a search on text.
Move all your HTML files into a folder, and write a small program to load each file and search for the specific tags. Later save it to a CSV or any preferred output.
3) You can do the same with grep.
Simply do a search and redirect the results into a CSV file.
There are several other ways to do it. Since you mentioned that the manual workload is high, try writing a small script to get the job done. Use the first approach, as it is faster and easier.
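For the folder-of-files approach, a minimal Python sketch might look like the following. It uses the standard library's `html.parser` instead of BeautifulSoup so that it has no dependencies; the class and function names (`PageFields`, `extract_fields`, `write_csv`) and the choice of columns are assumptions made for this example.

```python
import csv
from html.parser import HTMLParser
from pathlib import Path


class PageFields(HTMLParser):
    """Collect <title>, the meta description, and the first non-empty <h1>."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": "", "description": "", "h1": ""}
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.fields["description"] = attrs.get("content") or ""
        elif tag in ("title", "h1") and not self.fields[tag]:
            self._current = tag

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        if tag == self._current:
            self.fields[tag] = self.fields[tag].strip()
            self._current = None


def extract_fields(html_text):
    """Return the collected fields for one HTML document."""
    parser = PageFields()
    parser.feed(html_text)
    parser.close()
    return parser.fields


def write_csv(folder, out_path):
    """Write one CSV row per *.html file found directly in `folder`."""
    columns = ["file", "title", "description", "h1"]
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        for path in sorted(Path(folder).glob("*.html")):
            text = path.read_text(encoding="utf-8", errors="replace")
            writer.writerow({"file": path.name, **extract_fields(text)})
```

Calling `write_csv("site", "pages.csv")` (both paths hypothetical) would produce a file that can be opened in a spreadsheet or fed to a WordPress import plugin.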
I have 2 docx files that I am working with. One docx file contains text information of a product (start serial number, length, width, and height). The other docx file contains a sticker label with an image and all of the text information from the first file.
This is what I do currently:
I open the first docx file and copy all of the text information (serial, length, width, and height)
Then I paste each info into the second docx file that contains the formatted label.
If I need to make more than one label, I copy the label and increment the serial number by 1.
This takes a lot of time when making several labels for different products. My goal is to come up with an easier way to take data from one docx and inject it into the other, and to generate more labels when needed.
My first thought was to extract the docx file to get its XML contents, then read the data using JavaScript, C++, or any other language, ask the user how many labels to generate, manipulate the XML, and repack it as a docx file.
Then I thought about trying the Microsoft Office "mail merge" feature, but I have never done this before.
I would like to know if anyone has any suggestions for an easy way to import data from one docx file and generate labels in another.
I am open to any suggestion.
Also, I am not a professional programmer. I am an undergraduate computer engineering student with some experience in C, C++, Java, JavaScript, Python, MIPS assembly, and PHP.
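The "copy the label and increment the serial number" step is easy to script on its own, whichever way the labels end up being produced. A small sketch, assuming the serial ends in digits and the product data is held in a dict with a `serial` key (both assumptions, as are the function names):

```python
import re


def increment_serial(serial, offset):
    """Add `offset` to the trailing digits of a serial, keeping zero padding."""
    match = re.search(r"(\d+)$", serial)
    if not match:
        raise ValueError(f"serial has no trailing number: {serial!r}")
    digits = match.group(1)
    number = str(int(digits) + offset).zfill(len(digits))
    return serial[:match.start()] + number


def make_labels(product, count):
    """Return `count` copies of the product data with consecutive serials."""
    return [
        dict(product, serial=increment_serial(product["serial"], i))
        for i in range(count)
    ]


if __name__ == "__main__":
    # Hypothetical product data for illustration.
    product = {"serial": "SN-0098", "length": "10", "width": "4", "height": "2"}
    for label in make_labels(product, 3):
        print(label)
```

The resulting rows could be written to a CSV file and used as a mail merge data source, or substituted into the label XML directly.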
The only open-source (and probably easier to come by) solution I know of is:
http://poi.apache.org/
http://poi.apache.org/document/quick-guide-xwpf.html
This is a good bet when it comes to speed and it is free software.
But if you open a file, alter it, and save it again, the result can be flaky: the formatting can be slightly off, at least in my tests with the pptx counterpart.
I reckon that if you have user interaction (a web page?) in order to create the document, you can build a small HTTP API around the library.
There is also: http://www.docx4java.org/trac/docx4j - which I have not tested yet.
You can also go the C#/Redmond way: How do I create the .docx document with Microsoft.Office.Interop.Word?
The Interop approach (the second example in the first answer to the question above) gives the best results for formatting accuracy: a file opened with Interop looks the same after you alter and save it. However, it is a poor fit for serving user requests, because it starts a separate MS Office process, and in my experience that is not something to rely on. If you want to generate these files as a batch in a single user session, it will deliver a good result.
I cannot comment on the "OpenXML SDK" library described in the above SO question.
What about Open XML? See https://www.youtube.com/watch?v=rMnEl6JZ7I8 and the developer website http://openxmldeveloper.org/.
On that site you will find SDKs and resources for:
Open XML SDK for JavaScript: http://openxmldeveloper.org/wiki/w/wiki/open-xml-sdk-for-javascript.aspx. Demo: http://openxmldeveloper.org/blog/b/openxmldeveloper/p/openxmlsdkjs_demo.aspx
Open XML and Java http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2006/11/21/openxmlandjava.aspx
.Net Resources http://openxmldeveloper.org/resources/dotnet/m/cc/default.aspx
I'm using the Open XML SDK for JavaScript.
I downloaded it from here.
I create a Word document with this SDK and save it to the desktop.
Now I want to modify the content of this document.
For example, if the text contains the word "firstname", I want to replace it with "John", and so on.
Second, before saving the document I want to put it into a .rar or .zip archive and save that archive to the desktop.
Can somebody help me?
If the text contains the word "firstname", I want to replace it with "John", and so on.
Search and Replace Method
Second, before saving the document I want to put it into a .rar or .zip archive and save that archive to the desktop.
Rename the .docx file to .zip, since they are one and the same:
every Open XML file is essentially a Zip archive containing many other files. Office-specific data is stored in multiple XML files inside that archive. This is in direct contrast with the old WordML and SpreadsheetML formats which were single, non-compressed XML files. Although more complex, the new approach offers a few benefits:
References
Open XML search and replace (searchandreplace.zip)
OpenXmlSimpleType members
Word, PowerPoint, and Excel files are themselves just zip archives with a different extension. If you want to change strings in an Excel file, look at the shared strings table, where all the strings are stored and referenced by the sheets.
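Because these files are zip archives, a plain-text replacement inside one XML part can be sketched with Python's `zipfile` module. This is a simplified illustration, not a full Open XML implementation: the function name `replace_in_package` is made up, and it assumes the text to replace appears contiguously in the XML. Word often splits a visible word across several runs, in which case a simple string replace will miss it. For a .docx the main part is `word/document.xml`; for an .xlsx the shared strings live in `xl/sharedStrings.xml`.

```python
import io
import zipfile
from xml.sax.saxutils import escape


def replace_in_package(source, target, member, replacements):
    """Copy an Open XML package, applying text replacements to one XML part.

    `source` and `target` may be file paths or file-like objects and must
    not refer to the same file.
    """
    with zipfile.ZipFile(source) as zin, \
         zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == member:
                text = data.decode("utf-8")
                for old, new in replacements.items():
                    # Escape &, < and > so the part stays well-formed XML.
                    text = text.replace(old, escape(new))
                data = text.encode("utf-8")
            zout.writestr(item, data)


if __name__ == "__main__":
    # Build a tiny stand-in package in memory; a real .docx has more parts.
    source = io.BytesIO()
    with zipfile.ZipFile(source, "w") as package:
        package.writestr("word/document.xml", "<w:t>firstname</w:t>")
    result = io.BytesIO()
    replace_in_package(source, result, "word/document.xml", {"firstname": "John"})
    with zipfile.ZipFile(result) as package:
        print(package.read("word/document.xml").decode("utf-8"))
```

All other parts are copied unchanged, so styles, images, and relationships in the package are preserved.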