I am trying to get the page count of a Word document in Open XML format. I have got as far as having the document's XML structure in a readable form, but I can't find where the page count property is stored. Any guidance would be appreciated.
UPDATE:
The page count and other metadata live in the docProps/app.xml file inside the .docx archive. All you have to do is extract that file and pull out the value you want. I got the page count like this:
const XMLData = fs.readFileSync(data, { encoding: "utf-8" });
let pageCount = XMLData.split("<Pages>")
.join(",")
.split("</Pages>")
.join(",")
.split(",")[1];
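The split/join chain above works, but a regular expression states the intent more directly. Here is a minimal sketch of that idea; the function name and the sample XML are made up for illustration, and it assumes the file contents have already been read into a string.

```javascript
// Pull the <Pages> value out of the contents of docProps/app.xml.
// Returns null when the element is absent (it is optional in the schema).
function extractPageCount(appXml) {
  const match = /<Pages>\s*(\d+)\s*<\/Pages>/.exec(appXml);
  return match ? Number(match[1]) : null;
}

// Hypothetical sample of an app.xml payload, trimmed for illustration.
const sampleXml =
  "<Properties><Template>Normal.dotm</Template><Pages>3</Pages><Words>120</Words></Properties>";

console.log(extractPageCount(sampleXml)); // 3
```

Returning null rather than undefined or NaN makes the "no page count stored" case easy to check for in the calling code.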
const fs = require("fs");
const path = require("path");
const axios = require("axios");
const unzipper = require("unzipper"); // used by unzipTheFile below
let noRepeatDocs = ['somewebsite.com/somedocument.docx'];
const writeTheFile = async (data) => {
fs.writeFileSync("read_word_doc", data);
};
const unzipTheFile = async (data) => {
fs.createReadStream(data)
.pipe(unzipper.Parse())
.on("entry", function (entry) {
const fileName = entry.path;
const type = entry.type;
const size = entry.vars.uncompressedSize;
if (fileName === "word/document.xml") {
entry.pipe(fs.createWriteStream("./output"));
} else {
entry.autodrain();
}
});
};
const getWordBuffer = async (arr) => {
for (const wordDocLink of arr) {
const response = await axios({
url: wordDocLink,
method: "GET",
responseType: "arraybuffer",
});
const data = response.data;
await writeTheFile(data);
await unzipTheFile("./read_word_doc");
}
};
getWordBuffer(noRepeatDocs);
I am trying to build a component that takes a PDF, either from a file input or one that was already uploaded, extracts pages from it, and uploads the result again.
When the file is chosen from a file input (a file on my computer),
I use this:
const handleFileChange = async (event) => {
const file = event.target.files[0];
setFiles(event.target.files[0])
const fileName = event.target.files[0].name
setFileName(fileName);
const fileReader = new FileReader();
fileReader.onload = async () => {
const pdfBytes = new Uint8Array(fileReader.result);
const pdfDoc = await PDFDocument.load(pdfBytes);
setPdfDoc(pdfDoc);
setPdfBlob(pdfBytes)
};
fileReader.readAsArrayBuffer(file);
setShowPdf(true)
};
This gives us a pdfDoc and a Uint8Array.
I then use the pdfDoc to get the pages and extract a new PDF file.
This works fine.
Now, when selecting a file that was already uploaded,
I use this to call the API and get the file:
const handleGetFile = async (url) => {
const headers = {
Authorization: "Bearer " + (localStorage.getItem("token")),
Accept: 'application/pdf'
}
await axios.put(`${process.env.NEXT_PUBLIC_API_URL}getPdfFileBlob`, {
pdfUrl: `https://handle-pdf-photos-project-through-compleated-task.s3.amazonaws.com/${url}`
}, { responseType: 'arraybuffer', headers }).then((res) => {
const handlePdf = async () => {
const uint8Array = new Uint8Array(res.data);
const pdfBlob = new Blob([uint8Array], { type: 'application/pdf' });
setPdfBlob(uint8Array)
// setPdfDoc(pdfBlob) .....? how do I create a pdfDoc from the Uint8Array?
}
handlePdf()
}).catch((err) => {
console.log(err)
})
}
This is the endpoint I am calling:
app.put('/getPdfFileBlob',async function(req,res){
try {
console.log(req.body.pdfUrl)
const url =req.body.pdfUrl;
const fileName = 'file.pdf';
const file = fs.createWriteStream(fileName);
https.get(url, (response) => {
response.pipe(file);
file.on('finish', () => {
file.close();
// Serve the file as a response
const pdf = fs.readFileSync(fileName);
res.setHeader('Content-Type', 'application/pdf');
res.setHeader( 'Content-Transfer-Encoding', 'Binary'
);
res.setHeader('Content-Disposition', 'inline; filename="' + fileName + '"');
res.send(pdf);
});
});
} catch (error) {
res.status(500).json({success:false,msg:"server side err"})
}
})
After getting the file, here is what I am trying to do:
const handlePageSelection = (index) => {
setSelectedPages(prevSelectedPages => {
const newSelectedPages = [...prevSelectedPages];
const pageIndex = newSelectedPages.indexOf(index);
if (pageIndex === -1) {
newSelectedPages.push(index);
} else {
newSelectedPages.splice(pageIndex, 1);
}
return newSelectedPages;
});
};
const handleExtractPages = async () => {
for (let i = pdfDoc.getPageCount() - 1; i >= 0; i -= 1) {
if (!selectedPages.includes(i + 1)) {
pdfDoc.removePage(i);
}
}
await pdfDoc.save();
};
In the first case, where I pick the PDF file from my computer, I get a pdfDoc.
[Screenshot: console output of pdfDoc and pdfBlob]
When I select an already uploaded file, I can't find a way to turn the Uint8Array buffer into a pdfDoc.
[Screenshot: console output of pdfBlob, with no pdfDoc]
What I want is to turn the pdfBlob into a PDFDocument, or to get the PDF document from the array buffer, so that I can use getPages on it.
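One possible way to close that gap, sketched under assumptions: the PDFDocument.load call already used in handleFileChange accepts raw bytes, so the same call should work on the bytes that axios returns. The helper names below are hypothetical, and pdfLib stands for the imported pdf-lib module, passed in as a parameter so the sketch carries no import of its own.

```javascript
// Normalise binary data (ArrayBuffer, Buffer, or any typed array) to a Uint8Array.
function toUint8Array(data) {
  if (data instanceof Uint8Array) return data;
  if (data instanceof ArrayBuffer) return new Uint8Array(data);
  if (ArrayBuffer.isView(data)) {
    return new Uint8Array(data.buffer, data.byteOffset, data.byteLength);
  }
  throw new TypeError("expected binary data");
}

// Load a document from the bytes, exactly as the file-input path does.
// `pdfLib` is assumed to be the pdf-lib module, e.g. require("pdf-lib").
async function loadPdfDoc(pdfLib, data) {
  return pdfLib.PDFDocument.load(toUint8Array(data));
}
```

Inside handlePdf this would be used roughly as setPdfDoc(await loadPdfDoc(pdfLib, res.data)), next to the existing setPdfBlob call.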
I am trying to download a file, but it does not work after download. I get the files, but each one is 1 KB, which is not the actual file size.
If I use fetchResp.text(), I am not able to open the file.
Here is the full code.
I think the problem could be here: return await fetchResp.text();
This is an example. It is also important to set cookies, because I want to download data behind a login.
How do I handle Puppeteer cookies with fetch?
What if I put the fetch call outside page.evaluate? Will { credentials: "include" } still work?
Thanks in advance for your help.
const puppeteer = require("puppeteer");
const cheerio = require("cheerio");
const fs = require("fs");
(async () => {
const browser = await puppeteer.launch({
args: ["--no-sandbox"],
headless: false,
slowMo: 30,
});
const page = await browser.newPage();
await page.goto(
"https://file-examples.com/index.php/sample-documents-download/sample-xls-download/"
);
const content = await page.content();
const $ = cheerio.load(content);
const listings = $("#table-files > tbody > tr:has(a)")
.map((index, element) => {
const URL = $(element).find("a").attr("href");
const Filename = $(element).find("a").attr("href").split("/").pop();
//.replace(/^.*[\\\/]/g, "");
const name = $(element)
.find("td:nth-child(1)")
.text()
.trim()
.replace("\n", "");
return {
Filename,
URL,
};
})
.get();
for (let val of listings) {
const downloadUrl = val.URL;
const Filename = val.Filename;
console.log(val);
const downloadedContent = await page.evaluate(async (downloadUrl) => {
const fetchResp = await fetch(downloadUrl, { credentials: "include" });
return await fetchResp.text();
}, downloadUrl);
fs.writeFile(`./${Filename}`, downloadedContent, () =>
console.log("Wrote file")
);
}
await page.close();
await browser.close();
})();
The main problem is that you are reading the file contents as text. That would be fine for a plain text file, but an Excel file is binary, so you need a Blob or an ArrayBuffer, and neither can be returned from page.evaluate. See https://github.com/puppeteer/puppeteer/issues/3722
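To see why the text route breaks the file, here is a small Node-only sketch. It is not part of the original code: it round-trips a few bytes (the start of the legacy .xls signature plus two extra bytes) through a UTF-8 string and through base64. If you did want to stay inside page.evaluate, returning a base64 string and decoding it in Node would be one option, though that wiring is an assumption and is not shown here.

```javascript
// Bytes that are not valid UTF-8, as found in any binary file such as .xls.
const original = Buffer.from([0xd0, 0xcf, 0x11, 0xe0, 0xff, 0x00]);

// Decoding as text replaces invalid sequences, so the round trip is lossy.
const viaText = Buffer.from(original.toString("utf8"), "utf8");

// Base64 is plain ASCII, so it survives being passed around as a string.
const viaBase64 = Buffer.from(original.toString("base64"), "base64");

console.log(viaText.equals(original)); // false
console.log(viaBase64.equals(original)); // true
```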
You don't need to fetch the Excel files through Puppeteer's page.evaluate at all. Once you have collected the links, you can request them directly from Node with the https module and stream the responses to files, which is simpler here and takes less code. You'll need these modifications:
First require the https module
const https = require('https');
Then close puppeteer after getting the links, since we don't need it anymore
.get();
await page.close();
await browser.close();
Call the function here, when looping through the links
for (let val of listings) {
const downloadUrl = val.URL;
const Filename = val.Filename;
console.log(val);
var file = await getFile(downloadUrl, Filename);
}
Finally, you need to create a function to read/write the file, outside of your main code block
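function getFile(downloadUrl, Filename) {
  // return a promise so that `await getFile(...)` waits for the download
  return new Promise((resolve, reject) => {
    var writeStream = fs.createWriteStream(Filename);
    // 'finish' fires once all data has been flushed to the file
    writeStream.on('finish', resolve);
    writeStream.on('error', reject);
    https.get(downloadUrl, function(res) {
      res.pipe(writeStream);
      res.on('end', () => {
        console.log('No more data in response.');
      });
    }).on('error', reject);
  });
}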
Full snippet
const puppeteer = require('puppeteer');
const cheerio = require("cheerio");
const fs = require("fs");
const https = require('https');
(async () => {
const browser = await puppeteer.launch({
args: ["--no-sandbox"],
headless: false,
slowMo: 30,
});
const page = await browser.newPage();
await page.goto(
"https://file-examples.com/index.php/sample-documents-download/sample-xls-download/"
);
const content = await page.content();
const $ = cheerio.load(content);
const listings = $("#table-files > tbody > tr:has(a)")
.map((index, element) => {
const URL = $(element).find("a").attr("href");
const Filename = $(element).find("a").attr("href").split("/").pop();
//.replace(/^.*[\\\/]/g, "");
const name = $(element)
.find("td:nth-child(1)")
.text()
.trim()
.replace("\n", "");
return {
Filename,
URL,
};
})
.get();
await page.close();
await browser.close();
for (let val of listings) {
const downloadUrl = val.URL;
const Filename = val.Filename;
console.log(val);
//call the function with each link and filename
var file = await getFile(downloadUrl, Filename);
}
})();
//send request and stream the response to a file
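function getFile(downloadUrl, Filename) {
  // return a promise so that `await getFile(...)` waits for the download
  return new Promise((resolve, reject) => {
    var writeStream = fs.createWriteStream(Filename);
    // 'finish' fires once all data has been flushed to the file
    writeStream.on('finish', resolve);
    writeStream.on('error', reject);
    https.get(downloadUrl, function(res) {
      res.pipe(writeStream);
      res.on('end', () => {
        console.log('No more data in response.');
      });
    }).on('error', reject);
  });
}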
EDIT: I saw your comment. You can send cookies by modifying the GET request like this, but keep in mind that cookies are scoped to their domain:
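function getFile(downloadUrl, Filename) {
  var url = new URL(downloadUrl);
  var options = {
    hostname: url.hostname,
    // keep the query string, if any, as part of the request path
    path: url.pathname + url.search,
    method: 'GET',
    headers: {
      'Cookie': 'myCookie=myvalue'
    }
  };
  // return a promise so that `await getFile(...)` waits for the download
  return new Promise((resolve, reject) => {
    var writeStream = fs.createWriteStream(Filename);
    writeStream.on('finish', resolve);
    writeStream.on('error', reject);
    var req = https.request(options, function(res) {
      res.pipe(writeStream);
      res.on('end', () => {
        console.log('No more data in response.');
      });
    });
    req.on('error', reject);
    req.end();
  });
}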
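Rather than hard-coding the Cookie value, you could probably build it from the cookies Puppeteer already holds after logging in. The sketch below assumes objects shaped like the ones returned by await page.cookies() (each with a name and a value). The helper name and the sample cookies are made up for illustration.

```javascript
// Turn an array of { name, value } cookie objects into a Cookie header value.
function toCookieHeader(cookies) {
  return cookies.map(({ name, value }) => `${name}=${value}`).join("; ");
}

// Hypothetical cookies, shaped like the objects returned by `await page.cookies()`.
const cookies = [
  { name: "session", value: "abc123", domain: "example.com" },
  { name: "lang", value: "en", domain: "example.com" },
];

console.log(toCookieHeader(cookies)); // session=abc123; lang=en
```

The resulting string would go in the 'Cookie' header of the request options, and the cookies have to be read before the browser is closed.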
I am attempting to grab a PDF stored in Azure Blob Storage via a Node backend and then serve that PDF file to a React frontend. I am using Microsoft's @azure/storage-blob with a BlockBlobClient, but every example I find online converts the readableStreamBody to a string. The blob has a content type of application/pdf. I've tried passing the readableStreamBody and the raw output to the frontend, but those result in broken PDFs. I also followed the documentation online, converted it to a string, and passed that to the frontend. That produced a PDF that would open and had the right number of pages but was completely blank.
Node.js Code on the Backend
app.get('/api/file/:company/:file', (req, res) => {
const containerClient = blobServiceClient.getContainerClient(req.params.company);
const blockBlobClient = containerClient.getBlockBlobClient(req.params.file);
blockBlobClient.download(0)
.then(blob => streamToString(blob.readableStreamBody))
.then(response => res.send(response))
});
FrontEnd Code
getFileBlob = (company,file) => {
axios(`/api/file/${company}/${file}`, { method: 'GET', responseType: 'blob'})
.then(response => {
const file = new Blob(
[response.data],
{type: 'application/pdf'});
const fileURL = URL.createObjectURL(file);
window.open(fileURL);
})
.catch(error => {
console.log(error);
});
}
This might help you; it's working for me.
Node
var express = require('express');
const { BlobServiceClient } = require('@azure/storage-blob');
var router = express.Router();
const AZURE_STORAGE_CONNECTION_STRING =
'YOUR_STRING';
async function connectAzure() {
// Create the BlobServiceClient object which will be used to create a container client
const blobServiceClient = BlobServiceClient.fromConnectionString(
AZURE_STORAGE_CONNECTION_STRING
);
const containerName = 'filestorage';
const blobName = 'sample.pdf';
console.log('\nConnecting container...');
console.log('\t', containerName);
// Get a reference to a container
const containerClient = blobServiceClient.getContainerClient(containerName);
// Get a block blob client
const blockBlobClient = containerClient.getBlockBlobClient(blobName);
for await (const blob of containerClient.listBlobsFlat()) {
console.log('\t', blob.name);
}
const downloadBlockBlobResponse = await blockBlobClient.download(0);
const data = await streamToString(downloadBlockBlobResponse.readableStreamBody)
return data;
}
async function streamToString(readableStream) {
return new Promise((resolve, reject) => {
const chunks = [];
readableStream.on('data', data => {
chunks.push(data.toString());
});
readableStream.on('end', () => {
resolve(chunks.join(''));
});
readableStream.on('error', reject);
});
}
router.get('/', async function(req, res, next) {
const data = await connectAzure();
res.status(200).send({data});
});
module.exports = router;
Front-end
function createFile() {
fetch('/createfile').then(res => {
res.json().then(data => {
var blob = new Blob([data.data], { type: 'application/pdf' });
var fileURL = URL.createObjectURL(blob);
if (filename) {
if (typeof a.download === 'undefined') {
window.location.href = fileURL;
} else {
window.open(fileURL, '_blank');
}
}
})
}).catch(err => console.log(err))
}
HTML
<body><h1>Express</h1><p>Welcome to Express</p><button onclick="createFile()">Create File</button></body>
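One caveat on the streamToString helper above: it decodes every chunk as text, which is lossy for binary content and is a likely cause of the blank pages described in the question. A binary-safe variant might look like the sketch below. The function name is my own, and the Express usage in the trailing comments is an assumption about how it would be wired in.

```javascript
// Collect a readable stream into a single Buffer without decoding it as text.
function streamToBuffer(readableStream) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    readableStream.on("data", (chunk) => {
      // Keep raw bytes; calling chunk.toString() here is what corrupts binary data.
      chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
    });
    readableStream.on("end", () => resolve(Buffer.concat(chunks)));
    readableStream.on("error", reject);
  });
}

// Hypothetical Express handler body showing where the helper would slot in:
// const blob = await blockBlobClient.download(0);
// res.type("application/pdf").send(await streamToBuffer(blob.readableStreamBody));
```

With the bytes sent as-is, a frontend that requests the route with responseType: 'blob' (as in the question) can build its Blob directly from the response.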
I'm trying to scrape a website that has a "load more" button, but I can't get a recursive function to work with Nightmare. My code is something like this:
const Nightmare = require('nightmare');
const nightmare = Nightmare({
show:true
});
const request = require('request');
const cheerio = require('cheerio');
let url = 'https://www.housers.com/es/proyectos/avanzado';
let propertyArray = [];
var getThePage = function() {
nightmare
.goto('https://www.housers.com/es/proyectos/avanzado')
.wait(1500)
.click('#loadMore')
.evaluate(() =>{
return document.querySelector('.all-info').innerHTML;
})
.end()
.then((result) => {
let $ = cheerio.load(result);
let loadMore = $('#loadMore')
if (loadMore) {
getThePage();
}
return result
})
.catch((error) => {
console.error('Search failed:', error);
});
}
getThePage()
Is there a way to do it with this approach, or do you have any other idea?
If you want to scrape the data in the table, you don't need Nightmare. In the browser's network tab, you can see that the page calls this endpoint:
https://www.housers.com/es/proyectos/avanzado/scroll
with a page number and a page size; let's take 200 per page (I don't know whether that is above the limit).
Then you just have to parse the HTML and put the data in an array:
const axios = require('axios');
const querystring = require('querystring');
const cheerio = require('cheerio');
const entities = require("entities");
const url = 'https://www.housers.com/es/proyectos/avanzado/scroll';
const prices = [];
function doRequest(url, page){
return axios.post(url + '?page=' + page + '&size=200', querystring.stringify({
word: "",
country: "",
type: "",
order: "STOCK_PRICE_VARIATION",
orderDirection: "DESC"
}));
}
async function getPrices() {
var empty = false;
var page = 0;
while (!empty) {
//call API
console.log("GET page n°" + page);
var res = await doRequest(url, page);
page++;
//parse HTML
const $ = cheerio.load(res.data,{
xmlMode: true,
normalizeWhitespace: true,
decodeEntities: true
});
if (res.data.trim() !== ""){
//extract prices : put it in array
$('tr').map(function(){
var obj = [];
$(this).children('td').map(function(){
obj.push(entities.decodeHTML($(this).text().trim()));
});
prices.push(obj);
});
}
else {
empty = true;
}
}
console.log(prices);
console.log("total length : " + prices.length);
}
getPrices();
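As a side note, the paged URL and the form body in doRequest can also be built with Node's built-in URL and URLSearchParams instead of string concatenation and the legacy querystring module. This is only a sketch of that alternative; the helper names are made up, and the field values are the ones from the request above.

```javascript
// Build the paged endpoint URL; `page` and `size` mirror the query string used above.
function buildPageUrl(baseUrl, page, size) {
  const url = new URL(baseUrl);
  url.searchParams.set("page", String(page));
  url.searchParams.set("size", String(size));
  return url.toString();
}

// Encode the POST form fields as application/x-www-form-urlencoded.
function buildFormBody(fields) {
  return new URLSearchParams(fields).toString();
}

console.log(buildPageUrl("https://www.housers.com/es/proyectos/avanzado/scroll", 2, 200));
// https://www.housers.com/es/proyectos/avanzado/scroll?page=2&size=200
console.log(buildFormBody({ word: "", order: "STOCK_PRICE_VARIATION", orderDirection: "DESC" }));
// word=&order=STOCK_PRICE_VARIATION&orderDirection=DESC
```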
This is the module that collects and exports the async data: scraper.js
const express = require('express')
const cheerio = require('cheerio')
const request = require("tinyreq")
const fs = require('fs')
const _ = require('lodash')
const uuid = require('uuid/v4')
const async = require('async')
const mental_models = {
url: 'https://www.farnamstreetblog.com/mental-models/',
data: {}
}
const decision_making = {
url: 'https://www.farnamstreetblog.com/smart-decisions/',
data: {}
}
const cognitive_bias = {
url: 'https://betterhumans.coach.me/cognitive-bias-cheat-sheet-55a472476b18',
data: {}
}
const DATA_URLS = [mental_models, decision_making, cognitive_bias]
const filterScrape = async (source, params) => {
let filtered_data = {
topics: [],
content: [],
additional_content: []
}
let response = await scrape(source)
try {
let $ = cheerio.load(response)
params.forEach((elem) => {
let headers = ['h1', 'h2', 'h3']
if ($(elem) && headers.includes(elem)) {
let topic = {}
let content = {}
let id = uuid()
topic.id = id
topic.text = $(elem).text()
if ($(elem).closest('p')) {
content.text = $(elem).closest('p').text()
content.id = id
}
filtered_data.topics.push(topic)
filtered_data.content.push(content)
} else if ($(elem) && !headers.includes(elem)) {
let content = {}
let id = uuid()
content.text = $(elem).text()
content.id = id
filtered_data.additional_content.push(content)
} else {
}
})
}
catch (err) {
console.log(err)
}
return filtered_data
}
const scrape = (source) => {
return new Promise((resolve, reject) => {
request(source.url, function (err, body) {
if (err) {
reject(err)
return
}
resolve(body)
})
})
}
const DATA = _.map(DATA_URLS, async (source) => {
let params = ['h1', 'h2', 'h3', 'p']
let new_data = await filterScrape(source, params)
try {
source.data = new_data
}
catch (err) {
console.log(err)
}
})
module.exports = DATA
This is the module that imports the data: neural.js
const brain = require('brain')
const neural_net = new brain.NeuralNetwork()
const DATA = require('./scraper')
console.log(DATA)
Obviously not much is going on here; I've removed the rest of the code since the variable doesn't resolve. When logged, it shows a promise, but the promise does not resolve. In the scraper module itself, however, the promise is logged and then resolves. What gives? Should I import a function that resolves the data?
Of course it would be best to import that function, however it won't change the issue in your code which is here:
const DATA = _.map(DATA_URLS, async (source) => {
Lodash doesn't support async iteration: _.map with an async callback just returns an array of promises, so you need some other method. One option is to use the newest Node.js version (10.x) and its async iteration, but that won't use the full power of asynchronous code.
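Another plain-JavaScript option, not mentioned above, is to wrap the mapped promises in Promise.all and export a function that returns the combined promise. The sketch below uses a stand-in for filterScrape and made-up URLs, so treat the names as placeholders rather than the real scraper.

```javascript
// Stand-in for the real scraper: resolves with a fake result for each source.
async function fakeFilterScrape(source) {
  return { topics: [], from: source.url };
}

const sources = [{ url: "https://example.com/a" }, { url: "https://example.com/b" }];

// Mapping with an async callback gives an array of pending promises, not data.
const pending = sources.map((source) => fakeFilterScrape(source));
console.log(pending[0] instanceof Promise); // true

// Promise.all turns that array into one promise for all the results.
function scrapeAll(list) {
  return Promise.all(
    list.map(async (source) => ({ url: source.url, data: await fakeFilterScrape(source) }))
  );
}

scrapeAll(sources).then((results) => console.log(results.length)); // 2
```

The importing module would then call the exported function and await it, instead of requiring a value that is still an array of promises at import time.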
You can also use scramjet, a framework my company supports. The code above would then take the following form:
const {DataStream} = require("scramjet");
const DATA_URLS = [mental_models, decision_making, cognitive_bias];
module.exports = async () => DataStream.fromArray(DATA_URLS)
.setOptions({maxParallel: 2}) // if you need to limit that at all.
.map(async (source) => {
let params = ['h1', 'h2', 'h3', 'p']
let data = await filterScrape(source, params);
return { url: source.url, data };
})
.toArray();
The other file would take the following form:
const brain = require('brain')
const neural_net = new brain.NeuralNetwork()
const scraper = require('./scraper');
(async () => {
const DATA = await scraper();
console.log(DATA); // or do whatever else you were expecting...
})();