read information off website and store in excel file - javascript

I am trying to build an application that, when provided with a .txt file full of ISBN numbers, will visit the isbn.nu page for each ISBN by simply appending the ISBN to the URL, i.e. www.isbn.nu/your-isbn-number.
After pulling up the page, I want to scan it for information about the book and store that in an Excel file.
I was thinking about creating a file stream from the URL in Java, but I am not really sure how to extract the information from the HTML page. Storing the information will be done using the JExcel Java package.
My best guess would be using JavaScript to extract the information, but I don't know how to call JavaScript from my Java program.
Is my idea plausible? If not, what do you suggest I do?
My goal: retrieve information from an HTML page and store it in an Excel file for each ISBN in a text file. There can be any number of ISBNs in a text file.
This isn't homework, by the way. I am simply doing this for an organization that donates books to Sudan. Currently they have 5 people cataloging these books manually, and I am one of them.

Jsoup is a useful tool for parsing a web page and getting data from it. You can do it in Java and it's pretty easy.
You can parse the text file, build the URL as a string, fetch the page with Jsoup, then use Jsoup to parse out the information using the HTML tags on the page. Then you can store it however you want. You really don't need to use JavaScript at all if you're more comfortable with Java.
Example for reading a page and parsing it with Jsoup:
// Fetch and parse the page (Jsoup handles both the HTTP request and the HTML parsing)
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
// Select elements with a CSS selector; here, bold links inside the "In the news" section
Elements newsHeadlines = doc.select("#mp-itn b a");

Use a div into which you load your link (the example at http://api.jquery.com/load/ shows how to do that).
After the load is complete, you can check what the names of the divs or spans used in the web page are, and get that content with val (http://api.jquery.com/val/) or text (http://api.jquery.com/text/).
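For instance, a minimal sketch of that approach; the file name and selectors here are assumptions, and note that .load() only works for pages on the same origin:
// Load a page fragment into a div; the callback runs once the content arrives.
$('#result').load('/book-page.html', function () {
    // Read the text of an element inside the loaded content (made-up selector)
    var title = $('#result .book-title').text();
    console.log(title);
});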

Here is text from the main page of www.isbn.nu:
Please note that isbn.nu is designed for manual searching by individuals. It is not intended as an information resource for automated retrieval, nor as a research tool for companies. isbn.nu reserves the right to deny access based on excessive requests.
Why not just use the free Google Books API, which returns book details in XML format? There are many classes available in Java to parse XML feeds, and it would make your life much easier.
See http://code.google.com/apis/books/ for more info.
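As a rough sketch, the current v1 of the API can also be queried by ISBN and returns JSON; the ISBN below is just an example:
// Query the Google Books API by ISBN (the v1 endpoint returns JSON)
$.getJSON('https://www.googleapis.com/books/v1/volumes?q=isbn:9780316769488', function (data) {
    if (data.totalItems > 0) {
        var info = data.items[0].volumeInfo;
        console.log(info.title, info.authors, info.publishedDate);
    }
});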

Here are the steps needed:
Create a cURL request (you can use multiple cURL requests)
Get the body data
Parse the data
Make the Excel file
You can read HTML information using this guide.
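A rough Node.js sketch of those four steps, with a plain HTTP request standing in for cURL and a CSV file (which Excel opens) standing in for a native Excel file; the ISBN and the title regex are assumptions:
var http = require('http');
var fs = require('fs');

// Step 1: request the page (example ISBN)
http.get('http://www.isbn.nu/9780316769488', function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });   // step 2: collect the body data
    res.on('end', function () {
        var m = body.match(/<title>([^<]*)<\/title>/);     // step 3: naive parse; a real HTML parser is safer
        if (m) {
            // Step 4: append a CSV row that Excel can open
            fs.appendFileSync('books.csv', '"' + m[1].replace(/"/g, '""') + '"\n');
        }
    });
});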

A simple solution might be to use a Google Docs spreadsheet function like ImportXML(URL,path-expression).
More information and examples here:
http://www.seerinteractive.com/blog/importxml-cookbook/
http://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/
http://blog.ouseful.info/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/
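For example, an ImportXML call against an isbn.nu page might look like the following; the XPath here is an assumption and would need to match the actual page markup:
=IMPORTXML("http://www.isbn.nu/9780316769488", "//h1")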

Related

I can't find the XPath on this website, or maybe the syntax is wrong

I'm trying to scrape data from the URL https://drive.getbigger.io/#/stores, but I can't find the XPath of the text I want to extract, which is the producers' offers.
First I tried the IMPORTXML function in Google Sheets:
=IMPORTXML(A1;"/html/body/flt-ruler-host/div[23]/p")
and it gave me an N/A error: "the imported content is empty".
So I tried to scrape the website with add-ons and ParseHub, and every time it gave me a .csv file in which I can't find the data I want to export.
I also can't find the right XPath for the data I would like to scrape; when I use the inspection tool, the data isn't in the <body> part.
However, the XPath I use in my IMPORTXML function is some code I found in the <body> part which is close to the text I'd like to extract (the producers' offers).
It seems that the XPath I am looking for is linked in the <head> part with some JavaScript code; also, when I hover over the page with the selection tool in order to scrape the data, it selects the whole page, maybe because there is a scrolling <div>.
So I wonder if the website uses some kind of protection against scraping.
Please tell me:
Can I find the right XPath in order to scrape with the IMPORTXML function?
Should I extract the data with a Python script instead?
If the website blocks my attempts, how can I work around that?
You won't be able to scrape anything with the IMPORTXML formula, since the website uses dynamic rendering (JavaScript).
So yes, Python + Selenium (or another combination) could do the job. The website won't block you if you follow some rules (switch the user-agent, add pauses between requests).
You would probably need these XPaths:
Product description:
//p[1][string-length(text())>5][parent::flt-dom-canvas]
Product price:
//p[3][contains(text(),"€") and not (contains(text(),","))][parent::flt-dom-canvas]
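The flow would look something like the sketch below, shown here with Selenium's JavaScript bindings (the same steps apply with the Python bindings); the fixed sleep is a crude stand-in for a proper explicit wait:
const { Builder, By } = require('selenium-webdriver');

(async function scrape() {
    // Launch a real browser so the JavaScript-rendered content actually appears
    const driver = await new Builder().forBrowser('chrome').build();
    try {
        await driver.get('https://drive.getbigger.io/#/stores');
        await driver.sleep(5000);  // crude wait for the dynamic rendering to finish
        const descriptions = await driver.findElements(
            By.xpath('//p[1][string-length(text())>5][parent::flt-dom-canvas]'));
        for (const el of descriptions) {
            console.log(await el.getText());
        }
    } finally {
        await driver.quit();
    }
})();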
However, I think the most elegant way to get the data is probably to use the API the website relies on. With Google Sheets and a custom ImportJSON script, you can pull the results straight into the sheet (for example, with "fromage" as the query).
It won't work out of the box: you'll have to modify some parts of the script, since it doesn't load JSON that is fetched with a POST request needing headers. In a nutshell, you need to construct the payload part, add headers to the request ("Bearer XXXXX"), and add a parameter to a function to retrieve the results.
All this depends on your objective and your expected output.
EDIT: For references (constructing the payload, adding parameters) you can read:
https://developers.google.com/apps-script/reference/url-fetch/url-fetch-app#fetchurl,-params
Also look at the network tab of your browser's developer tools in order to find the URL of the API and the correct parameters to send.
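In Apps Script, the modified fetch would look roughly like this; the API URL, payload shape, and token are placeholders for whatever the network tab reveals:
// Hedged sketch of the UrlFetchApp call; URL, payload, and token are placeholders.
function fetchApiJson() {
  var options = {
    method: 'post',
    contentType: 'application/json',
    headers: { Authorization: 'Bearer XXXXX' },     // bearer token taken from the network tab
    payload: JSON.stringify({ query: 'fromage' })   // hypothetical payload shape
  };
  var response = UrlFetchApp.fetch('https://example.com/api/search', options);
  return JSON.parse(response.getContentText());
}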

Text file data into a webpage for graphing

I am new to web dev, and I have a text file that I created using C# to collect some data from a website. Now I want to use that data to make graphs or otherwise show the info on a website. Is it possible to do file I/O in JavaScript, or what is my best option here? Thanks in advance.
You have several options at your disposal:
Use a server-side technology (like ASP.NET, Node.js, etc.) to load, parse and display the file contents as HTML
Put the file on a web server and use AJAX to load and parse it. As #Quantastical suggested in his comment, convert the file to JSON format for easier handling in JavaScript.
Have the original program save the file in HTML format instead of text, and serve that page. You could just serve the .txt file as-is, but the user experience would be horrible.
Option 1 probably makes the most sense, with a combination of 1 and 2, to achieve some dynamic behavior, being the most recommended. The second option is sketched below.
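A minimal sketch of option 2, assuming the data has been converted to a JSON file served from the same origin (the file name and record shape are made up):
// Fetch the JSON file with AJAX; the parsed result is ready for a charting library
$.getJSON('data.json', function (data) {
    console.log(data);   // e.g. [{ "date": "2016-01-01", "value": 42 }, ...]
});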
If you are working in C# and ASP.NET, then one option is to render the HTML from the server without any need for JavaScript.
In C# the System.IO namespace gives access to the File object.
String theText = File.ReadAllText(fileName);
or
String[] theTextLines = File.ReadAllLines(fileName);
or, if you have JSON or XML in the file, you can also read it and deserialize it into an object for easier use.
When you have the text you can create the ASP/HTML elements with the data. A crude example would be:
// Create a generic <div> control, fill it with the loaded text, and add it to the page
HtmlGenericControl label = new HtmlGenericControl("div");
label.InnerHTML = theText;
Page.Controls.Add(label);
There are also HTMLEncode and HTMLDecode methods if you need them.
Of course, that is a really crude example of loading the text on the server and then adding HTML to the ASP page. Your question doesn't say where you want this processing to happen; JavaScript might be better, or a combination of C# and JavaScript.
Lastly to resolve a physical file path from a virtual path you can use HttpContext.Current.Server.MapPath(virtualPath). A physical path is required to use the File methods shown above.

what language do i use to write a webpage in to automatically update from a database of sorts?

I have what I consider a bit of a tricky question. I am currently working on quite a large spreadsheet (266 rows by 70 columns, and it's only going to get bigger) that is a database of sorts, and I want to move it out of Excel and onto an intranet page. I am currently writing it in a combination of HTML and JavaScript for functionality, but it is becoming very hard to ensure that the data is in the right place. I am wondering if there is a way to save the Excel spreadsheet in a certain format (like CSV or XML) and then write a program (for an HTML page) that would display all of the information in a table automatically. Is this even possible?
Unfortunately I do not have access to a server to help with this; it all needs to be coded in the page itself.
Thank you for all your input, guys and gals.
Based on your comment, a normalized database for this type of thing would look like this:
table `workers`
- id
- name
- ...
table `trainings`
- id
- title
- description
- ...
table `workers_in_training`
- worker_id
- training_id
This allows you to create a logical matrix as well, without the need to change the schema (keep adding columns) for each new training/worker. Of course, this realistically requires a database server of some sort and knowledge of a server-side programming language (PHP, Python, Ruby, C#, anything). If you don't have that, an Access database/app may be an acceptable compromise. Doing it all in JavaScript is certainly interesting, but it is an idea you should abandon as early as possible.
Given your constraints, I would save the Excel spreadsheet as a CSV and put it in the same location as your HTML file, then use AJAX to fetch the contents of the CSV and dynamically generate an HTML table based on the contents.
Look here for how to fetch a URL's contents using AJAX (jQuery library): http://api.jquery.com/jQuery.get/
After fetching the URL content, you will have the CSV as a big string in a JavaScript variable. I'll let you have the fun of figuring out how to parse it :-)
Once you know how to parse your CSV string to recognise rows and columns, look here for how to generate an HTML table dynamically using the jQuery library: Building an HTML table on the fly using jQuery
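Putting those pieces together, a rough sketch; the naive split ignores quoted commas, so a real CSV parser is advisable for anything non-trivial:
// Fetch the CSV, split it into rows and cells, and build a table from it
$.get('data.csv', function (csv) {
    var table = $('<table>');
    $.each(csv.trim().split('\n'), function (i, line) {
        var row = $('<tr>');
        $.each(line.split(','), function (j, cell) {
            // Treat the first row as headers
            row.append($(i === 0 ? '<th>' : '<td>').text(cell));
        });
        table.append(row);
    });
    $('body').append(table);
});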

How to display information contained in XML file from another website

I have an XML file (an XML file I produce) which contains information about my partners.
I want them to display, on their own websites, the information relevant to them by picking it out of the XML file.
I have no idea how to do that, except that I think I need to write a 'parser' in JavaScript to display the information. This JavaScript code, I guess, has to be on my partner's website.
Could you please provide me with examples of how to do that? (How to write a parser, and how to display only the information for one partner?)
Thank you,
Regards
I think this is a standard problem. Each browser exposes its own version of a Document-type object, which you can build and then use methods like getElementsByTagName on to grab a particular node in the XML and process its data.
A few links you can look at:
http://www.hiteshagrawal.com/javascript/javascript-parsing-xml-in-javascript
http://www.mikechambers.com/blog/2006/01/09/parsing-xml-in-javascript/
I would suggest you use the Prototype library for dealing with this.
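Whatever library you use, the idea is the same. A rough jQuery-flavored sketch, where the URL, element names, and partner id are assumptions about the XML's structure:
// Fetch the XML and show only one partner's entry; names and ids are made up
$.get('http://example.com/partners.xml', function (xml) {
    $(xml).find('partner[id="42"]').each(function () {
        var name = $(this).find('name').text();
        $('#partner-info').text(name);
    });
}, 'xml');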

Learn how to make Flair for my users (javascript snippets)

I wanted to give my users a little piece of JavaScript or HTML code that they could put on their site and show information about them, kind of like Stack Overflow's new Flair feature.
I have an idea of how to code it. I was going to give them some JS plus HTML containing a div with id="MySite_Info". The JS would then call my site, pull some JSON or XML, and fill in the data inside that div on their site.
Is there a better way to do this? Or any examples online I should follow? What's the best way to create these JavaScript snippets? (Not sure what the proper name is.)
There are two basic options:
Images (and pictures of text suck)
JavaScript - as you described
The approach I would take would be to:
Dynamically generate the JS using a server side process. This would include data for the user (using a JSON generator to easily produce the data in a suitable format).
Build the badge using standard DOM methods
Find the element with the known div id and appendChild the generated badge
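A hypothetical sketch of the generated script; the ids, field names, and URL in the comment are all made up:
// The user embeds something like:
//   <div id="MySite_Info"></div>
//   <script src="http://yoursite.example/flair.js?user=123"></script>
// and the server renders flair.js with that user's data baked in:
var userData = { name: 'Jane', points: 1234 };   // injected server-side

// Build the badge using standard DOM methods
var badge = document.createElement('div');
badge.className = 'mysite-flair';
badge.appendChild(document.createTextNode(userData.name + ': ' + userData.points + ' points'));

// Find the placeholder div and append the generated badge
document.getElementById('MySite_Info').appendChild(badge);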
