I have a set of URL links, some of which contain a table and some of which do not return any table.
I've inspected the source, and there is a particular table ID that loads at runtime. Is there a script or tool that can check each URL's content, see whether it contains that particular table ID, and return true?
I could use this script or tool to load a batch of URLs and see which ones contain the table I am looking for.
This is web scraping. Many languages have a way to do this.
Pick what you are most comfortable with. However, Python with the Beautiful Soup library is a nice choice.
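For example, here is a minimal Python sketch with requests and Beautiful Soup; the table ID, the URL list, and the batch handling are placeholders. Note that if the table is injected by JavaScript at runtime, the raw HTML may not contain it, and you would need a browser-driven tool such as Selenium instead:
import requests
from bs4 import BeautifulSoup

TABLE_ID = "my-table-id"  # placeholder: the table ID you are looking for

def has_table(url):
    # Download the raw HTML and check whether an element with that ID exists
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(id=TABLE_ID) is not None

urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholder: your batch of URLs
for url in urls:
    print(url, has_table(url))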
I'm trying to scrape data from this URL: https://drive.getbigger.io/#/stores. However, I can't find the XPath of the text I want to export, which is the producers' offers.
First I tried the IMPORTXML function in Google Sheets:
=IMPORTXML(A1;"/html/body/flt-ruler-host/div[23]/p")
and it gave me an N/A error: "the imported content is empty".
So I tried to scrape this website with add-ons and Parsehub, and each time it gave me a .csv file in which I couldn't find the data I want to export.
Also, I can't find the right XPath for the data I would like to scrape; when I use the inspection tool, the data isn't in the <body> part.
However, the XPath I use in my IMPORTXML function points to some code I found in the <body> part that is close to the text I'd like to extract (the producers' offers).
It seems that the XPath I am looking for is linked to some JavaScript code in the <head> part. Also, when I hover over the page with the selection tool in order to scrape the data, it selects the whole page, maybe because there is a "scroll <div>".
So I wonder if the website uses some kind of protection against scraping.
Please tell me:
Can I find the right XPath in order to scrape with the IMPORTXML function?
Should I extract the data with a Python script?
If the website blocks my attempts, how should I proceed?
You won't be able to scrape anything with the IMPORTXML formula, since the website uses dynamic rendering (JavaScript).
So yes, Python + Selenium (or other combinations) could do the job. The website won't block you if you follow some rules (switch the user agent, add pauses between requests).
You would probably need these XPaths:
Product description:
//p[1][string-length(text())>5][parent::flt-dom-canvas]
Product price:
//p[3][contains(text(),"€") and not (contains(text(),","))][parent::flt-dom-canvas]
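As a rough illustration (not tested against the live site), a Python + Selenium sketch using those XPaths could look like the following; the fixed wait time is a crude assumption you would want to replace with an explicit wait:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://drive.getbigger.io/#/stores")
time.sleep(10)  # crude pause to let the JavaScript render the page

# The XPaths suggested above
descriptions = driver.find_elements(By.XPATH, '//p[1][string-length(text())>5][parent::flt-dom-canvas]')
prices = driver.find_elements(By.XPATH, '//p[3][contains(text(),"€") and not (contains(text(),","))][parent::flt-dom-canvas]')

for d, p in zip(descriptions, prices):
    print(d.text, p.text)

driver.quit()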
However, I think the most elegant way to get the data is probably to use the API the website relies upon. With Google Sheets and a custom ImportJSON script, you can obtain something like this (result for "fromage" as the query):
It won't work out of the box; you'll have to modify some parts of the script, since the JSON is fetched with a POST request that needs headers. In a nutshell, you need to construct the payload, add headers to the request ("Bearer XXXXX"), and add a parameter to a function to retrieve the results.
All this depends on your objective and your expected output.
EDIT: For reference (constructing the payload, adding parameters), you can read:
https://developers.google.com/apps-script/reference/url-fetch/url-fetch-app#fetchurl,-params
Also look at the network tab of your browser's developer tools in order to find the URL of the API and the correct parameters to send.
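If you end up calling the API outside of Google Sheets, the same idea (a POST request with a payload and a Bearer header) can be sketched in Python with requests; the endpoint, payload fields, and token below are placeholders you would copy from the network tab:
import requests

API_URL = "https://example.com/api/search"   # placeholder: copy the real URL from the network tab
TOKEN = "XXXXX"                              # placeholder: the Bearer token seen in the request headers

payload = {"query": "fromage"}               # placeholder: mirror the payload the site actually sends
headers = {"Authorization": "Bearer " + TOKEN}

response = requests.post(API_URL, json=payload, headers=headers, timeout=10)
print(response.json())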
My website sells stuff, and I would like to customize the page title and meta description on certain pages when certain items are viewed. I want these custom titles and descriptions to be shown when the pages are shared on other websites, e.g. Twitter, FB, etc.
Basically, I want to customize the title and description based on the query string values. How is this possible? I've looked for a JS-based plugin or similar on GitHub as well, but had no luck.
The issue with JavaScript-rendered pages is that there's no guarantee a scraper will pick up the meta tags. The head is read before scripts are run, meaning that to lower overhead they likely won't even bother executing them. You're better off writing a script that writes the customised meta tags into the HTML itself.
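As a minimal illustration of that idea, here is a hedged Python/Flask sketch that writes the title and meta description into the HTML on the server, based on the query string; the route, parameter names, and lookup logic are placeholders for whatever your stack actually uses:
from flask import Flask, request, render_template_string

app = Flask(__name__)

TEMPLATE = """<!doctype html>
<html>
<head>
  <title>{{ title }}</title>
  <meta name="description" content="{{ description }}">
  <meta property="og:title" content="{{ title }}">
  <meta property="og:description" content="{{ description }}">
</head>
<body>...</body>
</html>"""

@app.route("/item")
def item():
    # Placeholder lookup: derive the title/description from the query string value
    item_id = request.args.get("id", "unknown")
    title = "Item %s - My Shop" % item_id
    description = "Details and pricing for item %s." % item_id
    return render_template_string(TEMPLATE, title=title, description=description)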
I am trying to build an application that, when provided with a .txt file filled with ISBN numbers, will visit the isbn.nu page for each ISBN by simply appending the ISBN to the URL, i.e. www.isbn.nu/<your ISBN number>.
After pulling up the page, I want to scan it for information about the book and store that in an Excel file.
I was thinking about opening a stream to the URL in Java, but I am not really sure how to extract the information from the HTML page. Storing the information will be done using the JExcel Java package.
My best guess would be to use JavaScript to extract the information, but I don't know how to call JavaScript from my Java program.
Is my idea plausible? If not, what do you suggest I do?
My goal: retrieve information from an HTML page and store it in an Excel file for each ISBN in a text file. There can be any number of ISBNs in the text file.
This isn't homework, by the way; I am simply doing this for an organization that donates books to Sudan. Currently they have five people cataloging these books manually, and I am one of them.
Jsoup is a useful tool for parsing a web page and getting data from it. You can do it in Java and it's pretty easy.
You can parse the text file, build the URL as a string, fetch it with Jsoup, then use Jsoup to parse out the information using the HTML tags on the page. Then you can store it however you want. You really don't need to use JavaScript at all if you're more comfortable with Java.
Example for reading a page and parsing it with Jsoup:
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();  // fetch and parse the page
Elements newsHeadlines = doc.select("#mp-itn b a");               // select elements with a CSS selector
Use a div into which you load your link (see here for an example of how to do that: http://api.jquery.com/load/).
After that, when the load is complete, you can check the names of the divs or spans used in the webpage and get their content with val() (http://api.jquery.com/val/) or text() (http://api.jquery.com/text/).
Here is text from the main page of www.isbn.nu:
Please note that isbn.nu is designed for manual searching by individuals. It is not intended as an information resource for automated retrieval, nor as a research tool for companies. isbn.nu reserves the right to deny access based on excessive requests.
Why not just use the free Google Books API, which returns book details in XML format? There are many classes available in Java for parsing XML feeds, which would make your life much easier.
See http://code.google.com/apis/books/ for more info.
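For what it's worth, the current Books API (v1) returns JSON rather than an XML feed; a small Python sketch of an ISBN lookup might look like this (field handling kept minimal):
import requests

def lookup_isbn(isbn):
    # Query the public Google Books volumes endpoint for a single ISBN
    url = "https://www.googleapis.com/books/v1/volumes"
    data = requests.get(url, params={"q": "isbn:" + isbn}, timeout=10).json()
    if data.get("totalItems", 0) == 0:
        return None
    info = data["items"][0]["volumeInfo"]
    return {"title": info.get("title"), "authors": info.get("authors"), "publisher": info.get("publisher")}

print(lookup_isbn("9780131101630"))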
Here are the steps needed:
Create a cURL request (you can use multiple cURL requests)
Get the body data
Parse the data
Make the Excel file
You can read HTML information using this guide.
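A rough sketch of those steps in Python (requests in place of cURL, Beautiful Soup for parsing, openpyxl for the Excel file); the field extraction is a placeholder since the page structure isn't specified here, and the terms-of-use note quoted above still applies:
import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook

isbns = open("isbns.txt").read().split()      # read the ISBN list from the text file
wb = Workbook()
ws = wb.active

for isbn in isbns:
    html = requests.get("http://www.isbn.nu/" + isbn, timeout=10).text   # steps 1-2: request and get the body
    soup = BeautifulSoup(html, "html.parser")                            # step 3: parse the data
    title = soup.title.string if soup.title else ""                      # placeholder: real fields need real selectors
    ws.append([isbn, title])

wb.save("books.xlsx")                                                    # step 4: make the Excel file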
A simple solution might be to use a Google Docs spreadsheet function like ImportXML(URL,path-expression).
More information and examples here:
http://www.seerinteractive.com/blog/importxml-cookbook/
http://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/
http://blog.ouseful.info/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/
What I want to mimic is the link-share feature Facebook provides. You simply enter the URL, and then FB automatically fetches an image, the title, and a short description from the target website. How would one program this in JavaScript with Node.js and whatever other JavaScript libraries may be required? I found an example using PHP's fopen function, but I'd rather not include PHP in this project.
Is what I'm asking an example of web scraping? Do I just need to retrieve the data from inside the meta tags of the target website, and then also get the image tags using CSS selectors?
If someone can point me in the right direction, that'd be greatly appreciated. Thanks!
Look at THIS post. It discusses scraping with node.js.
HERE you have lots of previous info on scraping with JavaScript and jQuery.
That said, Facebook doesn't actually guess what the title, description, and preview are; it (at least most of the time) gets that info from meta tags present on sites that want to be more accessible to FB users.
Maybe you could make use of that existing metadata to pull titles, descriptions, and image previews. The docs on the available metadata are HERE.
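Just to illustrate the meta-tag idea (the Open Graph og:title / og:description / og:image tags), here is a short sketch in Python; the same selectors translate directly to jsdom or cheerio in Node.js:
import requests
from bs4 import BeautifulSoup

def preview(url):
    # Fetch the page and pull the Open Graph metadata that link previews rely on
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    def og(prop):
        tag = soup.find("meta", property="og:" + prop)
        return tag["content"] if tag and tag.has_attr("content") else None
    return {"title": og("title"), "description": og("description"), "image": og("image")}

print(preview("https://example.com"))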
Yes, web scraping is required, and that's the easy part. The hard part is the generic algorithm to find headings and relevant text and images.
How to scrape
You can use jsdom to download the page and create a DOM structure on your server, then scrape it using jQuery on the server. You can find a good tutorial at blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs, as suggested by #generalhenry above.
What to scrape
I guess a good way to find the heading would be:-
var h;
// Look for the most important heading first: h1, then h2, down to h6
for (var i = 1; i <= 6; i++) {
    var el = $('h' + i).first();
    if (el.length) { h = el; break; }
}
Now h will hold the heading, or remain undefined if none is found. The alternative could be to simply use the page's title tag. :)
As for the images: list all, or the first few, images on the page that are reasonably large, so as to filter out sprites used for buttons, arrows, etc.
And while fetching the remote data, make sure the ProcessExternalResources flag is off. This will ensure that script tags for ads do not pollute the fetched page.
And yes, the relevant text would be in the tags that follow h.
I have what I consider a bit of a tricky question. I am currently working on quite a large spreadsheet (266 rows with 70 columns, and it's only going to get bigger) that is a database of sorts, and I want to move it out of Excel and onto an intranet page. I am currently writing it in a combination of HTML and JavaScript for functionality, but it is becoming very hard to ensure that the data is in the right place. I am wondering if there is a way to save the Excel spreadsheet in a certain format (like CSV or XML) and then write a program (for an HTML page) that would display all of the information in a table automatically. Is this even possible?
Unfortunately, I do not have access to a server to help with this; it all needs to be coded in the page itself.
Thank you for all your input, guys and gals.
Based on your comment, a normalized database for this type of thing would look like this:
table `workers`
- id
- name
- ...
table `trainings`
- id
- title
- description
- ...
table `workers_in_training`
- worker_id
- training_id
This also allows you to create a logical matrix without the need to change the schema (keep adding columns) for each new training/worker. Of course, this realistically requires a database server of some sort and knowledge of a server-side programming language (PHP, Python, Ruby, C#, anything). If you don't have that, an Access database/app may be an acceptable compromise. Doing it all in JavaScript is certainly interesting, but it is an idea you should abandon as early as possible.
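If you do eventually get access to a server (or even just a local database file), a minimal sketch of that schema using Python's built-in sqlite3 module could look like this; the table and column names simply follow the layout above:
import sqlite3

conn = sqlite3.connect("trainings.db")
conn.executescript("""
CREATE TABLE workers (
    id INTEGER PRIMARY KEY,
    name TEXT
);
CREATE TABLE trainings (
    id INTEGER PRIMARY KEY,
    title TEXT,
    description TEXT
);
CREATE TABLE workers_in_training (
    worker_id INTEGER REFERENCES workers(id),
    training_id INTEGER REFERENCES trainings(id)
);
""")
conn.commit()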
Given your constraints, I would save the Excel spreadsheet as a CSV and put it in the same location as your HTML file, then use AJAX to fetch the contents of the CSV and dynamically generate an HTML table based on the contents.
Look here for how to fetch a URL's contents using AJAX (jQuery library): http://api.jquery.com/jQuery.get/
After fetching the URL content, you will have the CSV as a big string in a JavaScript variable. I'll let you have the fun of figuring out how to parse it :-)
Once you know how to parse your CSV string to recognise rows and columns, look here for how to generate HTML table dynamically using jQuery library: Building an HTML table on the fly using jQuery