Best way to scrape a set of pages with mixed content

I’m trying to show a list of lunch venues around the office with today’s menus. The problem is that the websites that publish the lunch menus don’t all offer the same kind of content.
For instance, some of the websites offer a nice JSON output. Look at this one: it offers the English and Finnish course names separately, and everything I need is available. There are a couple of others like this.
But others don’t have such a nice output. Like this one. The content is laid out in plain HTML, and the English and Finnish food names are not in a consistent order. Also, food properties such as L, VL, VS, G, etc. are plain text, just like the food name.
What, in your opinion, is the best way to scrape all this data in different formats and turn it into usable data? I tried to make a scraper with Node.js (and PhantomJS, etc.), but it only works with one website, and it’s not very accurate with the food names.
Thanks in advance.

You may use something like kimonolabs.com; such services are much easier to use, and they give you APIs to update your side.
Remember that they work best for tabular data.

There may be simple algorithmic solutions to the problem. If there is a list of all available food names, this can be really helpful: you look for occurrences of a food name inside a document (for today).
If there is no food list, you may use TF/IDF. TF/IDF scores a word by how often it occurs in the current document relative to how many of the other documents contain it. But this solution needs enough data to work.
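To make the TF/IDF idea concrete, here is a minimal sketch in JavaScript. The sample pages and the function names (tokenize, tfidf) are made up for illustration, and the formula is the plain textbook variant (term frequency times the logarithm of the inverse document frequency), not a tuned implementation.

```javascript
// Split a text into lowercase word tokens (letters of any alphabet, so Finnish works too).
function tokenize(text) {
  return text.toLowerCase().match(/\p{L}+/gu) || [];
}

// Score every word of documents[index] by TF-IDF against the whole collection.
function tfidf(documents, index) {
  const tokenized = documents.map(tokenize);
  const words = tokenized[index];
  const scores = {};
  for (const word of new Set(words)) {
    // Term frequency: share of this document's tokens that are this word.
    const tf = words.filter((w) => w === word).length / words.length;
    // Inverse document frequency: words found in every document get log(1) = 0.
    const docsWithWord = tokenized.filter((doc) => doc.includes(word)).length;
    const idf = Math.log(documents.length / docsWithWord);
    scores[word] = tf * idf;
  }
  return scores;
}

// Hypothetical menu pages, one string per page.
const pages = [
  "Lunch today: salmon soup and rye bread",
  "Lunch today: chicken curry and rice",
  "Lunch today: mushroom risotto",
];
const scores = tfidf(pages, 0);
```

Words that appear on every page ("lunch", "today") score zero, while words specific to one page ("salmon") score highest, which is why they are good candidates for food names.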
I think the best solution is something like this:
Create a list of all the websites that should be scraped.
Write a driver class for each website's data.
Each driver is responsible for creating the general domain entity from its site's standard document.
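The driver idea above could look roughly like this in JavaScript. Everything here is an assumption made for illustration: the venue names, the JSON field names (courses, title_fi, title_en, properties), the "Finnish / English" line format, and the helper names. Fetching is left out, so each driver only turns an already-downloaded document into the common entity.

```javascript
// Common (normalized) entity that every driver must produce.
function makeMenuItem(venue, nameFi, nameEn, diets) {
  return { venue, nameFi, nameEn, diets };
}

// Driver for a hypothetical site that returns JSON with separate Finnish/English names.
const jsonDriver = {
  venue: "Venue A",
  parse(raw) {
    return JSON.parse(raw).courses.map((c) =>
      makeMenuItem(this.venue, c.title_fi, c.title_en, c.properties || [])
    );
  },
};

// Driver for a hypothetical site whose text has lines like "Kanakeitto / Chicken soup".
const textDriver = {
  venue: "Venue B",
  parse(raw) {
    return raw
      .split("\n")
      .filter((line) => line.includes("/"))
      .map((line) => {
        const [fi, en] = line.split("/").map((part) => part.trim());
        return makeMenuItem(this.venue, fi, en, []);
      });
  },
};

// Run every driver on its raw document and merge the results into one list.
function collectMenus(sources) {
  return sources.flatMap(({ driver, raw }) => driver.parse(raw));
}
```

The rest of the application then only deals with the common entity, and supporting a new site means adding one more driver.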
If you can use PHP, Simple HTML DOM Parser along with Guzzle would be a great choice. These two provide a jQuery-like path finder and a nice wrapper around HTTP.

You are touching on a really difficult problem. Unfortunately, there are no easy solutions.
Actually, there are two different parts to solve:
data scraping from different sources
data integration
Let's start with the first problem: data scraping from different sources. In my projects I usually process data in several steps. I have dedicated scrapers for each specific site I want, and each one processes its data in the following order:
fetch raw page (unstructured data)
extract data from page (unstructured data)
extract, convert and map data into page-specific model (fully structured data)
map data from fully structured model to common/normalized model
Steps 1-2 are scraping oriented and steps 3-4 are strictly data-extraction / data-integration oriented.
While you can implement steps 1-2 relatively easily using your own web scrapers or existing web services, data integration is the most difficult part in your case. You will probably need some machine-learning techniques (shallow, domain-specific natural language processing) along with custom heuristics.
For messy input like this one, I would process the lines separately and use a dictionary to strip out the known Finnish/English words, then analyse what is left. But this will never be 100% accurate, because of possible human input errors.
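As a rough illustration of the line-by-line dictionary idea, here is a JavaScript sketch. The diet codes come from the question; the tiny English word list, the sample line and the function name parseMenuLine are assumptions, and a real dictionary would be far larger.

```javascript
// Diet codes named in the question; extend as needed.
const DIET_CODES = new Set(["L", "VL", "VS", "G"]);

// Tiny hypothetical dictionary of known English words.
const ENGLISH_WORDS = new Set(["salmon", "soup", "chicken", "with", "rice"]);

// Classify each token of one menu line as a diet code, a known English word, or "the rest".
function parseMenuLine(line) {
  const diets = [];
  const english = [];
  const rest = [];
  for (const token of line.split(/[\s,()]+/).filter(Boolean)) {
    if (DIET_CODES.has(token)) {
      diets.push(token);
    } else if (ENGLISH_WORDS.has(token.toLowerCase())) {
      english.push(token);
    } else {
      rest.push(token); // unknown: probably Finnish, or a typo worth reviewing
    }
  }
  return { english: english.join(" "), rest: rest.join(" "), diets };
}
```

For a line such as "Lohikeitto Salmon soup (L, G)" this separates "Salmon soup", the leftover "Lohikeitto" and the codes L and G. Misspelled words end up in the leftover part, which is where the accuracy limit mentioned above shows up.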
I am also worried that your stack is not very well suited to such tasks. For this kind of processing I use Java/Groovy along with integration frameworks (Mule ESB / Spring Integration) to coordinate the data processing.
In summary: it is a really difficult and complex problem. I would rather accept lower input data coverage than aim to be 100% accurate (unless it is really worth it).

Related

Classifier to Predict activities taking place

I have multiple datasets here that I took from Kaggle. There are multiple CSV files, and each CSV file is made specifically for one activity: sitting, standing, walking, running, etc. The data is taken from sensors like accelerometers and gyroscopes. The values in the datasets are per axis: x, y and z.
Sample Data
Here is a sample dataset of jogging. Now I need to make classifiers in my program so that it can detect by itself whether the data is of jogging, sitting, standing, etc. I want to mix all the datasets in a single CSV file, upload it to my webpage, and then have the JavaScript code detect whether a particular row is of sitting, standing, jogging, etc. I don't want any code help; I just need a little explanation or a way to start coding it. How can I get started making such a classifier? I know it is kind of a broad question, but I think I have tried to explain myself in the best way possible. Once my program has labelled every row with a specific activity, it will count all the activities separately and then show the counts in a table on the webpage.

To answer your question properly, it would be very helpful to know your level of understanding and experience with machine learning.
If you are a beginner, I would suggest running and trying to understand a couple of the tutorials that can easily be found on the web.
If you need an idea of the "standard" approach to machine learning development, I will try to give you a general idea of the process.
You can summarize the process in these main steps:
Data pre-processing -> Data splitting -> Feature selection -> Model training -> Validation -> Deployment
Data pre-processing is meant to clean and format the data: removing NA values, deciding how to handle categorical variables, outlier analysis, and so on. This is a complex step that depends on the application. In your case I would start by checking that the data in the different datasets are homogeneous, i.e. the features have the same meaning across CSV files and corresponding features follow the same distribution. While the meaning of each feature should be explained in the description of your CSV files, the distributions can easily be checked by drawing box plots for each feature and CSV file. If the distributions of the same feature across different CSV files don't overlap, you should investigate the issue further.
An important step in the design of a good model is splitting the data. You should split your data into training and validation sets (training/validation/test for a more comprehensive approach). This step allows you to train your model on the training set and test it on the validation set, giving an unbiased estimate of the model's performance. I suggest becoming familiar with concepts such as cross-validation, stratified cross-validation, nested cross-validation for hyper-parameter tuning, overfitting, bias, and so on. The validation of the model will give you an idea of the performance to expect on unseen data. If you are considering more than one model, you can use the validation results to choose the "best" one. Here I suggest a comparison using confidence intervals or, if possible, a significance test (e.g. t-test, ANOVA). Before deployment, the model is trained on all the available data.
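Since the question is about JavaScript, here is a small sketch of the data-splitting step as a stratified training/validation split. It assumes each row is an object with a label field (for example "jogging" or "sitting") next to its x, y and z values; the field and function names are made up for illustration.

```javascript
// Fisher-Yates shuffle on a copy of the array.
function shuffle(items) {
  const copy = items.slice();
  for (let i = copy.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy;
}

// Split rows into training and validation sets, keeping the share of each label the same.
function stratifiedSplit(rows, validationFraction) {
  // Group the rows by their activity label.
  const byLabel = new Map();
  for (const row of rows) {
    if (!byLabel.has(row.label)) byLabel.set(row.label, []);
    byLabel.get(row.label).push(row);
  }
  const training = [];
  const validation = [];
  // Take the same fraction of every group for validation.
  for (const group of byLabel.values()) {
    const shuffled = shuffle(group);
    const cut = Math.round(shuffled.length * validationFraction);
    validation.push(...shuffled.slice(0, cut));
    training.push(...shuffled.slice(cut));
  }
  return { training, validation };
}
```

Calling stratifiedSplit(rows, 0.2) keeps roughly 20% of every activity for validation, so a rare activity is not accidentally left out of either set.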
The choice of the model depends on the data you are using: number of samples, number of features, type of variables (numerical, categorical), and so on.
I'm not an expert in JavaScript, but I believe (just a feeling) that Python and R are more common choices for developing machine learning applications. Both have libraries specifically developed for the task, and you can find a lot of material and tutorials around.
With a bit more context I think I could be more specific.
I hope it helps.

RESTful API with related resources

I'm in the process of designing my first serious RESTful API, which will sit above a WCF service.
There are resources like outlet, schedule and job. A schedule is always owned by an outlet, and a schedule will contain 0 or more jobs. A job does not have to be on a schedule.
I keep coming back to thinking that resources should be addressable in the same way resources are addressed on a file system. This would mean I'd have URIs like:
/outlets
/outlets/4/schedules
/outlets/4/schedules/1000/jobs
/outlets/4/schedules/1000/jobs/5123
Things start to get messy, though, when considering how to pull resources back in different situations.
e.g. I want a job not on a schedule:
/outlets/4/jobs/85 (this now means we've got 2 ways to pull a job back that's on a schedule)
e.g. I want all schedules regardless of outlet:
/schedules or /outlets/ALL/schedules
There are also lots of other more complex requirements but I'm sure you get the gist.
File systems have a good, logical way of addressing resources. You can obviously create symbolic links and achieve something approximating what I describe, but it'd be messy. It'll be even messier once things get even slightly more complex, such as adding the ability to get schedules by date:
/outlets/4/2016-08-29/schedules
And without using query string parameters I'm not even sure how I'd request back all jobs that are NOT on a schedule. The following feels wrong because unscheduled is not a resource:
/outlets/4/unscheduled/jobs
So, I'm coming to think that file system type addressing is only going to work for the simplest of services (our underlying system has hundreds of entity types, with some very complex relationships and a huge number of operations).
Having multiple ways of doing the same thing tends to lead to confusion and messy documentation and I want to avoid it. As a result I'm almost forced to opt for going with the lowest common denominator and choosing very simple address forms - like the 3rd one below:
/outlets/4/schedules/1000/jobs/5123
/outlets/4/jobs/5123
/jobs/5123
From these very simple address forms I would then need to expand using query string parameters to do anything more complex, e.g:
/jobs?scheduleId=1000
/jobs?outletId=4
/jobs?outletId=4&fromDate=2016-01-01&toDate=2016-01-31
This feels like it's going against the REST model though and query string parameters like this aren't predictable, so far from the "no docs needed" idea.
OK, so at the minute I'm almost on the side of the fence that is saying in order to get a clean, maintainable API I'm going to have to go with very simple resource addresses, use query string parameters extensively and have good documentation.
Anyway, this doesn't feel like the conclusion I should have arrived at. Where have I gone wrong?

Welcome to the world of REST API programming. These are the hard problems which we all face when trying to apply general principles to specific situations. There is no clear and easy answer to your questions, but here are a few additional tips that may be useful.
First, you're right that the file-system approach to addressing breaks down when you have complex relationships. You'll only want to establish that sort of addressing when there's a true hierarchy there.
For example, if all jobs were part of a single schedule, then it would make sense to look to schedules/{id}/jobs/{id} to get to a given job. If you think of it from a data-storage perspective, you could imagine there being an XML file for each schedule, and the jobs would just be elements within that file.
However, it sounds like in this particular case your data is more relational. From a data-storage perspective, you'd represent each job as a row in a database table, and establish some foreign key relationships to tie some jobs to some schedules. Your addressing scheme should reflect this by making /jobs a top-level endpoint, and using optional query string parameters to filter by schedule or outlet when it makes sense to do so.
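As a sketch of what a top-level /jobs endpoint with optional filters might do behind the scenes, here is a JavaScript example using the query string parameters from the question. The in-memory jobs array, its field names and the function name findJobs are assumptions; a real service would translate the same parameters into a database query.

```javascript
// Hypothetical in-memory jobs; scheduleId is null for a job that is not on a schedule.
const jobs = [
  { id: 5123, outletId: 4, scheduleId: 1000, date: "2016-01-15" },
  { id: 85, outletId: 4, scheduleId: null, date: "2016-02-03" },
  { id: 9001, outletId: 7, scheduleId: 2000, date: "2016-01-20" },
];

// Apply the optional filters from a query string such as "outletId=4&fromDate=2016-01-01".
function findJobs(allJobs, queryString) {
  const params = new URLSearchParams(queryString);
  return allJobs.filter((job) => {
    if (params.has("outletId") && job.outletId !== Number(params.get("outletId"))) return false;
    if (params.has("scheduleId") && job.scheduleId !== Number(params.get("scheduleId"))) return false;
    // ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    if (params.has("fromDate") && job.date < params.get("fromDate")) return false;
    if (params.has("toDate") && job.date > params.get("toDate")) return false;
    return true;
  });
}
```

Every filter is optional, so /jobs, /jobs?outletId=4 and /jobs?outletId=4&fromDate=2016-01-01&toDate=2016-01-31 are all served by the same code path, and each job has exactly one address, /jobs/{id}.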
So you're on the right track. One more thing you might want to consider is OData, which extends the basic REST principles with a standards-oriented way of representing things like filtering with query string parameters. The address syntax feels a little "out there", but it does a pretty good job of handling the situations where straight REST starts falling apart. And because it's more standardized, there are tools available to help with things like translating from a data layer into an OData endpoint, or generating client-side proxy helpers based on the metadata exposed by that endpoint.
You wrote: "This feels like it's going against the REST model though and query string parameters like this aren't predictable, so far from the 'no docs needed' idea."
If you use OData, then its spec combines with the metadata produced by your tooling to become your documentation. For example, your metadata says that a job has a date property which represents a date. Then the OData spec provides a way to represent filter queries for a date value. From this information, consumers can reliably produce a filter query that will "just work" because you're using a framework server-side to do the hard parts. And if they don't feel like memorizing how OData URLs work, they can generate a client proxy in the language of their choice so they can generate the appropriate URL via their favorite syntax.
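To give a feel for what such a filter query looks like, here is a JavaScript sketch that builds an OData-style $filter URL. It assumes OData v4 operator syntax (eq, ge, le, and) and that the job entity exposes properties named outletId and date, with date being a plain date type; those property names and the function name buildJobsUrl are made up for illustration, and in practice a generated client proxy would do this for you.

```javascript
// Build a /jobs URL with an OData-style $filter from optional criteria.
function buildJobsUrl(baseUrl, { outletId, fromDate, toDate }) {
  const clauses = [];
  if (outletId !== undefined) clauses.push(`outletId eq ${outletId}`);
  if (fromDate !== undefined) clauses.push(`date ge ${fromDate}`);
  if (toDate !== undefined) clauses.push(`date le ${toDate}`);
  if (clauses.length === 0) return baseUrl;
  // Spaces in the filter expression must be percent-encoded in the URL.
  return `${baseUrl}?$filter=${encodeURIComponent(clauses.join(" and "))}`;
}
```

For example, buildJobsUrl("/jobs", { outletId: 4, fromDate: "2016-01-01", toDate: "2016-01-31" }) encodes the filter expression outletId eq 4 and date ge 2016-01-01 and date le 2016-01-31.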

How to make a multilingual website?

I am an amateur programmer and I want to make a multilingual website. My questions are (for my purposes, let the English website be website no. 1 and the Polish one no. 2):
Should it be en.example.com and pl.example.com or maybe example.com/en and example.com/pl?
Should I make the full website in one language and then translate it?
How do I translate it? Using XML or something else? Should website 1 and website 2 be different HTML files, or is there a way to translate an HTML file and then show the translation using XML or something?
If you need any code or anything else, tell me. Thank you in advance :)

1) I don't think it makes much difference. The most important thing is to ensure that Google can crawl both languages, so don't rely on JavaScript to switch between languages; have everything done server side so both languages can be crawled and ranked in Google.
2) You can do one translation and then the other; you just have to ensure that the layout expands to accommodate more or less text without breaking. Maybe use lorem ipsum whilst designing.
3) I would put the translations into a database and then load the matching translation depending on whether the domain name says EN or PL. Ensure that the webpage and the database both use UTF-8 encoding; otherwise you will get 'funny' characters being displayed.
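Picking the language from the address on the server could be sketched like this in JavaScript. It handles both schemes from the question (pl.example.com and example.com/pl); the list of supported languages, the default and the function name languageFromUrl are assumptions.

```javascript
const SUPPORTED_LANGUAGES = ["en", "pl"];
const DEFAULT_LANGUAGE = "en";

// Decide the page language from the requested URL, checking the subdomain first, then the path.
function languageFromUrl(url) {
  const { hostname, pathname } = new URL(url);
  const subdomain = hostname.split(".")[0];
  if (SUPPORTED_LANGUAGES.includes(subdomain)) return subdomain;
  const firstSegment = pathname.split("/")[1];
  if (SUPPORTED_LANGUAGES.includes(firstSegment)) return firstSegment;
  return DEFAULT_LANGUAGE;
}
```

Because the language is part of the URL and resolved on the server, each language version has its own crawlable address.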

My advice is to start using a framework.
For instance, if you use CakePHP, then you write
__('My name is')
and in the translation file
msgid "My name is"
msgstr "Nazywam się"
Then you can easily translate to any other language, and it's pretty easy to implement.
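To show the mechanism behind such a translation function, here is a small stand-in written in JavaScript. It is not CakePHP's implementation, only a sketch of the same lookup idea: the message id doubles as the fallback text when no translation exists. The translations object and the name makeTranslator are made up for illustration.

```javascript
// Translation tables keyed by language, then by message id (the original English text).
const translations = {
  pl: { "My name is": "Nazywam się" },
};

// Return a lookup function for one language; unknown ids fall back to the id itself.
function makeTranslator(language) {
  const table = translations[language] || {};
  return (msgid) =>
    Object.prototype.hasOwnProperty.call(table, msgid) ? table[msgid] : msgid;
}

const __ = makeTranslator("pl");
```

With this, __("My name is") gives the Polish text, and a string that has not been translated yet is simply shown in the original language.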
Also, if you do not want to use a framework, you can check this link to see an example of how it works:
http://tympanus.net/codrops/2009/12/30/easy-php-site-translation/

While this question is probably not a good SO question due to its broad nature, it might be relevant to many users.
My approach would be templates.
Your suggestion of having two HTML files is bad for the obvious reason of duplication: say you need to change something on your site. You would always need to change two HTML files, which is bad.
Having one html file and then parsing it and translating it sounds like a massive headache.
Some templating framework could help you massively. I have been using Smarty, but that's a personal choice and there are many options here.
The idea is that you make a template file for your HTML, and instead of actual content you use labels. Then in your PHP code you include the correct language depending on cookies, user settings or session data.
Storing the labels is another issue. Storing them in a database is a good option; however, remember that you do not want to run hundreds of queries against the database to fetch each label separately. What you can do is store them in a database and have it generate a language file (an array of label -> translation pairs) for faster access, and regenerate these files whenever you add or update labels.
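The "generate a language file from the database" step and the label replacement could be sketched as follows in JavaScript. The row shape (label, language, text), the {label} placeholder syntax and both function names are assumptions; Smarty and other template engines have their own syntax.

```javascript
// Turn label rows (as they might come from one database query) into one lookup object per language.
function buildLanguageFiles(rows) {
  const files = {};
  for (const { label, language, text } of rows) {
    if (!files[language]) files[language] = {};
    files[language][label] = text;
  }
  return files;
}

// Replace {label} placeholders in a template; unknown labels are left untouched.
function renderTemplate(template, labels) {
  return template.replace(/\{(\w+)\}/g, (whole, label) =>
    Object.prototype.hasOwnProperty.call(labels, label) ? labels[label] : whole
  );
}
```

The objects returned by buildLanguageFiles would be written to disk once (for example as JSON) and reloaded per request, so rendering a page needs no database query at all.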
Or you can skip the database altogether and just store them in files, however, as these grow they might not be as easy to maintain.

I think the easiest mistake for an "amateur programmer" to make in this area is to allow the two (or more) language versions of the site to diverge. (I've seen so-called professionals make that mistake too...) You need to design it so everything that's common between the two versions is shared, so that when the time comes to make changes, you only need to make the changes in one place. The way you do this will depend on your choice of tools, and I'm not going to start advising on that, because it depends on many factors, for example the extent to which your content is database-oriented.
Closely related to this are questions of who is doing the translation, how the technical developers and the translators work together, and how to keep track of which content needs to be re-translated when it changes. These are essentially process questions rather than coding questions, so not a good fit for SO.
Don't expect that a site for one language can be translated without any technical impact; you will find you have made all sorts of assumptions about the length of strings, the order of fields, about character coding and fonts, and about cultural things like postcodes, that turn out to be invalid when you try to convert the site to a different language.

You could make two language files and use PHP constants to define the text you want to translate, for example:
lang_pl.php:
define("_TEST", "polish translation");
lang_en.php:
define("_TEST", "English translation");
Now you can let the user choose between the Polish and English translations and include the matching language file.
So wherever you have a text field, you set its value to _TEST (between PHP tags),
and it will show the translation for the chosen option.

The place I worked did it like this: they didn't have too much text on their site, so they kept it in a database. As your question is tagged php, I assume you know how to use databases. They had a MySQL table called languages with a language id (in your case 1 for en and 2 for pl) and the texts in columns, so the column names were main heading, intro_text, about_us, and so on. When the user arrives, they select a language and a PHP request fetches the page in that language. This is easy because your content is static and can be translated before the site goes online. If your content is dynamic (users add content), you may need to use machine translation, because you cannot expect your users to write their entries in all languages.

Natural Language Processing Database Querying

I need to develop a natural language querying tool for a structured database. I tried two approaches:
using Python nltk (Natural Language Toolkit for Python)
using JavaScript and JSON (for the data source)
In the first case I did some NLP steps to format the natural-language query: removing stop words, stemming, and finally mapping keywords using feature-grammar mapping. This methodology works for simple scenarios.
Then I moved to the second approach: finding the data in JSON, getting the corresponding column name and table name, and then building an SQL query. For this one I also implemented stop-word removal and stemming, using JavaScript.
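For readers who want to picture the second approach, here is a toy JavaScript sketch of it: stop-word removal, a very crude stemmer, and a lookup of the remaining keywords in a JSON data source to find the column and table. The stop-word list, the data source, the stemmer and the function names are all assumptions made for illustration, and the sketch shows exactly the kind of limitation described here.

```javascript
const STOP_WORDS = new Set(["what", "which", "are", "is", "in", "the", "located", "of"]);

// Hypothetical JSON data source: table name -> rows.
const dataSource = {
  city_table: [
    { City: "beijing", Country: "china" },
    { City: "athens", Country: "greece" },
  ],
};

// Very crude stemmer: only strips plural endings ("cities" -> "city").
function stem(word) {
  if (word.endsWith("ies")) return word.slice(0, -3) + "y";
  if (word.endsWith("s")) return word.slice(0, -1);
  return word;
}

// Map a question to SQL: a stemmed keyword names the column, a raw keyword matches a stored value.
function toSql(question, source) {
  const words = (question.toLowerCase().match(/[a-z]+/g) || []).filter((w) => !STOP_WORDS.has(w));
  const stems = words.map(stem);
  for (const [table, rows] of Object.entries(source)) {
    if (rows.length === 0) continue;
    const columns = Object.keys(rows[0]);
    const selected = columns.find((c) => stems.includes(c.toLowerCase()));
    if (!selected) continue;
    for (const column of columns) {
      const value = words.find((w) => rows.some((row) => row[column] === w));
      if (value) return `SELECT ${selected} FROM ${table} WHERE ${column}="${value}"`;
    }
  }
  return null; // the question could not be mapped
}
```

A question like "What cities are located in China" maps to a query, but anything involving comparisons, joins or negation falls straight through to null.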
Both of these techniques have limitations. I want to implement a semantic search approach.
Can anyone please suggest a better approach?

Semantic parsing for NLIDB (natural language interfaces to databases) is a well-developed field with many techniques: rule-based methods (involving grammars) and machine learning techniques. They cover a large range of query inputs and give far better results than pure NL processing or regex methods.
The technique I favor is based on feature-based context-free grammars (FCFG). For starters, in the NLTK book available online, look for the string "sql0.fcfg". The code example shows how to map the phrase structure of the NL query "What cities are located in China" into the SQL query SELECT City FROM city_table WHERE Country="china" via the feature "SEM" (semantics) of the FCFG.
I recommend Covington's books
NLP for Prolog Programmers (1994)
Prolog Programming in Depth (1997)
They will help you go a long way. These PDFs are downloadable from his site.

As I commented, I think you should add some code, since not everyone has read the book.
Anyway, my conclusion is that yes, as you said, it has a lot of limitations, and the only way to handle more complex queries is to write very extensive and complete grammar productions, which is pretty hard work.

Is it better to store this information in an array or database?

I am creating a list and each item contains 3 properties: Caption, link, and image name. The list will be a "Trending now" list, where a related picture about the article is shown, a caption to the article, and a link to that article. Would it be best to store this information in an array or a database? With an array I could update the list, by adding 3 new properties, and removing the last 3. With a database, I could make a form where I submit the 3 properties and it'll update on its own without me touching the code. Would it be better to make this system in a Javascript array, or database? Wouldn't it be better to make it into an array for faster speeds? The list will have 10 items, each item has 3 properties.

For a list of 10 items, you can definitely go with a simple array. If you need to store a larger amount of data, then try localStorage.
Whichever solution you use, keep in mind that it will always be processed and stored in the browser.
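A minimal sketch of the array approach in JavaScript: each item is an object with the three properties from the question, new items go to the front, and the oldest one drops off once the list holds ten. The property names and the function name addTrending are assumptions.

```javascript
const MAX_ITEMS = 10;

// Add the newest item at the front and drop the oldest once the list is full.
// Returns a new array and leaves the original untouched.
function addTrending(list, item) {
  return [item, ...list].slice(0, MAX_ITEMS);
}

let trending = [];
trending = addTrending(trending, {
  caption: "Example article",
  link: "/articles/example",
  image: "example.jpg",
});
```

If the list should survive a page reload, the same array can be saved with localStorage.setItem("trending", JSON.stringify(trending)) and read back with JSON.parse.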

Razor - your question touches many principles of programming. Being a beginner myself, I remember having had exactly those questions not too long ago.
That is why I answer in that 'beginner's' spirit:
If this were to be a web application, then your 'trending now' list might be written as a ul list with li items, in a section of an index.html, with CSS to style the list and your page.
Then you might use JavaScript, jQuery, d3.js, etc., or other languages such as PHP, to access and do something with the data from those HTML elements.
Example (jQuery) to get a value from an element:
var collectedValue = $("#your_element_id").val();
To get values into an array, you would loop over your items:
var my_array = [];
$("#your_list_id li").each(function () {
my_array.push($(this).text()); // text content of each li item
});
Next you would have to decide how to store and retrieve values and or arrays.
This can be done client side, which more or less means it stays with the browser you are working in, or server side, which more or less means your browser interacts with a server on another computer to save and retrieve data.
Speed is an issue when you have huge data sets; otherwise it is not really an issue, certainly not for a short list. In my thinking, speed is a question for commercial applications.
If you use a database, you have to learn the database language and manage the database, access to it, and so on. It is not overly complex with MySQL and PHP (that is how I started), but you would have to learn it.
For HTML/CSS/JavaScript solutions, others have pointed out JSON, which is often used for purposes such as yours.
localStorage is very straightforward but has its limitations. Both of these are easy to understand.
To start, I myself worked through simple database examples with MySQL/PHP.
Then I improved my HTML/CSS experience. Then I started gaining experience with JavaScript.
Best of all would be to enable yourself to answer your own questions:
learn principles of html/css to present your trending now list
learn principles of javascript to manipulate your elements
Set up a server on your computer to learn how to interact with server side ( with MAMP or such packages)
learn principles of mysql/php
learn about storage options client or server side
It is fun, takes a while, and your preferences and programming solutions will depend on your abilities in the various languages.
I hope this answer is not perceived as being too simplistic or condescending; your question seemed to imply that such 'beginner's talk' might be helpful to you.
