I'm looking to count the instances of each unique value in an HTML table and return the results in a table of its own. The table is generated from a user's text input. So, for example, the user's input might look like this:
Report 46 Bob Marley 4/20/2013 Summary: I shot the sheriff Case #32 User Error
Report 50 Billy The Kid 7/14/2013 Summary: I'm just a boy in a grown up world Case #33 User Experience
Report 51 Oscar The Grouch 10/10/2013 Summary: Refuse, reuse, recycle Case #33 User Experience
where the large spaces are tabs.
Which would return:
<table>
<tr>
<td>Bob Marley</td><td>46</td><td>4/20/2013</td><td>Case #32</td><td>User Error</td>
</tr>
<tr>
<td>Billy The Kid</td><td>50</td><td>7/14/2013</td><td>Case #33</td><td>User Experience</td>
</tr>
<tr>
<td>Oscar The Grouch</td><td>51</td><td>10/10/2013</td><td>Case #33</td><td>User Experience</td>
</tr>
</table>
What I need to do is 1) tally the number of reports, 2) tally the number of times each case number appears, and 3) tally the number of times each category appears, and then display it on the next page like this:
Number of reports:
3
Cases:
Case #33 - 2
Case #32 - 1
Categories:
User Experience - 2
User Error - 1
I'm looking for any suggestions on how I might approach this problem. I'm using and learning Javascript/HTML (and jQuery), but would be open to using PHP, SQL, etc. if those tools are more appropriate.
I was thinking of passing the table values to an array and then utilizing a for loop and regexes to count unique values, but I'm not sure if that's the best approach.
EDIT
One more detail that I didn't explicitly state is that I have access to the user input data (i.e. tab-separated text) prior to it being turned into a table. So, if it would be easier to tally the values in question prior to converting it into a table, then please let me know.
As for PHP, you can store the table HTML in a string and load it into a DOM parser.
http://simplehtmldom.sourceforge.net/
This is what we use for most of our projects involving page scraping, though it would work just as well for parsing your HTML from a string using their function:
$html = str_get_html($yourHtmlString);
Then you can loop through each tr, and from there you can look into each td to add to your tallies.
For example, to get the category of the third row, you would use:
$html->find("table", 0)->find("tr", 2)->find("td", 4)->plaintext;
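You can loop through the table like this:
$reportCount = 0;
$reportCases = array();
foreach ($html->find("table", 0)->find("tr") as $tableRow) {
    $reportCount++;
    // The case number is the fourth cell (index 3) of each row.
    $reportCases[] = $tableRow->find("td", 3)->plaintext;
}
// Count how many times each case number appears.
$caseCounts = array_count_values($reportCases);
and so on, also storing the other data you need (such as the category in cell index 4), then formatting it into your table output as needed.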
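Since the question mentions that the tab-separated text is available before it becomes a table, the tallies could also be computed directly from that text in JavaScript. The following is only a minimal sketch: the function name tallyReports and the assumed column order (report, name, date, summary, case, category) are illustrative guesses based on the sample input, not something the question confirms.

```javascript
// Tally reports, cases and categories from tab-separated report text.
// Assumption: columns are report, name, date, summary, case, category.
function tallyReports(text) {
  const rows = text
    .split(/\r?\n/)
    .filter((line) => line.trim() !== "")
    .map((line) => line.split("\t"));

  // Count how often each distinct value appears in one column.
  const countBy = (index) => {
    const counts = new Map();
    for (const row of rows) {
      const key = (row[index] || "").trim();
      counts.set(key, (counts.get(key) || 0) + 1);
    }
    // Most frequent values first.
    return [...counts.entries()].sort((a, b) => b[1] - a[1]);
  };

  return { reports: rows.length, cases: countBy(4), categories: countBy(5) };
}

const input = [
  "Report 46\tBob Marley\t4/20/2013\tSummary: I shot the sheriff\tCase #32\tUser Error",
  "Report 50\tBilly The Kid\t7/14/2013\tSummary: I'm just a boy in a grown up world\tCase #33\tUser Experience",
  "Report 51\tOscar The Grouch\t10/10/2013\tSummary: Refuse, reuse, recycle\tCase #33\tUser Experience",
].join("\n");

const tally = tallyReports(input);
console.log("Number of reports:", tally.reports);
for (const [name, count] of tally.cases) console.log(`${name} - ${count}`);
for (const [name, count] of tally.categories) console.log(`${name} - ${count}`);
```

A Map keeps the counting free of regexes, and sorting the entries by count gives the "most frequent first" order shown in the desired output.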
I'm using a BIRT report with a rather complicated query to get different metrics from multiple tables.
Those metrics are computed for data between two dates, set as parameters, and with different periodicity.
So I don't know how many rows will be in the output.
In the end I get a table like this:
Metric A | Metric B | Metric C
2017/01 1 0 4
2017/02 0 3 4
2017/03 1 2 3
In the report design, I need to display it transposed (the metric names are long and there are too many of them).
How can I do it within the report itself?
I think this is more difficult to be done in the query (I don't know how many rows I get in the result - it is dynamically taken from the dates and according to the periodicity).
I tried so far:
Cubes and crosstables - but how can I do it, when I don't know how many columns I get in the end?
Scripting - I can dynamically add more columns or create the table as it is. But how do I get the data from dataset? Here I tried to do something like this:
Firstly, in the dataset onFetch method, I put something in the report variable:
reportContext.setPersistentGlobalVariable("metric_1",row["metric_1"].toString());
Then, in the report beforeFactory, I try to access the variable:
mylabel.setText( reportContext.getPersistentGlobalVariable("metric_1") );
But the result is empty.
Is this a wrong way to do it? Do you have any ideas? I would like to do it in the script - set report data to the globalVariable, then in the beforeFactory access the resultSet and create table for it.
Do you have any other ideas?
P.S. it is very unfortunate that you can't have something like detailColumn (you do have detailRow).
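To make the desired result concrete, here is a small sketch of the transposition itself in plain JavaScript. It deliberately does not use any BIRT API: the function name transposeMetrics and the row shape (one object per period, with a hypothetical "period" key) are assumptions for illustration, and how the rows are collected from the dataset is left open.

```javascript
// Turn one-row-per-period data into one-row-per-metric data.
// Assumption: each input row is an object holding the period label
// under periodKey plus one property per metric name.
function transposeMetrics(rows, periodKey, metricNames) {
  // Header: a label cell followed by one column per period.
  const header = ["Metric", ...rows.map((row) => row[periodKey])];
  // Body: one output row per metric, with that metric's value for each period.
  const body = metricNames.map((name) => [name, ...rows.map((row) => row[name])]);
  return [header, ...body];
}

const fetched = [
  { period: "2017/01", "Metric A": 1, "Metric B": 0, "Metric C": 4 },
  { period: "2017/02", "Metric A": 0, "Metric B": 3, "Metric C": 4 },
  { period: "2017/03", "Metric A": 1, "Metric B": 2, "Metric C": 3 },
];

for (const line of transposeMetrics(fetched, "period", ["Metric A", "Metric B", "Metric C"])) {
  console.log(line.join("\t"));
}
```

Because the number of output columns depends only on rows.length, the same function works however many periods the date range produces.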
Software
I'm using Pentaho Data Integration 5.4
Input data & explanation
Input data from a file (simplified, there are more columns):
number name
1009 ProductA
2150 ProductB
3235 ProductC
ProductD
ProductE
1234 ProductF
7765 ProductG
4566 ProductH
ProductI
9907 ProductJ
The issue is that I had an Excel file (xlsx format) with merged cells, where one id value can have 1..n rows of values.
After converting that file to CSV, the values in the rows after the first are missing, except in the one column that was not merged (see example id=3, id=6).
I'm generating a sequence using step Add sequence, the input is sorted the way it was originally stored in a file.
Steps to achieve the goal
Basically what I need to do is:
Find first non-null value that has sequence_number less than current_row.sequence_number
Concatenate the value from field name to that matching row
Keep scanning next rows with sequence_number higher than the last scanned
As stated before, there can be 1..n rows of values for such case.
Expected output
number name
1009 ProductA
2150 ProductB
3235 ProductC; ProductD; ProductE
1234 ProductF
7765 ProductG
4566 ProductH; ProductI
9907 ProductJ
My approach
I believe I'm able to do this in a loop, by using Analytic Query and calculating LAG(1) and then concatenating the column name for one row with null values and discarding other column values from null row - and then doing this in a loop (for like 20 times assuming this is maximum), but I do consider this a bad idea.
There are probably better ways to achieve this result using for example Java Script step with scanning the rows backward from current (based on sequence number), but I'm unaware of those functions, if they do exist.
How can I achieve this using Modified Java Script Value step, or any other efficient way without using a loop for entire content of the file until there are no empty rows?
To solve this, I would use Modified Java Script Value to save the last seen product and use this for all rows, and then use Group By to group the columns.
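The idea of remembering the last seen value and then grouping can be sketched in plain JavaScript as follows. This is not the Modified Java Script Value step's own API; the function name mergeContinuationRows and the row shape ({ number, name }) are hypothetical, and the sketch assumes the rows arrive in their original file order.

```javascript
// Merge rows whose number is empty into the most recent row that has one.
// Assumption: rows are in original file order; an empty number is null or "".
function mergeContinuationRows(rows) {
  const groups = []; // one entry per non-empty number, in input order
  let current = null;
  for (const { number, name } of rows) {
    if (number !== null && number !== "") {
      // A new number starts a new group.
      current = { number, names: [name] };
      groups.push(current);
    } else if (current) {
      // An empty number continues the last seen group.
      current.names.push(name);
    }
  }
  return groups.map((group) => ({ number: group.number, name: group.names.join("; ") }));
}

const sample = [
  { number: "1009", name: "ProductA" },
  { number: "2150", name: "ProductB" },
  { number: "3235", name: "ProductC" },
  { number: "", name: "ProductD" },
  { number: "", name: "ProductE" },
  { number: "1234", name: "ProductF" },
];

for (const row of mergeContinuationRows(sample)) {
  console.log(`${row.number}\t${row.name}`);
}
```

Unlike a group-by on the number value, this groups consecutive rows, so the original order is kept and a single pass over the file is enough.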
Introduction
Merged adjacent cells in Excel files are presented on the image below.
When opened as a plain text file, it actually creates gaps (data from merged cell is missing) for every row but first that contains the merged cell.
number name
1000/P um6p1
um1p2
um1p3
1500 um2p1
9823 um3p1
83424 um4p1
um4p2
um4p3
um4p4
21390 um5p1
While @bolav's answer addresses the problem, there is a simpler and probably more efficient approach to this issue in Kettle.
Approach
In Microsoft Excel Input step go to Fields tab and mark Repeat option as Y for columns that store values in merged cells
Use Sort rows on number column because Group by step needs the input to be sorted
Group by on field number and aggregate name with Concatenate strings separated by as type and ; as value
From Pentaho User Guide:
Repeat If set to Y, will repeat this value if the field in the next row is empty.
I've got a table which manages user scores, e.g.:
id scoreA scoreB ... scoreX
------ ------- ------- ... -------
1 ... ... ... ...
2 ... ... ... ...
Now I want to create a scoreboard which can be sorted by each of the scores (descending only).
However, I can't just query the entries and send them to the client (which renders them with Javascript) as the table contains thousands of entries and sending all of those entries to the client would create unreasonable traffic.
I came to the conclusion that all non-relevant entries (entries which may not show up in the scoreboard as the score is too low) should be discarded on the server-side with the following rule of thumb:
If any of the scores is within the top ten for this specific score keep the entry.
If none of the scores is within the top ten for this specific score discard it.
Now my question is whether this can be done efficiently with (My)SQL, or whether this processing should take place in the PHP code querying the database, to keep the whole thing performant.
Any help is greatly appreciated!
Go with rows, not columns, for storing scores. Have composite index on userid,score. A datetime column could also be useful. Consider not having the top 10 snapshot table anyway, just the lookup that you suggest. So an order by score desc and Limit 10 in query.
Not that the below reference is the authority on Covering Indexes, but to throw the term out there for your investigation. Good luck.
You can try adding an INDEX on the specific score columns to improve performance.
This will query specific results for your kind of problem.
Read about it here
good luck, buddy.
I would first fire a query to obtain the top 10. Then fire the query to get the results, using the top 10 in your sql.
I can't formulate the query until I know what you mean by top 10 - give an example.
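For reference, the keep/discard rule from the question (keep an entry if any of its scores is within the top ten for that score) can be written down in a few lines of JavaScript. This is only a sketch of the rule, not a recommendation to do the filtering outside the database; the function name keepTopEntries and the entry shape ({ id, scoreA, ... }) are assumptions.

```javascript
// Keep only entries that rank within the top N for at least one score.
// Assumption: each entry has a unique id and numeric score properties.
function keepTopEntries(entries, scoreKeys, topN = 10) {
  const keep = new Set();
  for (const key of scoreKeys) {
    [...entries]                       // copy so the input order is untouched
      .sort((a, b) => b[key] - a[key]) // descending by this score
      .slice(0, topN)
      .forEach((entry) => keep.add(entry.id));
  }
  return entries.filter((entry) => keep.has(entry.id));
}

const scores = [
  { id: 1, scoreA: 5, scoreB: 1 },
  { id: 2, scoreA: 3, scoreB: 9 },
  { id: 3, scoreA: 1, scoreB: 2 },
];

// With topN = 1, entry 1 leads scoreA and entry 2 leads scoreB; entry 3 is dropped.
console.log(keepTopEntries(scores, ["scoreA", "scoreB"], 1).map((entry) => entry.id));
```

With X score columns the result holds at most 10 * X entries, which is the upper bound on what would have to be sent to the client.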
I've recently started using Interactive Reports in my Oracle APEX application. Previously, all pages in the application used Classic Reports. The Interactive Report in my new page works great, but, now, I'd like to add a summary box/table above the Interactive Report on the same page that displays the summed values of some of the columns in the Interactive Report. In other words, if my Interactive Report displays 3 distinct manager names, 2 distinct office locations, and 5 different employees, my summary box would contain one row and three columns with the numbers, 3, 2, and 5, respectively.
So far, I have made this work by creating the summary box as a Classic Report that counts distinct values for each column in the same table that my Interactive Report pulls from. The problem arises when I try to filter my interactive report. Obviously, the classic report doesn't refresh based on the interactive report filters, but I don't know how I could link the two so that the classic report responds to the filters from the interactive report. Based on my research, there are ways to reference the value in the Interactive Report's search box using javascript/jquery. If possible, I'd like to reference the value from the interactive table's filter with javascript or jquery in order to refresh the summary box each time a new filter is applied. Does anyone know how to do this?
Don't do JavaScript parsing on the filters. It's a bad idea: just think about how you would implement this. There's a massive amount of coding to be done and plenty of AJAX. And with APEX 5 literally around the corner, where does it leave you when the APIs and markup are about to change drastically?
Don't just give in to a requirement either. First make sure how feasible it is technically. And if it's not, make sure you make it abundantly clear what the implications are in regard of time consumption. What is the real value to be had by having these distinct value counts? Maybe there is another way to achieve what they want? Maybe this is nothing more than an attempted solution, and not the core of the real problem. Stuff to think about...
Having said that, here are 2 options:
First method: Count Distinct Aggregates on Interactive reports
You can add these to the IR through the Actions button.
Note though, that this aggregate will be THE LAST ROW! In the example I've posted here, reducing the rows per page to 5 would push the aggregate row to the pagination set 3!
Second Method: APEX_IR and DBMS_SQL
You could use the apex_ir API to retrieve the IR's query and then use that to do a count.
(Apex 4.2) APEX_IR.GET_REPORT
(Apex 5.0) APEX_IR.GET_REPORT
Some pointers:
Retrieve the region ID by querying apex_application_page_regions
Make sure your source query DOES NOT contain #...# substitution strings. (such as #OWNER#.)
Then get the report SQL, rewrite it, and execute it. Eg:
DECLARE
  l_report    apex_ir.t_report;
  l_query     varchar2(32767);
  l_statement varchar2(32000);
  l_cursor    integer;
  l_rows      number;
  l_deptno    number;
  l_mgr       number;
BEGIN
  l_report := APEX_IR.GET_REPORT (
                p_page_id   => 30,
                p_region_id => 63612660707108658284,
                p_report_id => null);
  l_query := l_report.sql_query;
  sys.htp.prn('Statement = '||l_report.sql_query);
  for i in 1..l_report.binds.count
  loop
    sys.htp.prn(i||'. '||l_report.binds(i).name||' = '||l_report.binds(i).value);
  end loop;
  l_statement := 'select count (distinct deptno), count(distinct mgr) from ('||l_report.sql_query||')';
  sys.htp.prn('statement rewrite: '||l_statement);
  l_cursor := dbms_sql.open_cursor;
  dbms_sql.parse(l_cursor, l_statement, dbms_sql.native);
  for i in 1..l_report.binds.count
  loop
    dbms_sql.bind_variable(l_cursor, l_report.binds(i).name, l_report.binds(i).value);
  end loop;
  dbms_sql.define_column(l_cursor, 1, l_deptno);
  dbms_sql.define_column(l_cursor, 2, l_mgr);
  l_rows := dbms_sql.execute_and_fetch(l_cursor);
  dbms_sql.column_value(l_cursor, 1, l_deptno);
  dbms_sql.column_value(l_cursor, 2, l_mgr);
  dbms_sql.close_cursor(l_cursor);
  sys.htp.prn('Distinct deptno: '||l_deptno);
  sys.htp.prn('Distinct mgr: '||l_mgr);
EXCEPTION WHEN OTHERS THEN
  IF DBMS_SQL.IS_OPEN(l_cursor) THEN
    DBMS_SQL.CLOSE_CURSOR(l_cursor);
  END IF;
  RAISE;
END;
I threw together the sample code from apex_ir.get_report and dbms_sql.
Oracle 11gR2 DBMS_SQL reference
Some serious caveats though: the column list is tricky. If a user has control of all columns and can remove some, those columns will disappear from the select list. Eg in my sample, letting the user hide the DEPTNO column would crash the entire code, because I'd still be doing a count of this column even though it will be gone from the inner query. You could block this by not letting the user control this, or by first parsing the statement etc...
Good luck.
On our web application, the search results are displayed in sortable tables. The user can click on any column and sort the result. The problem is some times, the user does a broad search and gets a lot of data returned. To make the sortable part work, you probably need all the results, which takes a long time. Or I can retrieve few results at a time, but then sorting won't really work well. What's the best practice to display sortable tables that might contain lots of data?
Thanks for all the advice. I will certainly be going over it.
We are using an existing Javascript framework that has the sortable table; "lots" of results means hundreds. The problem is that our users are at some remote site and a lot of delay is the network time to send/receive data from the data center. Sorting the data at the database side and only send one page worth of results at a time is nice; but when the user clicks some column header, another round trip is done, which always add 3-4 seconds.
Well, I guess that might be the network team's problem :)
Sorting and paging at the database level is the correct answer. If your query returns 1000 rows, but you're only going to show the user 10 of them, there is no need for the other 990 to be sent across the network.
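Here is a MySQL example. Say you need 10 rows, 21-30, from the 'people' table. The first number after LIMIT is a zero-based offset, so skipping the first 20 rows starts the page at row 21:
SELECT * FROM people LIMIT 20, 10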
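The offset arithmetic behind a paged query is easy to get off by one, so here is a small JavaScript sketch of it. The helper name pageWindow is hypothetical; it only computes the numbers that would be passed to a LIMIT clause with a zero-based offset and does not talk to a database.

```javascript
// Compute the zero-based offset and 1-based row range for a page.
// Assumption: pages are numbered from 1.
function pageWindow(pageNumber, rowsPerPage) {
  if (!Number.isInteger(pageNumber) || pageNumber < 1) {
    throw new RangeError("pageNumber must be an integer >= 1");
  }
  if (!Number.isInteger(rowsPerPage) || rowsPerPage < 1) {
    throw new RangeError("rowsPerPage must be an integer >= 1");
  }
  const offset = (pageNumber - 1) * rowsPerPage; // rows to skip
  return {
    offset,
    limit: rowsPerPage,
    firstRow: offset + 1,          // 1-based number of the first row on the page
    lastRow: offset + rowsPerPage, // 1-based number of the last row on the page
  };
}

// Page 3 with 10 rows per page covers rows 21-30, i.e. offset 20.
console.log(pageWindow(3, 10));
```

Note that a paged query only returns a stable page if it also has an ORDER BY, which is where the sort column chosen by the user would go.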
You should be doing paging back on the database server. E.g. on SQL 2005 and SQL 2008 there are paging techniques. I'd suggest looking at paging options for whatever system you're looking at.
What database are you using? There are some good paging options in SQL Server 2005 and upwards that use ROW_NUMBER to do the paging on the server. I found this good one on Christian Darie's blog.
For example, this procedure is used to page the products in a category. You just pass in the page number you want, the number of products per page, and so on.
CREATE PROCEDURE GetProductsInCategory
(@CategoryID INT,
 @DescriptionLength INT,
 @PageNumber INT,
 @ProductsPerPage INT,
 @HowManyProducts INT OUTPUT)
AS
-- declare a new TABLE variable
DECLARE @Products TABLE
(RowNumber INT,
 ProductID INT,
 Name VARCHAR(50),
 Description VARCHAR(5000),
 Price MONEY,
 Image1FileName VARCHAR(50),
 Image2FileName VARCHAR(50),
 OnDepartmentPromotion BIT,
 OnCatalogPromotion BIT)
-- populate the table variable with the complete list of products
INSERT INTO @Products
SELECT ROW_NUMBER() OVER (ORDER BY Product.ProductID),
       Product.ProductID, Name,
       SUBSTRING(Description, 1, @DescriptionLength) + '...' AS Description,
       Price, Image1FileName, Image2FileName, OnDepartmentPromotion, OnCatalogPromotion
FROM Product INNER JOIN ProductCategory
     ON Product.ProductID = ProductCategory.ProductID
WHERE ProductCategory.CategoryID = @CategoryID
-- return the total number of products using an OUTPUT variable
SELECT @HowManyProducts = COUNT(ProductID) FROM @Products
-- extract the requested page of products
SELECT ProductID, Name, Description, Price, Image1FileName,
       Image2FileName, OnDepartmentPromotion, OnCatalogPromotion
FROM @Products
WHERE RowNumber > (@PageNumber - 1) * @ProductsPerPage
  AND RowNumber <= @PageNumber * @ProductsPerPage
You could do the sorting on the server. AJAX would eliminate the necessity of a full refresh, but there'd still be a delay. Besides, databases are generally very fast at sorting.
For these situations I employ techniques on the SQL Server side that not only leverage the database for the sorting, but also use custom paging to ONLY return the specific records needed.
It is a bit of a pain to implement at first, but the performance is amazing afterwards!
How large is "a lot" of data? Hundreds of rows? Thousands?
Sorting can be done via JavaScript painlessly with Mochikit Sortable Tables. However, if the data takes a long time to sort (most likely a second or two [or three!]) then you may want to give the user some visual cue that something is happening and the page didn't just freeze. For example, tint the screen (a la Lightbox) and display a "sorting" animation or text.
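When the full result set is only a few hundred rows and is already in the browser, sorting it there avoids another round trip. The sketch below shows the sorting part in plain JavaScript; it is not MochiKit's API, and the function name sortRows and the row shape (an array of plain objects) are assumptions for illustration.

```javascript
// Return a sorted copy of the rows, ordered by one column.
// Assumption: rows are plain objects; numbers compare numerically,
// everything else compares as text.
function sortRows(rows, key, descending = false) {
  const direction = descending ? -1 : 1;
  return [...rows].sort((a, b) => {
    const x = a[key];
    const y = b[key];
    if (typeof x === "number" && typeof y === "number") {
      return (x - y) * direction;
    }
    return String(x).localeCompare(String(y)) * direction;
  });
}

const results = [
  { name: "Marley", reports: 2 },
  { name: "Grouch", reports: 7 },
  { name: "Kid", reports: 4 },
];

console.log(sortRows(results, "name").map((row) => row.name));
console.log(sortRows(results, "reports", true).map((row) => row.name));
```

Returning a copy keeps the originally fetched order available, so a later click on another column header starts from the same data.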