Extract(Read) > Transform(Test&Apply) > Load(Learn) Blog

October 4, 2008

Understand Your Data with Oracle’s In-Database Statistical Functions

Filed under: Oracle Business Intelligence — H.Tonguç Yılmaz @ 8:20 pm

Especially after 10g Oracle’s in-database functionalities like Analytics, Grouping and XML Functions, Regular Expressions and Model Clause are much more popular. Since they are in-database, all are scalable, performant and parallelizable, as a result knowing them makes a lot of difference. Here is a documentation linked list of 10.2 functions, unlike other statistical software vendors all these functionalities are FREE inside the Enterprise Edition license.

There are two more categories which requires mentioning: Oracle’s In-Database Data Mining and Statistical Functions. These statistical functions include descriptive statistics, hypothesis testing, correlations analysis, test for distribution fit, cross tabs with Chi-square statistics, analysis of variance(ANOVA), and Pareto analysis. Also DBMS_STAT_FUNCS.SUMMARY procedure summarizes all the basic descriptive statistics for a table’s attribute;


-- stats_mode: for most common value
select median(salary), stats_mode(salary) from employees ;

MEDIAN(SALARY) STATS_MODE(SALARY)
-------------- ------------------
          6200               2500

select stats_t_test_one(salary, 1000) two_sided_p_value from employees ;

TWO_SIDED_P_VALUE
-----------------
       8.2692E-27

select corr_s(employee_id, salary) coefficient from employees ;

COEFFICIENT
-----------
 -.04482358

set serveroutput on
declare
	s1 DBMS_STAT_FUNCS.SummaryType;
begin
    -- must be a numeric column
	DBMS_STAT_FUNCS.SUMMARY(p_ownername=> 'HR', p_tablename=> 'EMPLOYEES', p_columnname=>'SALARY', p_sigmavalue=>3, s=> s1);
	dbms_output.put_line('SUMMARY STATISTICS');
	dbms_output.put_line('Count: '||s1.count);
	dbms_output.put_line('Min: '||s1.min);
	dbms_output.put_line('Max: '||s1.max);
	dbms_output.put_line('Range: '||s1.range);
	dbms_output.put_line('Mean: '||round(s1.mean));
	dbms_output.put_line('Mode Count: '||s1.cmode.count);
	dbms_output.put_line('Mode: '||s1.cmode(1));
	dbms_output.put_line('Variance: '||round(s1.variance));
	dbms_output.put_line('Stddev: '||round(s1.stddev));
	dbms_output.put_line('Quantile 5 '||s1.quantile_5);
	dbms_output.put_line('Quantile 25 '||s1.quantile_25);
	dbms_output.put_line('Median '||s1.median);
	dbms_output.put_line('Quantile 75 '||s1.quantile_75);
	dbms_output.put_line('Quantile 95 '||s1.quantile_95);
	dbms_output.put_line('Extreme Count: '||s1.extreme_values.count);
	dbms_output.put_line('Extremes: '||s1.extreme_values(1));
	dbms_output.put_line('Top 3: '||s1.top_5_values(1)||','||s1.top_5_values(2)||','||s1.top_5_values(3));
	dbms_output.put_line('Bottom 3: '||s1.bottom_5_values(5)||','||s1.bottom_5_values(4)||','||s1.bottom_5_values(3));
end;
/
SUMMARY STATISTICS
Count: 107
Min: 2100
Max: 24000
Range: 21900
Mean: 6462
Mode Count: 1
Mode: 2500
Variance: 15283141
Stddev: 3909
Quantile 5 2500
Quantile 25 3100
Median 6200
Quantile 75 8900
Quantile 95 12700
Extreme Count: 1
Extremes: 24000
Top 3: 24000,17000,17000
Bottom 3: 2100,2200,2200

PL/SQL procedure successfully completed.

After 11g mining the unstructured data is possible with these functions. All these in-database functionality other than SQL have additional Java APIs, so cost is not the only advantage as SQL and Java are the most popular standard IT platforms of today. With all these in-database functionality Oracle database now is a sophisticated, scalable and secure analytical engine more than ever. :)

http://otn.oracle.com/bi

ps: 2nd BIWA summit presentations are available at http://www.oraclebiwa.org and currently it is free to join Oracle BIWA SIG. :)

Advertisements

DBMS_COMPARISON supplied package after 11g Release 1

Filed under: Oracle 11g New Features,Oracle How To — H.Tonguç Yılmaz @ 12:24 pm

I discussed 10g’s ORA_HASH SQL function to determine if two Oracle tables’ data are equal or not with this post before, especially if you are working with huge tables and they are on different locations this method is very effective.

In the same post I also mentioned some favorite schema and data comparison methods used within the community, so this one is a following post to that post to mention the new 11g supplied package; DBMS_COMPARISON.

After a fast startup demonstration based on psoug’s library I saw that this new feature also uses ORA_HASH SQL function. :) Also for the schema comparison part which dictionary checks are made can be followed inside a 10046 SQL trace as demonstrated below;


conn hr/hr

DROP TABLE depts PURGE ;

CREATE TABLE depts NOLOGGING AS SELECT * FROM departments ;

CREATE UNIQUE INDEX ui_depts ON depts(department_id) NOLOGGING ;

DELETE FROM depts WHERE department_id = 10;

1 row deleted.

COMMIT;

ALTER SESSION SET max_dump_file_size=unlimited;
ALTER SESSION SET timed_statistics = true;
ALTER SESSION SET statistics_level = ALL ;

ALTER SESSION SET tracefile_identifier = dbms_comparison ;

ALTER SESSION SET EVENTS '10046 TRACE NAME CONTEXT FOREVER, LEVEL 8';

exec dbms_comparison.create_comparison(comparison_name=>'compare_test', schema_name=>'HR', object_name=>'DEPTS', dblink_name=>NULL,  remote_schema_name=>'HR', remote_object_name=>'DEPARTMENTS',  scan_percent=>90);

set serveroutput on

DECLARE
  retval dbms_comparison.comparison_type;
BEGIN
  IF dbms_comparison.compare('compare_test', retval, perform_row_dif=>TRUE) THEN
    dbms_output.put_line('No Differences');
  ELSE
    dbms_output.put_line('Differences Found');
  END IF;
END;
/
Differences Found

PL/SQL procedure successfully completed.

ALTER SESSION SET EVENTS '10046 TRACE NAME CONTEXT OFF';

select value from v$diag_info where name = 'Default Trace File' ;

VALUE
----------------------------------------------------------------------------
/u01/app/oracle/diag/rdbms/tong/tong/trace/tong_ora_4402_DBMS_COMPARISON.trc

CMD > tkprof /u01/app/oracle/diag/rdbms/tong/tong/trace/tong_ora_4402_DBMS_COMPARISON.trc tong_ora_4402_DBMS_COMPARISON.txt sys=yes waits=yes sort=execpu

set null @
set colsep |
set linesize 2500

select *  FROM  user_comparison_row_dif ;

COMPARISON_NAME               |   SCAN_ID|LOCAL_ROWID       |REMOTE_ROWID

INDEX_VALUE

STA|LAST_UPDATE_TIME

COMPARE_TEST                  |         4|@                 |AAARADAAFAAAAA3AAA
10
DIF|04-OCT-08 02.58.03.306182 PM

exec dbms_comparison.drop_comparison(comparison_name=>'compare_test') ;

There are several constraints to use this method like the local database that runs the subprograms in the DBMS_COMPARISON package must be at least 11.1 and the remote database must be 10.1 at least. Non-Oracle databases are not supported and the database character sets must be the same for the databases being compared. But I am pretty sure with feature releases this package will be enhanced and be used with other database features.

ps: During my unconference session at OOW08 – APEX Test Drive – I mentioned that Apex 3 on 11.1 under Utilities tab also has a GUI support to use this new supplied package.

Create a free website or blog at WordPress.com.