Extract(Read) > Transform(Test&Apply) > Load(Learn) Blog

Understand Your Data with Oracle’s In-Database Statistical Functions


Especially after 10g Oracle’s in-database functionalities like Analytics, Grouping and XML Functions, Regular Expressions and Model Clause are much more popular. Since they are in-database, all are scalable, performant and parallelizable, as a result knowing them makes a lot of difference. Here is a documentation linked list of 10.2 functions, unlike other statistical software vendors all these functionalities are FREE inside the Enterprise Edition license.

There are two more categories which requires mentioning: Oracle’s In-Database Data Mining and Statistical Functions. These statistical functions include descriptive statistics, hypothesis testing, correlations analysis, test for distribution fit, cross tabs with Chi-square statistics, analysis of variance(ANOVA), and Pareto analysis. Also DBMS_STAT_FUNCS.SUMMARY procedure summarizes all the basic descriptive statistics for a table’s attribute;

-- stats_mode: for most common value
select median(salary), stats_mode(salary) from employees ;

-------------- ------------------
          6200               2500

select stats_t_test_one(salary, 1000) two_sided_p_value from employees ;


select corr_s(employee_id, salary) coefficient from employees ;


set serveroutput on
	s1 DBMS_STAT_FUNCS.SummaryType;
    -- must be a numeric column
	DBMS_STAT_FUNCS.SUMMARY(p_ownername=> 'HR', p_tablename=> 'EMPLOYEES', p_columnname=>'SALARY', p_sigmavalue=>3, s=> s1);
	dbms_output.put_line('SUMMARY STATISTICS');
	dbms_output.put_line('Count: '||s1.count);
	dbms_output.put_line('Min: '||s1.min);
	dbms_output.put_line('Max: '||s1.max);
	dbms_output.put_line('Range: '||s1.range);
	dbms_output.put_line('Mean: '||round(s1.mean));
	dbms_output.put_line('Mode Count: '||s1.cmode.count);
	dbms_output.put_line('Mode: '||s1.cmode(1));
	dbms_output.put_line('Variance: '||round(s1.variance));
	dbms_output.put_line('Stddev: '||round(s1.stddev));
	dbms_output.put_line('Quantile 5 '||s1.quantile_5);
	dbms_output.put_line('Quantile 25 '||s1.quantile_25);
	dbms_output.put_line('Median '||s1.median);
	dbms_output.put_line('Quantile 75 '||s1.quantile_75);
	dbms_output.put_line('Quantile 95 '||s1.quantile_95);
	dbms_output.put_line('Extreme Count: '||s1.extreme_values.count);
	dbms_output.put_line('Extremes: '||s1.extreme_values(1));
	dbms_output.put_line('Top 3: '||s1.top_5_values(1)||','||s1.top_5_values(2)||','||s1.top_5_values(3));
	dbms_output.put_line('Bottom 3: '||s1.bottom_5_values(5)||','||s1.bottom_5_values(4)||','||s1.bottom_5_values(3));
Count: 107
Min: 2100
Max: 24000
Range: 21900
Mean: 6462
Mode Count: 1
Mode: 2500
Variance: 15283141
Stddev: 3909
Quantile 5 2500
Quantile 25 3100
Median 6200
Quantile 75 8900
Quantile 95 12700
Extreme Count: 1
Extremes: 24000
Top 3: 24000,17000,17000
Bottom 3: 2100,2200,2200

PL/SQL procedure successfully completed.

After 11g mining the unstructured data is possible with these functions. All these in-database functionality other than SQL have additional Java APIs, so cost is not the only advantage as SQL and Java are the most popular standard IT platforms of today. With all these in-database functionality Oracle database now is a sophisticated, scalable and secure analytical engine more than ever. :)


ps: 2nd BIWA summit presentations are available at http://www.oraclebiwa.org and currently it is free to join Oracle BIWA SIG. :)