Scalar Subqueries
What are scalar subqueries?
Scalar subqueries are a great feature of Oracle, and is another example of how Oracle's SQL
implementation really does set itself apart from all the rest. Scalar subqueries allow you to
embed SQL statements which return a scalar value "within" SQL statements,
otherwise known as "SELECTing a SELECT".
The simplest example is something like this :
SQL> SELECT
2 ( SELECT sysdate FROM dual )
3 FROM dual;
(SELECTSY
---------
07-MAR-06
So, as you can see, no longer do you need PL/SQL functions to issue a cursor and return a value,
you can embed it directly in the SELECT statement.
So, what advantages does this bring? Well, there are many, from the simplifying of cursors, to the
reduction of context switching. I go through the major
reasons why you may want to consider this type of approach in the following sections.
An alternative to OUTER JOINs
Okay, imagine the situation, you've got a cursor, driving off "t1" and you need to join to "t2" to get
some lookup value, BUT you realise that you actually need an OUTER JOIN to "t2", since it may be
deficient. So, what do you do? Well, you could do this :
SQL> CREATE TABLE t1 ( a VARCHAR2(10), b VARCHAR2(10) );
Table created.
SQL> CREATE TABLE t2 ( b VARCHAR2(10), b_desc VARCHAR2(100) );
Table created.
SQL> INSERT INTO t1 ( a, b ) VALUES ( 'A1', 'B1' );
1 row created.
SQL> INSERT INTO t1 ( a, b ) VALUES ( 'A1', 'B2' );
1 row created.
SQL> INSERT INTO t2 ( b, b_desc ) VALUES ( 'B1', 'B1 DESCRIPTION' );
1 row created.
SQL> SELECT t1.a, t1.b, t2.b_desc
2 FROM
3 t1
4 LEFT OUTER JOIN t2 ON ( t2.b = t1.b );
A B B_DESC
---------- ---------- --------------------
A1 B1 B1 DESCRIPTION
A1 B2
And there's absolutely nothing wrong with this. However, there are situations where implementing an
OUTER JOIN may negatively impact performance, especially on queries more complex than this one.
With scalar subqueries, this can be rewritten as :
SQL> SELECT t1.a, t1.b, ( SELECT t2.b_desc FROM t2 WHERE t2.b = t1.b ) b_desc
2 FROM t1;
A B B_DESC
---------- ---------- --------------------
A1 B1 B1 DESCRIPTION
A1 B2
Taking advantage of the fact that scalar subqueries return NULL when there's "no data found".
There have been countless situations where this type of approach (especially when coupled with
Scalar subquery caching, which we'll come on to next) has
been the cause of significant performance enhancements. Oracle can simply come up with a more efficient
plan when the OUTER JOIN(s) are "removed".
Note, see Using scalar subqueries to return multiple values to
understand how to use this technique when you are referencing more than one column from the deficient
table.
Scalar subquery "caching"
This is perhaps (to me) the most important, and surprising, advantage of using scalar subqueries, the
ability of Oracle to "cache" the results of a scalar subquery and reuse the value, even though the
query "should" be called more than once. It's easiest to explain with an example.
For this, I need a function which is specifically written to record how many times the function is
called. This function does this by incrementing the v$session column CLIENT_INFO via the built-in
Oracle package DBMS_APPLICATION_INFO.
Note, there is an important note later on regarding the datatype of the inputs to the function.
SQL> CREATE OR REPLACE FUNCTION f_subq_test( p_in IN NUMBER )
2 RETURN NUMBER
3 AS
4 BEGIN
5 DBMS_APPLICATION_INFO.SET_CLIENT_INFO(USERENV('client_info')+1);
6 RETURN p_in;
7 END f_subq_test;
8 /
Function created.
Okay, now we can begin. To prove the function does what it should, we'll call it in a SQL statement
which returns 10 rows :
SQL> exec DBMS_APPLICATION_INFO.SET_CLIENT_INFO(0);
PL/SQL procedure successfully completed.
SQL> SELECT f_subq_test(1) FROM ( SELECT null FROM dual CONNECT BY LEVEL <= 10 );
F_SUBQ_TEST(1)
--------------------------------------------------------------------------------
1
1
1
1
1
1
1
1
1
1
10 rows selected.
SQL> SELECT client_info FROM v$session WHERE audsid = USERENV('SESSIONID');
CLIENT_INFO
----------------------------------------------------------------
10
Okay, so it works OK. But, consider for a moment, what this would mean in a production environment.
Basically, the function has the same inputs ("x") and returns the same outputs ("x"), this function
would be what is called a deterministic function. In this situation, though, this is grossly inefficient.
The function is called 10 times, even though it's doing the same thing each time. Compare this with
any of your functions in SELECT statements (or WHERE clauses etc.), which return the same value for
the same input.
How can we optimise this? Well, it should be obvious where I'm going on this. Wrapping
the function call in a scalar subquery actually causes Oracle to cache the results and reuse them :
SQL> SELECT ( SELECT f_subq_test(1) FROM dual )
2 FROM ( SELECT null FROM dual CONNECT BY LEVEL <= 10 );
(SELECTF_SUBQ_TEST(1)FROMDUAL)
-------------------------------------------------------------------------------
1
1
1
1
1
1
1
1
1
1
10 rows selected.
SQL> SELECT client_info FROM v$session WHERE audsid = USERENV('SESSIONID');
CLIENT_INFO
----------------------------------------------------------------
1
So, suddenly we have a single function call for the whole query. The effect of this simple change
can be dramatic, even for relatively "inexpensive" function calls.
This is in itself a good enough reason, but this is a relatively simple case. What about the
situation where you have a "correlated" function call? Well, there are optimisations here as well,
but they're a bit more subtle. Let's run our example again, but this time, reference in the
function call a value from the query itself, rather than just "x". Note, i've modified the query
here such that we get 10 rows but only 2 distinct values. The reason for this will become obvious :
SQL> exec DBMS_APPLICATION_INFO.SET_CLIENT_INFO(0);
PL/SQL procedure successfully completed.
SQL> SELECT f_subq_test(i.rn)
2 FROM ( SELECT MOD(rownum, 2) rn FROM dual CONNECT BY LEVEL <= 10 ) i;
F_SUBQ_TEST(I.RN)
--------------------------------------------------------------------------------
1
0
1
0
1
0
1
0
1
0
10 rows selected.
SQL> SELECT client_info FROM v$session WHERE audsid = USERENV('SESSIONID');
CLIENT_INFO
----------------------------------------------------------------
10
Okay, that's what we'd expect, 10 function calls. Can we optimize this? Well, yeah, the same thing
applies, but the result is surprising :
SQL> exec DBMS_APPLICATION_INFO.SET_CLIENT_INFO(0);
PL/SQL procedure successfully completed.
SQL> SELECT ( SELECT f_subq_test(i.rn) FROM dual )
2 FROM ( SELECT MOD(rownum, 2) rn FROM dual CONNECT BY LEVEL <= 10 ) i;
(SELECTF_SUBQ_TEST(I.RN)FROMDUAL)
-------------------------------------------------------------------------------
1
0
1
0
1
0
1
0
1
0
10 rows selected.
SQL> SELECT client_info FROM v$session WHERE audsid = USERENV('SESSIONID');
CLIENT_INFO
----------------------------------------------------------------
2
Cool. We have 2 function calls, because there are two distinct "inputs" (in this case 1 and 0).
So, how many does this go up to? How many distinct "inputs" can there be? Okay, let's try various
scenarios. Note, each change to the query is simply a case of changing the 2 and 10 in the above
query.
As mentioned above, there is a tremendous difference also when dealing with different datatypes.
To highlight this, I've compared the f_subq_test function when using inputs and return
values as NUMBER and VARCHAR2.
Note, in order to get these values, i've used the following script. Note, that I ran this each time
twice after recompiling the function f_subq_test to change the datatypes of the input and return value.
SET ECHO OFF
VARIABLE num_rows NUMBER
VARIABLE distinct_vals NUMBER
EXEC :num_rows := 10; -- this is modified
EXEC :distinct_vals := 5; -- so is this
EXEC DBMS_APPLICATION_INFO.SET_CLIENT_INFO(0);
SET TERMOUT OFF
SELECT ( SELECT f_subq_test(i.rn) FROM dual )
FROM ( SELECT MOD(rownum, :distinct_vals) rn FROM dual CONNECT BY LEVEL <= :num_rows ) i
/
SET TERMOUT ON
SELECT client_info FROM v$session WHERE audsid = USERENV('SESSIONID')
/
I've collated example outputs into the following table, for both 10g instances and 9i instances. There
are substantial differences (deviations from the theoretical "ideal", i.e. number of distinct values,
are highlighted in bold) :
Rows |
Distinct Values |
Number of Function calls for VARCHAR2 (10g) |
Number of Function calls for VARCHAR2 (9i) |
Number of Function calls for NUMBER (10g) |
Number of Function calls for NUMBER (9i) |
16 |
7 |
7 |
7 |
7 |
7 |
16 |
8 |
9 |
8 |
8 |
8 |
16 |
9 |
10 |
9 |
9 |
9 |
20 |
10 |
12 |
10 |
10 |
10 |
29 |
20 |
21 |
20 |
20 |
20 |
30 |
20 |
22 |
21 |
20 |
20 |
100 |
50 |
85 |
56 |
50 |
50 |
270 |
200 |
254 |
|
200 |
200 |
274 |
200 |
258 |
|
200 |
200 |
275 |
200 |
259 |
|
201 |
|
300 |
200 |
284 |
|
206 |
|
500 |
200 |
468 |
|
225 |
|
So, what's going on here? Well, it's obvious that there's some kind of caching algorithm occurring
which changes behaviour from 9i to 10g. It's also obvious that there is a radical difference between
how this algorithm behaves when dealing with numeric or character datatypes, with numerics "winning
the battle of efficiency". At 10g, with numerics, it's not until 275 rows with 200 distinct values that
the number of function calls deviates from the theoretical ideal, whereas with VARCHAR2 it starts
at 16 rows and 8 distinct values. It is also interesting to note that there is no difference in
these figures with the "sizes" of the datatypes, i.e. much bigger character values or numeric
values. It's also interesting to note that the order of the initial data can also have a big effect
on the efficiency of the caching algorithm. According to this
AskTom extensive forum entry (search for "caching"), it's definitely fair to say that you'll get the most
efficiency if your data is ordered relative to the inputs of the function.
On this very subject, both Jonathan Lewis and Tom Kyte had a Usenet discussion, early in 2005.
You can see the conversation here,
but the upshot of it is that Oracle is building an internal hash table of a fixed size for
the different datatypes, when dealing with scalar subqueries.
So, what can we say? Well, we can say that Oracle will "cache" and reuse function calls
written in scalar subqueries for every distinct function "input" up to a certain point.
This is a stunning revelation. Remember all those queries you have where a PL/SQL function call does
a lookup for an input which is repeated "n" times? Well, now we have a technique where we can
cache and reuse those values. You may not get the theoretical minimum function calls, but there will
certainly be a reduction in the number of times the function gets called.
Reduction in "context switching"
Context switching is a term used when an environment has to accede to another environment. An example
would be a SQL statement calling a PL/SQL function. The SQL "environment" has to call PL/SQL in order
to evaluate. In Oracle, this operation is relatively expensive, so any mechanism for reducing the
number of context switches should be beneficial.
For this, I'll revert to my old friend, runstats. Note, I use a modified version of the
original RunStats, developed
by Tom Kyte, but the output is pretty much the same, I just split the output into several "reports".
In order to test this, I'll create the simplest case possible. A PL/SQL function which issues a
SELECT from dual compared with a scalar subquery, i.e.
SQL> CREATE OR REPLACE FUNCTION f_cs_test RETURN VARCHAR2
2 AS
3 l_dummy VARCHAR2(1);
4 BEGIN
5 SELECT dummy
6 INTO l_dummy
7 FROM dual;
8
9 RETURN l_dummy;
10 END f_cs_test;
11 /
Function created.
So, for the purposes of this example, I'm comparing :
SQL> SELECT f_cs_test
2 FROM dual;
F_CS_TEST
---------------------------
X
1 row selected.
With
SQL> SELECT ( SELECT dummy FROM dual )
2 FROM dual;
(
-
X
Both of which produce the same output.
The runstats analysis, at 10g, shows the following radical differences between calling a PL/SQL
function ("Run1") and a scalar subquery ("Run2") :
Name Run1 Run2 Diff
-----------------------------------------------------------------------------------
LATCH.simulator hash latch 13 18 5
LATCH.simulator lru latch 13 18 5
STAT...consistent changes 14 21 7
STAT...db block gets 14 23 9
STAT...db block changes 22 31 9
STAT...db block gets from cache 14 23 9
STAT...session logical reads 321 334 13
LATCH.cache buffers chains 684 705 21
STAT...undo change vector size 1,988 2,064 76
STAT...recursive calls 202 102 -100
STAT...calls to get snapshot scn: kcmgss 401 301 -100
STAT...execute count 201 101 -100
STAT...redo size 2,564 2,772 208
LATCH.library cache pin 604 204 -400
LATCH.library cache 705 205 -500
*** Latch Report Summary ***
Description Run1 Run2 Diff Pct (%)
----------------------------------------------------------------------------------------------
Total Latches 2,043 1,179 -864 173
As you can see, there is significantly less work done by Oracle using the scalar subquery, drops
in library cache activity, and of course, the optimisations on the execute counts themselves.
Using scalar subqueries to return multiple values
This is an extremely important section, since it describes how to use scalar subqueries to return
more than one value. The definition of a scalar subquery is that it can be used to return a single
scalar value only, i.e. attempting to do the following results in error :
SQL> SELECT ( SELECT dummy, dummy FROM dual )
2 FROM dual;
SELECT ( SELECT dummy, dummy FROM dual )
*
ERROR at line 1:
ORA-00913: too many values
However, it may be required to do this, especially in the case of
removing outer joins. Let's say you have the following setup :
SQL> CREATE TABLE t1 ( a VARCHAR2(10), b VARCHAR2(10) );
Table created.
SQL> CREATE TABLE t2 ( b VARCHAR2(10), b_desc VARCHAR2(10), b_value VARCHAR2(10) );
Table created.
SQL> INSERT INTO t1 ( a, b ) VALUES ( 'A1', 'B1' );
1 row created.
SQL> INSERT INTO t1 ( a, b ) VALUES ( 'A1', 'B2' );
1 row created.
SQL> INSERT INTO t2 ( b, b_desc, b_value ) VALUES ( 'B1', 'B1 DESC', 'B1 VALUE' );
1 row created.
Now, the requirement is to display for every row in t1, the values of b_desc and b_value in t2, but
t2 would traditionally need an outer join, since it could be deficient. The query you would probably
come up with would be :
SQL> SELECT t1.a, t1.b, t2.b_desc, t2.b_value
2 FROM
3 t1
4 LEFT OUTER JOIN t2 ON ( t2.b = t1.b );
A B B_DESC B_VALUE
---------- ---------- ---------- ----------
A1 B1 B1 DESC B1 VALUE
A1 B2
2 rows selected.
Which, of course, is perfectly reasonable. However, as mentioned in a previous section, it is possible
that the existence of outer joins, for example, could lead to an inefficient plan, especially where
complex queries are involved. So, is there an alternative to this? Well, most people would say no,
since we need two columns from t2, HOWEVER, there is a way, and it uses the Object relational features
of Oracle.
Remember, that a scalar subquery returns a single scalar value, however, there's nothing to stop
this scalar "value" being an instance of an OBJECT type, i.e.
SQL> CREATE OR REPLACE TYPE t2_obj AS OBJECT ( b_desc VARCHAR2(10), b_value VARCHAR2(10) );
2 /
Type created.
SQL> SELECT
2 x.a,
3 x.b,
4 x.t2_obj.b_desc,
5 x.t2_obj.b_value
6 FROM
7 ( SELECT t1.a, t1.b, ( SELECT t2_obj(t2.b_desc, t2.b_value)
8 FROM t2
9 WHERE t2.b = t1.b ) t2_obj
10 FROM t1 ) x;
A B T2_OBJ.B_D T2_OBJ.B_V
---------- ---------- ---------- ----------
A1 B1 B1 DESC B1 VALUE
A1 B2
2 rows selected.