Scalar Subqueries


What are scalar subqueries?

Scalar subqueries are a great feature of Oracle, and is another example of how Oracle's SQL implementation really does set itself apart from all the rest. Scalar subqueries allow you to embed SQL statements which return a scalar value "within" SQL statements, otherwise known as "SELECTing a SELECT".

The simplest example is something like this :
SQL> SELECT
  2    ( SELECT sysdate FROM dual )
  3  FROM dual;

(SELECTSY
---------
07-MAR-06
So, as you can see, no longer do you need PL/SQL functions to issue a cursor and return a value, you can embed it directly in the SELECT statement.

So, what advantages does this bring? Well, there are many, from the simplifying of cursors, to the reduction of
context switching. I go through the major reasons why you may want to consider this type of approach in the following sections.

An alternative to OUTER JOINs

Okay, imagine the situation, you've got a cursor, driving off "t1" and you need to join to "t2" to get some lookup value, BUT you realise that you actually need an OUTER JOIN to "t2", since it may be deficient. So, what do you do? Well, you could do this :
SQL> CREATE TABLE t1 ( a   VARCHAR2(10), b   VARCHAR2(10) );

Table created.

SQL> CREATE TABLE t2 ( b   VARCHAR2(10), b_desc   VARCHAR2(100) );

Table created.

SQL> INSERT INTO t1 ( a, b ) VALUES ( 'A1', 'B1' );

1 row created.

SQL> INSERT INTO t1 ( a, b ) VALUES ( 'A1', 'B2' );

1 row created.

SQL> INSERT INTO t2 ( b, b_desc ) VALUES ( 'B1', 'B1 DESCRIPTION' );

1 row created.

SQL> SELECT t1.a, t1.b, t2.b_desc
  2    FROM
  3      t1
  4      LEFT OUTER JOIN t2 ON ( t2.b = t1.b );

A          B          B_DESC
---------- ---------- --------------------
A1         B1         B1 DESCRIPTION
A1         B2
And there's absolutely nothing wrong with this. However, there are situations where implementing an OUTER JOIN may negatively impact performance, especially on queries more complex than this one. With scalar subqueries, this can be rewritten as :
SQL> SELECT t1.a, t1.b, ( SELECT t2.b_desc FROM t2 WHERE t2.b = t1.b ) b_desc
  2    FROM t1;

A          B          B_DESC
---------- ---------- --------------------
A1         B1         B1 DESCRIPTION
A1         B2
Taking advantage of the fact that scalar subqueries return NULL when there's "no data found". There have been countless situations where this type of approach (especially when coupled with
Scalar subquery caching, which we'll come on to next) has been the cause of significant performance enhancements. Oracle can simply come up with a more efficient plan when the OUTER JOIN(s) are "removed".

Note, see Using scalar subqueries to return multiple values to understand how to use this technique when you are referencing more than one column from the deficient table.

Scalar subquery "caching"

This is perhaps (to me) the most important, and surprising, advantage of using scalar subqueries, the ability of Oracle to "cache" the results of a scalar subquery and reuse the value, even though the query "should" be called more than once. It's easiest to explain with an example. For this, I need a function which is specifically written to record how many times the function is called. This function does this by incrementing the v$session column CLIENT_INFO via the built-in Oracle package DBMS_APPLICATION_INFO.

Note, there is an important note later on regarding the datatype of the inputs to the function.
SQL> CREATE OR REPLACE FUNCTION f_subq_test( p_in IN NUMBER )
  2    RETURN NUMBER
  3  AS
  4  BEGIN
  5    DBMS_APPLICATION_INFO.SET_CLIENT_INFO(USERENV('client_info')+1);
  6    RETURN p_in;
  7  END f_subq_test;
  8  /

Function created.
Okay, now we can begin. To prove the function does what it should, we'll call it in a SQL statement which returns 10 rows :
SQL> exec DBMS_APPLICATION_INFO.SET_CLIENT_INFO(0);

PL/SQL procedure successfully completed.

SQL> SELECT f_subq_test(1) FROM ( SELECT null FROM dual CONNECT BY LEVEL <= 10 );

F_SUBQ_TEST(1)
--------------------------------------------------------------------------------
1
1
1
1
1
1
1
1
1
1

10 rows selected.

SQL> SELECT client_info FROM v$session WHERE audsid = USERENV('SESSIONID');

CLIENT_INFO
----------------------------------------------------------------
10
Okay, so it works OK. But, consider for a moment, what this would mean in a production environment. Basically, the function has the same inputs ("x") and returns the same outputs ("x"), this function would be what is called a deterministic function. In this situation, though, this is grossly inefficient. The function is called 10 times, even though it's doing the same thing each time. Compare this with any of your functions in SELECT statements (or WHERE clauses etc.), which return the same value for the same input.

How can we optimise this? Well, it should be obvious where I'm going on this. Wrapping the function call in a scalar subquery actually causes Oracle to cache the results and reuse them :
SQL> SELECT ( SELECT f_subq_test(1) FROM dual )
  2    FROM ( SELECT null FROM dual CONNECT BY LEVEL <= 10 );

(SELECTF_SUBQ_TEST(1)FROMDUAL)
-------------------------------------------------------------------------------
1
1
1
1
1
1
1
1
1
1

10 rows selected.

SQL> SELECT client_info FROM v$session WHERE audsid = USERENV('SESSIONID');

CLIENT_INFO
----------------------------------------------------------------
1
So, suddenly we have a single function call for the whole query. The effect of this simple change can be dramatic, even for relatively "inexpensive" function calls.

This is in itself a good enough reason, but this is a relatively simple case. What about the situation where you have a "correlated" function call? Well, there are optimisations here as well, but they're a bit more subtle. Let's run our example again, but this time, reference in the function call a value from the query itself, rather than just "x". Note, i've modified the query here such that we get 10 rows but only 2 distinct values. The reason for this will become obvious :
SQL> exec DBMS_APPLICATION_INFO.SET_CLIENT_INFO(0);

PL/SQL procedure successfully completed.

SQL> SELECT f_subq_test(i.rn)
  2    FROM ( SELECT MOD(rownum, 2) rn FROM dual CONNECT BY LEVEL <= 10 ) i;

F_SUBQ_TEST(I.RN)
--------------------------------------------------------------------------------
1
0
1
0
1
0
1
0
1
0

10 rows selected.

SQL> SELECT client_info FROM v$session WHERE audsid = USERENV('SESSIONID');

CLIENT_INFO
----------------------------------------------------------------
10
Okay, that's what we'd expect, 10 function calls. Can we optimize this? Well, yeah, the same thing applies, but the result is surprising :
SQL> exec DBMS_APPLICATION_INFO.SET_CLIENT_INFO(0);

PL/SQL procedure successfully completed.

SQL> SELECT ( SELECT f_subq_test(i.rn) FROM dual )
  2    FROM ( SELECT MOD(rownum, 2) rn FROM dual CONNECT BY LEVEL <= 10 ) i;

(SELECTF_SUBQ_TEST(I.RN)FROMDUAL)
-------------------------------------------------------------------------------
1
0
1
0
1
0
1
0
1
0

10 rows selected.

SQL> SELECT client_info FROM v$session WHERE audsid = USERENV('SESSIONID');

CLIENT_INFO
----------------------------------------------------------------
2
Cool. We have 2 function calls, because there are two distinct "inputs" (in this case 1 and 0).

So, how many does this go up to? How many distinct "inputs" can there be? Okay, let's try various scenarios. Note, each change to the query is simply a case of changing the 2 and 10 in the above query.

As mentioned above, there is a tremendous difference also when dealing with different datatypes. To highlight this, I've compared the f_subq_test function when using inputs and return values as NUMBER and VARCHAR2.

Note, in order to get these values, i've used the following script. Note, that I ran this each time twice after recompiling the function f_subq_test to change the datatypes of the input and return value.
SET ECHO OFF

VARIABLE num_rows NUMBER
VARIABLE distinct_vals  NUMBER

EXEC :num_rows := 10;      -- this is modified
EXEC :distinct_vals := 5;  -- so is this

EXEC DBMS_APPLICATION_INFO.SET_CLIENT_INFO(0);

SET TERMOUT OFF
SELECT ( SELECT f_subq_test(i.rn) FROM dual )
  FROM ( SELECT MOD(rownum, :distinct_vals) rn FROM dual CONNECT BY LEVEL <= :num_rows ) i
/
SET TERMOUT ON

SELECT client_info FROM v$session WHERE audsid = USERENV('SESSIONID')
/
I've collated example outputs into the following table, for both 10g instances and 9i instances. There are substantial differences (deviations from the theoretical "ideal", i.e. number of distinct values, are highlighted in bold) :

Rows Distinct Values Number of Function calls for VARCHAR2 (10g) Number of Function calls for VARCHAR2 (9i) Number of Function calls for NUMBER (10g) Number of Function calls for NUMBER (9i)
16 7 7 7 7 7
16 8 9 8 8 8
16 9 10 9 9 9
20 10 12 10 10 10
29 20 21 20 20 20
30 20 22 21 20 20
100 50 85 56 50 50
270 200 254 200 200
274 200 258 200 200
275 200 259 201
300 200 284 206
500 200 468 225

So, what's going on here? Well, it's obvious that there's some kind of caching algorithm occurring which changes behaviour from 9i to 10g. It's also obvious that there is a radical difference between how this algorithm behaves when dealing with numeric or character datatypes, with numerics "winning the battle of efficiency". At 10g, with numerics, it's not until 275 rows with 200 distinct values that the number of function calls deviates from the theoretical ideal, whereas with VARCHAR2 it starts at 16 rows and 8 distinct values. It is also interesting to note that there is no difference in these figures with the "sizes" of the datatypes, i.e. much bigger character values or numeric values. It's also interesting to note that the order of the initial data can also have a big effect on the efficiency of the caching algorithm. According to
this AskTom extensive forum entry (search for "caching"), it's definitely fair to say that you'll get the most efficiency if your data is ordered relative to the inputs of the function.

On this very subject, both Jonathan Lewis and Tom Kyte had a Usenet discussion, early in 2005. You can see the conversation here, but the upshot of it is that Oracle is building an internal hash table of a fixed size for the different datatypes, when dealing with scalar subqueries.

So, what can we say? Well, we can say that Oracle will "cache" and reuse function calls written in scalar subqueries for every distinct function "input" up to a certain point. This is a stunning revelation. Remember all those queries you have where a PL/SQL function call does a lookup for an input which is repeated "n" times? Well, now we have a technique where we can cache and reuse those values. You may not get the theoretical minimum function calls, but there will certainly be a reduction in the number of times the function gets called.

Reduction in "context switching"

Context switching is a term used when an environment has to accede to another environment. An example would be a SQL statement calling a PL/SQL function. The SQL "environment" has to call PL/SQL in order to evaluate. In Oracle, this operation is relatively expensive, so any mechanism for reducing the number of context switches should be beneficial.

For this, I'll revert to my old friend, runstats. Note, I use a modified version of the original
RunStats, developed by Tom Kyte, but the output is pretty much the same, I just split the output into several "reports".

In order to test this, I'll create the simplest case possible. A PL/SQL function which issues a SELECT from dual compared with a scalar subquery, i.e.
SQL> CREATE OR REPLACE FUNCTION f_cs_test RETURN VARCHAR2
  2  AS
  3    l_dummy   VARCHAR2(1);
  4  BEGIN
  5    SELECT dummy
  6      INTO l_dummy
  7      FROM dual;
  8
  9    RETURN l_dummy;
 10  END f_cs_test;
 11  /

Function created.
So, for the purposes of this example, I'm comparing :
SQL> SELECT f_cs_test
  2    FROM dual;

F_CS_TEST
---------------------------
X

1 row selected.
With
SQL> SELECT ( SELECT dummy FROM dual )
  2    FROM dual;

(
-
X
Both of which produce the same output.

The runstats analysis, at 10g, shows the following radical differences between calling a PL/SQL function ("Run1") and a scalar subquery ("Run2") :
Name                                                     Run1       Run2       Diff
-----------------------------------------------------------------------------------
LATCH.simulator hash latch                                 13         18          5
LATCH.simulator lru latch                                  13         18          5
STAT...consistent changes                                  14         21          7
STAT...db block gets                                       14         23          9
STAT...db block changes                                    22         31          9
STAT...db block gets from cache                            14         23          9
STAT...session logical reads                              321        334         13
LATCH.cache buffers chains                                684        705         21
STAT...undo change vector size                          1,988      2,064         76
STAT...recursive calls                                    202        102       -100
STAT...calls to get snapshot scn: kcmgss                  401        301       -100
STAT...execute count                                      201        101       -100
STAT...redo size                                        2,564      2,772        208
LATCH.library cache pin                                   604        204       -400
LATCH.library cache                                       705        205       -500

*** Latch Report Summary ***
Description                                              Run1       Run2       Diff    Pct (%)
----------------------------------------------------------------------------------------------
Total Latches                                           2,043      1,179       -864        173
As you can see, there is significantly less work done by Oracle using the scalar subquery, drops in library cache activity, and of course, the optimisations on the execute counts themselves.

Using scalar subqueries to return multiple values

This is an extremely important section, since it describes how to use scalar subqueries to return more than one value. The definition of a scalar subquery is that it can be used to return a single scalar value only, i.e. attempting to do the following results in error :
SQL> SELECT ( SELECT dummy, dummy FROM dual )
  2    FROM dual;
SELECT ( SELECT dummy, dummy FROM dual )
         *
ERROR at line 1:
ORA-00913: too many values
However, it may be required to do this, especially in the case of
removing outer joins. Let's say you have the following setup :
SQL> CREATE TABLE t1 ( a VARCHAR2(10), b VARCHAR2(10) );

Table created.

SQL> CREATE TABLE t2 ( b VARCHAR2(10), b_desc   VARCHAR2(10),   b_value  VARCHAR2(10) );

Table created.

SQL> INSERT INTO t1 ( a, b ) VALUES ( 'A1', 'B1' );

1 row created.

SQL> INSERT INTO t1 ( a, b ) VALUES ( 'A1', 'B2' );

1 row created.

SQL> INSERT INTO t2 ( b, b_desc, b_value ) VALUES ( 'B1', 'B1 DESC', 'B1 VALUE' );

1 row created.
Now, the requirement is to display for every row in t1, the values of b_desc and b_value in t2, but t2 would traditionally need an outer join, since it could be deficient. The query you would probably come up with would be :
SQL> SELECT t1.a, t1.b, t2.b_desc, t2.b_value
  2    FROM
  3      t1
  4      LEFT OUTER JOIN t2 ON ( t2.b = t1.b );

A          B          B_DESC     B_VALUE
---------- ---------- ---------- ----------
A1         B1         B1 DESC    B1 VALUE
A1         B2

2 rows selected.
Which, of course, is perfectly reasonable. However, as mentioned in a previous section, it is possible that the existence of outer joins, for example, could lead to an inefficient plan, especially where complex queries are involved. So, is there an alternative to this? Well, most people would say no, since we need two columns from t2, HOWEVER, there is a way, and it uses the Object relational features of Oracle.

Remember, that a scalar subquery returns a single scalar value, however, there's nothing to stop this scalar "value" being an instance of an OBJECT type, i.e.
SQL> CREATE OR REPLACE TYPE t2_obj AS OBJECT ( b_desc  VARCHAR2(10), b_value  VARCHAR2(10) );
  2  /

Type created.

SQL> SELECT
  2    x.a,
  3    x.b,
  4    x.t2_obj.b_desc,
  5    x.t2_obj.b_value
  6  FROM
  7    ( SELECT t1.a, t1.b, ( SELECT t2_obj(t2.b_desc, t2.b_value)
  8                             FROM t2
  9                            WHERE t2.b = t1.b ) t2_obj
 10        FROM t1 ) x;

A          B          T2_OBJ.B_D T2_OBJ.B_V
---------- ---------- ---------- ----------
A1         B1         B1 DESC    B1 VALUE
A1         B2

2 rows selected.