NOT IN子查询中出现NULL值对结果的影响你注意到了吗-Toy模板网

这篇具有很好参考价值的文章主要介绍了NOT IN子查询中出现NULL值对结果的影响你注意到了吗。希望对大家有所帮助。如果存在错误或未考虑完全的地方，请大家不吝赐教，您也可以点击"举报违法"按钮提交疑问。

前言

开发人员写的SQL语句中经常会用到in，exists，not in，not exists 这类子查询，通常，含in、exists的子查询称为半连接（semijoin），含not in、 not exists的子查询被称之为反连接，经常会有技术人员来评论in 与exists 效率孰高孰低的问题，我在SQL优化工作中也经常对这类子查询做优化改写，比如半连接改为内连接，反连接改为外连接等，哪个效率高是要根据执行计划做出判断的，本文不是为了讨论效率问题，是要提醒一点：not in子查询的结果集含NULL值时，会导致整个语句结果集返回空，这可能造成与SQL语句书写初衷不符。

实验

创建实验表t1，t2

greatsql> create table t1(c1 int primary key,c2 varchar(10), key idx_c1(c2));
greatsql> create table t2(c1 int primary key,c2 varchar(10)),key idx_c1(c2));

greatsql> insert into t1 values(1,'a'),(2,'b');
greatsql> insert into t2 values(1,'a'),(2,'c');

观察下面两条语句：

select * from t1 where t1.c2 not in (select t2.c2 from t2);

select * from t1 where not exists (select 1 from t2 where t2.c2=t1.c2);

这两个语句，从表达的含义来看是等价的，都是查询t1表中c2列值在t2表的c2列值中不存在的记录。

从子查询类型来看，第一条语句属于非关联查询，第二条语句属于关联子查询。所谓非关联子查询就是子查询中内查询可以独立执行，与外查询没有关系，互不影响。而关联子查询的执行依赖于外部查询，通常情况下都是因为子查询中的表用到了外部的表，并进行了条件关联，因此每执行一次外部查询，子查询都要重新计算一次。

从连接类型来看，使用not in与not exists子查询构造的语句都属于反连接，为了控制连接顺序与连接方式，这种反连接经常被改写为外连接，t1 与t2使用左外连接，条件加上右表t2的连接列 is null，也就是左外连接时没有关联上右表的数据，表达了这个含义“t1表中c2列值在t2表的c2列值中不存在的记录”。反连接改写为外连接，不会导致关联结果集放大，因为没有关联上的t1表数据只显示1条，半连接改为内连接时要注意去重。外连接语句如下所示：

greatsql> select t1.* from t1 left join t2 on t1.c2=t2.c2 where t2.c2 is null;

所以本质表达含义上，上面的三条语句都等价。

下面看一下三条语句的执行结果：

greatsql> select * from t1 where t1.c2 not in (select t2.c2 from t2);
+----+------+
| c1 | c2   |
+----+------+
|  2 | b    |
+----+------+
1 row in set (0.00 sec)

greatsql> select * from t1 where not exists (select 1 from t2 where t2.c2=t1.c2);
+----+------+
| c1 | c2   |
+----+------+
|  2 | b    |
+----+------+
1 row in set (0.01 sec)

greatsql> select t1.* from t1 left join t2 on t1.c2=t2.c2 where t2.c2 is null;
+----+------+
| c1 | c2   |
+----+------+
|  2 | b    |
+----+------+
1 row in set (0.00 sec)

可以看出就目前的数据，三条语句执行结果是相同的。

下面向子查询的t2中插入一条c2列为null的记录。

greatsql> insert into t2 values(3,null);

再观察一下三条语句的执行结果：

greatsql> select * from t1 where t1.c2 not in (select t2.c2 from t2);
Empty set (0.00 sec)

greatsql> select * from t1 where not exists (select 1 from t2 where t2.c2=t1.c2);
+----+------+
| c1 | c2   |
+----+------+
|  2 | b    |
+----+------+
1 row in set (0.00 sec)

greatsql> select t1.* from t1 left join t2 on t1.c2=t2.c2 where t2.c2 is null;
+----+------+
| c1 | c2   |
+----+------+
|  2 | b    |
+----+------+
1 row in set (0.00 sec)

可以看出，not exists表示的关联子查询与外连接方式表达的两条语句结果相同，而not in表示的非关联子查询的结果集为空。这是因为子查询select t2.c2 from t2 查询结果含有NULL值导致的。NULL属于未知值，无法与其他值进行比较，无从判断，返回最终结果集为空。这一点在MySQL与Oracle中返回结果都是一致的。如果想表达最初的含义，需要将子查询中NULL值去除。

greatsql> select * from t1 where t1.c2 not in (select t2.c2 from t2 where t2.c2 is not null);
+----+------+
| c1 | c2   |
+----+------+
|  2 | b    |
+----+------+
1 row in set (0.02 sec)

那么如果t1表的c2列也插入一条NULL值的记录后，结果集会怎样呢，两个表都存在c2列为NULL的值数据，那么t1表这条NULL值数据能否出现在最终结果集中呢？

greatsql> insert into t1 values(3,null);
Query OK, 1 row affected (0.07 sec)

greatsql> select * from t1 where t1.c2 not in (select t2.c2 from t2 where t2.c2 is not null);
+----+------+
| c1 | c2   |
+----+------+
|  2 | b    |
+----+------+
1 row in set (0.00 sec)

greatsql> select * from t1 where not exists (select 1 from t2 where t2.c2=t1.c2);
+----+------+
| c1 | c2   |
+----+------+
|  3 | NULL |
|  2 | b    |
+----+------+
2 rows in set (0.00 sec)

greatsql> select t1.* from t1 left join t2 on t1.c2=t2.c2 where t2.c2 is null;
+----+------+
| c1 | c2   |
+----+------+
|  3 | NULL |
|  2 | b    |
+----+------+
2 rows in set (0.00 sec)

从执行结果来看，使用not in非关联子查询，其执行结果与其他两条语句的执行结果还是不同，因为t1.c2 使用not in在参与比较时就隐含了t1.c2 is not null的含义，所以最终结果集中不含(3,NULL)这条数据。

而not exists关联子查询，在将外查询的NULL值传递给内查询时执行子查询 select * from t2 where t2.c2=NULL，子查询中找不到记录，所以条件返回false，表示not exists 为true，则最终结果集中含(3,NULL)这条记录。

左外left join 与 not exists相同，左表的NULL值在右表中关联不上数据，所以要返回（3,NULL)这条数据。这里要注意NULL 不等于 NULL。

greatsql> select NULL=NULL;
+-----------+
| NULL=NULL |
+-----------+
|      NULL |
+-----------+
1 row in set (0.01 sec)

说到这里，GreatSQL支持<=>安全等于这个符号，用来判断NULL值：当两个操作数均为NULL时，其返回值为1而不为NULL；而当一个操作数为NULL时，其返回值为0而不为NULL。

greatsql> select NULL<=>NULL;
+-------------+
| NULL<=>NULL |
+-------------+
|           1 |
+-------------+
1 row in set (0.00 sec)

greatsql> select 1<=>NULL;
+----------+
| 1<=>NULL |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

所以not exists 子查询中的= 换成 <=> 时，最终结果集中去除了(3,NULL)这条数据。

greatsql> select * from t1 where not exists (select 1 from t2 where t2.c2<=>t1.c2);
+----+------+
| c1 | c2   |
+----+------+
|  2 | b    |
+----+------+
1 row in set (0.00 sec)

注意，一般表关联时不使用<=>安全等于这个符号，想象一下，如果关联的两个表在关联字段上都存在很多NULL记录，关联后的结果集对NULL记录的关联是以笛卡尔积的形式体现的，严重影响效率，严格来说关联字段都为NULL值不能算作能匹配上。

结论

使用not in 的非关联子查询注意NULL值对结果集的影响，为避免出现空结果集，需要子查询中查询列加 is not null条件将NULL值去除。
实际使用时注意：需求表达的含义是否要将外查询关联字段值为NULL的数据输出，not in隐含了不输出。
一般认为not exists关联子查询与外连接语句是等价的，可以进行相互改写。

select * from t1 where not exists (select 1 from t2 where t2.c2=t1.c2);
   
select t1.* from t1 left join t2 on t1.c2=t2.c2 where t2.c2 is null;

如果不需要输出外查询中关联字段为NULL值的数据，还需再加条件 t1.c2 is not null。

select * from t1 where not exists (select 1 from t2 where t2.c2=t1.c2) and t1.c2 is not null;
   
select t1.* from t1 left join t2 on t1.c2=t2.c2 where t2.c2 is null and t1.c2 is not null;

这样写就与select * from t1 where t1.c2 not in (select t2.c2 from t2 where t2.c2 is not null)等价了。

Enjoy GreatSQL 😃