Discussion:
QEP/RFC sqlite virtual tables
Matthias Kuhn
2014-10-22 09:21:33 UTC
Permalink
Moving this to a new thread (from 1:many relation...
Well, QgsExpression has been designed to mimic the SQL WHERE clause, so
yes I agree.
My point is that if you want to extend expression to support joins,
aggregates, window functions, etc. you end up recreating SQL, but
probably a less powerful one.
You can call it mimicking, or you can call it implementing its own
SQL (subset) dialect.
Yes, you end up recreating SQL. How powerful it will be is hard to say
right now, right?
The difference I can see between expressions and SQL is that functions
used in expressions may have no direct equivalent SQLite-side. But it
can be resolved by using user-defined functions in SQLite (that would
just call code from QgsExpression for instance), so that the SQL dialect
would be seen as an extension of the current QgsExpression, not
something different.
:)
I don't see anything that prevents virtual layer queries from being
optimized server-side. It has to be taken into account during the
implementation of virtual layers, of course, but it is hard to describe
precisely how it will work before having an almost complete
implementation.
I was referring to this discussion here.
https://github.com/qgis/QGIS-Enhancement-Proposals/pull/5#issuecomment-58148788
Sorry I did not respond again, but I don't think a complete optimization
is possible without deep access to virtual table functionality in
sqlite (or parsing the same SQL on our end and replacing parts of it,
which basically comes close to writing our own engine again).
But let's discuss this on the QEP PR instead of inside this ml thread.
Just to give you my feedback based on my implementation of the SQLite SQL dialect
in OGR (
https://github.com/OSGeo/gdal/blob/trunk/gdal/ogr/ogrsf_frmts/sqlite/ogrsqlitevirtualogr.cpp
), I also doubt that you could easily inform the server that a join should be
done. AFAIR I could forward simple filtering to the server/backend (things like
"column_A >= 5 AND column_A <= 10 AND column_B != 10") by implementing the
xFilter virtual method, but Virtual Tables are not informed of joins happening
(the sqlite virtual table API hardly exposes SQL at all). So that would indeed
require analyzing the SQL on QGIS side.
In fact, I've just verified that joins like "t1.col1 = t2.col2" are seen by the
virtual table implementation as successive filters "t1.col1 = val1", "t1.col1
= val2", where val1, val2 are values from t2.col2.
And regarding xFilter, even filters with OR, etc. are hidden from the virtual
table implementation and evaluated only by SQLite.
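To make Even's observation concrete, here is a small plain-Python model of the behaviour he describes. This is not the real SQLite C virtual table API (xBestIndex/xFilter), only a sketch of the contract under the assumption that SQLite forwards AND-connected comparisons as individual constraints and keeps OR expressions (and join logic) to itself.

```python
# Plain-Python model of the constraint pushdown described above.
# SQLite hands a virtual table AND-connected comparisons one by one;
# a join on t1.col1 = t2.col2 arrives as a sequence of equality
# filters, one per value taken from t2.col2.

OPS = {">=": lambda a, b: a >= b, "<=": lambda a, b: a <= b,
       "=": lambda a, b: a == b, "!=": lambda a, b: a != b}

class ModelVirtualTable:
    def __init__(self, rows):
        self.rows = rows  # list of dicts, one per feature

    def xFilter(self, constraints):
        """Return rows matching all pushed-down (ANDed) constraints."""
        return [r for r in self.rows
                if all(OPS[op](r[col], val) for col, op, val in constraints)]

t1 = ModelVirtualTable([{"col1": v} for v in (1, 5, 7, 10, 12)])
# "col1 >= 5 AND col1 <= 10" is forwarded as two separate constraints:
in_range = t1.xFilter([("col1", ">=", 5), ("col1", "<=", 10)])
# A join against t2.col2 values (5, 12) shows up as successive
# equality filters, never as a join the table could optimize:
joined = [t1.xFilter([("col1", "=", v)]) for v in (5, 12)]
print(in_range, joined)
```

Since the table only ever sees these per-row constraint calls, any join-level optimization indeed has to happen on the QGIS side, before SQLite is involved.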
Thanks Even for your feedback.
The idea behind virtual layers is that SQLite may be exposed to advanced
users if they want to use a complex query on some data sources.
From a code point of view, using SQLite to replace existing features is
very interesting because it simplifies the implementation.
But exposing a SQL query for already existing features such as joins may
not be desirable for standard users.
So for joins, since the information comes from QGIS ("I want to join
tables t1 and t2 on this column predicate"), the SQL query can be
deduced and some other metadata is also known, like the fact that t1 and
t2 may be from the very same database.
In that case the virtual layer may be created with some metadata saying
that this is in fact a join that must be optimized. And then the virtual
layer provider can ask the underlying provider (say postgis) if it can
process the whole join query.
I am not only referring to the current relations implementation here. I
don't think aggregate functions necessarily need to be based on creating
a relation definition in the project file (just like in a database you
also don't need a foreign key constraint to create a join).

But if I want to create a new layer or an ad-hoc iterator that joins
data or makes use of other advanced relational database features, it
would be nice if QGIS could forward such requests to the database. Say I
want to create a symbology based on the amount of waste within
countries, QGIS should be able to execute this request on the database.
In this scenario you could likely want to define this request inside an
expression (e.g. a virtual column).
So basically, I think we need functionality that can act as a database
abstraction layer: forward requests to the db wherever possible, and use
a (local) fallback implementation where that is not possible.
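The "forward where possible, fall back locally" pattern could look roughly like the following sketch. All class and method names here are invented for illustration; no such API exists in QGIS.

```python
# Hypothetical sketch of a database abstraction layer that pushes a
# request to the backend when the provider accepts it, and otherwise
# evaluates it with a local fallback implementation.

class PostgresProvider:
    def can_execute(self, sql):
        return True  # a real provider would inspect the query first

    def execute(self, sql):
        return "executed on the database: " + sql

class LocalFallback:
    def execute(self, sql):
        return "evaluated locally: " + sql

def run_request(provider, fallback, sql):
    """Forward to the backend if it accepts the query, else run locally."""
    if getattr(provider, "can_execute", lambda s: False)(sql):
        return provider.execute(sql)
    return fallback.execute(sql)

sql = ("SELECT country, sum(waste) FROM waste "
       "JOIN countries USING (code) GROUP BY country")
print(run_request(PostgresProvider(), LocalFallback(), sql))
```

A provider without a `can_execute` capability automatically falls through to the local path, which matches the idea of a well-tested fallback with optional backend fine-tuning.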

I am not opposed to using SQLite for this. There is no reason for us to
reinvent the wheel; the NIH syndrome should not be our driver. But I see
some limitations and I would prefer to have these properly
addressed. And I will be more than happy if you tell me that you
convinced the sqlite virtual table developers that they should extend
their API to expose the parsed SQL structure and allow tweaking it. Then
we have a fallback implementation which is well-tested but we still have
the possibility to fine-tune requests for big databases.

Regards,
Matthias
--
Help getting QGIS to the next level of quality before November 15!
http://blog.vitu.ch/10102014-1046/crowdfunding-initiative-automated-testing
Hugo Mercier
2014-10-22 12:11:37 UTC
Permalink
Post by Matthias Kuhn
Well, QgsExpression has been designed to mimic the SQL WHERE clause, so
yes I agree.
My point is that if you want to extend expression to support joins,
aggregates, window functions, etc. you end up recreating SQL, but
probably a less powerful one.
You can call it mimicking, or you can call it implementing its own
SQL (subset) dialect.
Sorry for my English, "mimicking" was not a negative term for me.
Post by Matthias Kuhn
Yes, you end up recreating SQL. How powerful it will be is hard to say
right now, right?
Sure. But if you want it to be as powerful as other SQL engines, lots of
work is needed (parsing, planning, optimizing). Work that has already
been done elsewhere, I think.
Post by Matthias Kuhn
So for joins, since the information comes from QGIS ("I want to join
tables t1 and t2 on this column predicate"), the SQL query can be
deduced and some other metadata is also known, like the fact that t1 and
t2 may be from the very same database.
In that case the virtual layer may be created with some metadata saying
that this is in fact a join that must be optimized. And then the virtual
layer provider can ask the underlying provider (say postgis) if it can
process the whole join query.
I am not only referring to the current relations implementation here. I
don't think aggregate functions necessarily need to be based on creating
a relation definition in the project file (just like in a database you
also don't need a foreign key constraint to create a join).
Not sure I understand. You mean there would be no need to create a
virtual layer just to execute an aggregate? Exactly.
But if the virtual layer concept is complete enough, then every single
layer can be seen as a special case of a (trivial) virtual layer, and
to compute an aggregate you won't need to create a new virtual layer.
Post by Matthias Kuhn
But if I want to create a new layer or an ad-hoc iterator that joins
data or makes use of other advanced relational database features, it
would be nice if QGIS could forward such requests to the database. Say I
want to create a symbology based on the amount of waste within
countries, QGIS should be able to execute this request on the database.
In this scenario you could likely want to define this request inside an
expression (e.g. a virtual column).
So basically, I think we need a functionality that is able to work as a
database abstraction layer. Forward wherever possible to the db, use a
(local) fallback implementation where not.
I agree, but can you be more specific about your example? How many
tables are involved? Which fields? What kind of joins?
Post by Matthias Kuhn
I am not opposed to using SQLite for this. There is no reason for us to
reinvent the wheel; the NIH syndrome should not be our driver. But I see
some limitations and I would prefer to have these properly
addressed. And I will be more than happy if you tell me that you
convinced the sqlite virtual table developers that they should extend
their API to expose the parsed SQL structure and allow tweaking it. Then
we have a fallback implementation which is well-tested but we still have
the possibility to fine-tune requests for big databases.
Splitting the original query into local and remote queries seems very
hard to me in the general case. It would require not only the parsed SQL
AST, but also some information about the execution plan. And the plan
may depend on statistics on the data. And even when you know the plan, I
think it is still hard to translate.

Generally speaking if the user tries to do a very complex query with
lots of different data sources, it is hard to guarantee good
performance, but I don't see it as a major problem. If you want
something complex AND fast, then use SQLite or PostGIS.

So, I would vote for optimizations that can be done for simple,
well-defined scenarios, like the "join" case given by the current join
or relation feature of QGIS: no nested queries, no group by, no order
by, no CTE, etc. That particular case could be detected when the join is
created, and optimizations can be applied, like creating a (temporary)
PostGIS layer with sql= set to the join query.
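As a sketch of that last idea, the following hypothetical helper builds the sql= query for such a temporary layer from a simple join definition (no nested queries, no GROUP BY/ORDER BY/CTE). All table and field names are invented for illustration; the real provider URI handling in QGIS is not shown.

```python
# Hypothetical sketch: derive the pushed-down SQL for a temporary
# PostGIS layer from a simple QGIS join definition between two tables
# living in the same database.

def join_to_sql(left, right, left_field, right_field):
    """Build the join query to hand to the backend as sql=..."""
    return ('SELECT {l}.*, {r}.* FROM {l} JOIN {r} '
            'ON {l}.{lf} = {r}.{rf}').format(
                l=left, r=right, lf=left_field, rf=right_field)

# Invented example tables and join fields:
sql = join_to_sql("countries", "waste_stats", "code", "country_code")
print(sql)
```

Because the join definition comes from QGIS itself, no SQL parsing is needed to detect this case; the query is constructed, not analyzed.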

I may be missing something. Are there other simple cases where the
optimization is well defined ?

To put it differently, I don't see the lack of knowledge about the
parsed SQL query as a blocker for the implementation of this virtual layer
concept. A general cross-database SQL optimizer would need lots of work
(in QGIS and/or SQLite) and may come later (if really needed).

The other option would be to say we don't need everything SQLite
provides, and we will never need it. So then we can define our own
subset of SQL and it would probably be easier to optimize. But I can't
see why we would want to limit the features :) it is usually the opposite.