Ever since I wrote my first MySQL query, nearly a decade ago, I have wanted nothing more, as far as web development goes, than to stop writing database queries in SQL by hand. I wouldn't even say that I dislike the idea of SQL; I just don't like writing queries. SQL is flexible and extremely powerful, but it's tedious. Queries read almost like English sentences, and despite the repetition of writing nearly identical structures many times, SQL syntax errors (magically) happen at a rate far higher than they do in my PHP or JavaScript source code, likely due to the extensive escaping and quoting required when writing queries by hand. Within a couple of years, I grew tremendously as a PHP developer and learned about the concept of database abstraction, but sadly it didn't solve my problem.
The Problem with Database Abstraction
Database abstraction seems like a great idea—it hides the details of the
database API from the user application. And, for this purpose, it is excellent.
Perhaps it was my naivety, but I expected more. I expected database abstraction
to abstract the syntax of SQL, not just hide the API for talking to
MySQL. During those first few years, I worked on several different projects,
and in almost every case, we used ad hoc solutions for hiding some of the
details of constructing queries, such as a select() method that accepts a WHERE
clause and fills in the rest of the query automatically. In the time since,
these types of solutions have been included in abstraction layers such as the
one provided by the Zend Framework, but they suffer from many problems.
Chief among these problems is incomplete SQL abstraction. This was a common
aspect of our ad hoc solutions in 2005, as we really just wanted to hide part
of the query syntax, but even modern solutions require the user to fill in some
portions of the SQL, although this is improving. The problem, however, is that
without complete SQL abstraction, the results aren't really portable.
The entire objective of abstracting the API is to provide backend portability without requiring significant changes, yet any hand-written SQL, even a fragment, is an opportunity to use non-portable features of the particular DBMS at hand. The other consequence of incomplete abstraction is that input escaping must be done by the consuming code. This is easy enough, but it requires extra code and creates extra opportunities for mistakes.
The next problem is that many solutions are verbose. Sometimes this is due to the language syntax—like PHP's array syntax—but the library structure is almost always a contributing factor. Granted, because we are doing abstraction, some overhead is not unreasonable, but when you combine the source-code overhead with incomplete abstraction, it raises a valid question about the utility: you are now writing more code than if you wrote SQL directly, and if you change backends, you will still need to inspect all of your code, and possibly make changes, despite using an adapter.
Finally, most abstraction layers do not provide a means for simplifying and standardizing the generation of frequently used queries or query fragments. They may provide abstraction and query generation, but the work must be repeated wherever a query is generated. You may ease the pain by defining additional functions that modify query objects to perform the repeated tasks or add groups of conditions to WHERE clauses, but this doesn't hook into the rest of the abstraction in a clean way. More recently, I've learned of an approach known as Object-Relational Mapping (ORM), which provides some of this capability by translating key/value maps into database objects and vice versa, but it appears to suffer from limitations in the complexity of the queries that can be created through the object-mapping approach.
In 2006/2007, I began developing, in PHP, my own solution designed specifically to address these problems. I've actually written about the approach before, on a long-gone blog, where I posted some proto-code that formed the core of what I called "database filtering" and that I used for about three years on all of my PHP work. The premise is simple: the user constructs a map, in the form of a
PHP array, which specifies a filter—that is, a set of matching conditions—for
obtaining data from the database, and then passes it to a function that decodes
the filter, escapes all of the input, formats it according to the database
field types, and creates a complete query to obtain the data. This combined
features of ORM with a more general query generation technique, because the
filters did not actually reflect the structure of the table: they reflected the
structure of an abstract data interface, which could be extended to include
conversion rules for many different types of objects. An important aspect of my
approach was automatically doing "the right thing" by converting array
parameters to IN() conditions, but leaving scalars as simple equality. It
greatly simplified my code because I no longer had to worry about formatting my
data every time that I hit the database. Effectively, it formed an abstraction
layer for data retrieval, freeing me from thinking about SQL entirely. In fact,
aside from writing the primitives used by the filtering code, I didn't write
any SQL in my application logic at all.
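The core idea can be pictured with a small standalone sketch in JavaScript (illustrative only; the escape and decodeFilter names are hypothetical and reflect neither my original PHP code nor db-filters' internals):

```javascript
// Hypothetical sketch of the filtering idea: a plain object of
// field/value pairs is decoded into an escaped WHERE clause.
function escape(value) {
  // Minimal escaping for illustration only; real code must use
  // the driver's escaping facilities (e.g. mysql.escape)
  if (typeof value === 'number') return String(value);
  return "'" + String(value).replace(/'/g, "\\'") + "'";
}

function decodeFilter(filter) {
  var terms = Object.keys(filter).map(function (field) {
    var value = filter[field];
    if (Array.isArray(value)) {
      // Arrays automatically become IN() conditions
      return '`' + field + '` IN (' + value.map(escape).join(', ') + ')';
    }
    // Scalars become simple equality tests
    return '`' + field + '` = ' + escape(value);
  });
  return terms.join(' AND ');
}

// decodeFilter({id : [1, 2, 3], name : 'greg'})
// -> "`id` IN (1, 2, 3) AND `name` = 'greg'"
```

The caller never decides between `=` and `IN()`; the decoder infers it from the shape of the data, which is exactly what made the approach pleasant to use.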
Now that I've made the switch from PHP to Node.js, the first thing I needed was
a suitable database abstraction layer. I decided to use the excellent
node-mysql package for handling the
connection to MySQL, and I decided to port my old database filtering library
from PHP to Node.js, with many improvements, and release it on github. If you
hadn't guessed, the last six paragraphs were just background and motivation for
my actual purpose here: to introduce db-filters for
Node.js, explain why it is a good
solution to these problems, and ideally motivate you, the reader, to use it for
all of your SQL needs, and then convince your friends to use it as well. I mentioned the Zend Framework earlier, but since I am developing a Node.js solution, it is more useful to compare against Node.js Database Drivers, a database abstraction layer that provides query-generating functionality for MySQL and Drizzle (which is really just MySQL as well, but the use of different libraries makes it a good example of how current solutions are effective for API abstraction, but not language abstraction).
How I'd Like to Solve the Problem
The primary goals of db-filters are specifically to address the three
problems outlined above, with many eventual stretch-type goals that may or may
not be feasible. I want to provide a solution that has a complete abstraction
of SQL, is simple to use, has a succinct yet flexible API, provides appropriate
parameter escaping without programmer action, and supports pseudo-fields (I
call them special keys) in table definitions that can be used to implement
frequently repeated object property decoding in a central location with proper
hooks into the SQL abstraction and filter decoding process. Additionally,
db-filters provides all of these capabilities to INSERT, UPDATE, and DELETE
queries as well, which is uncommon in other query-building abstraction layers.
Importantly, db-filters is not an API abstraction, although it effectively acts as one, since it wraps around an underlying driver for communication. You use it with a (presumably simpler) library that provides a means of sending queries and retrieving results. With that in mind, let's take a look at some
example SQL and see how to generate it using a filter. All of the examples
assume that appropriate filter definitions exist for the tables being used.
Normally, filter definitions are initially created using the generation script
included with db-filters, and then any necessary special handlers are
implemented manually, so I will not include the definitions themselves.
We'll start with a pair of basic queries to illustrate the simplicity of the
API, as well as the inferred formatting. We have a table of users, and we wish
to retrieve all of the information for a specific user, and then we wish to
retrieve information for a group of users, using two queries:
SELECT * FROM `users` WHERE `id` = 1
SELECT * FROM `users` WHERE `id` IN (1, 2, 3, 4, 5)
Side note: all of the "expected SQL" I am providing is the exact output of the filtering code given afterwards, except that I use the buildQuery() method to obtain the SQL without sending it to the server. I provide db-mysql comparison code in a few places, but the installation actually failed on my system, so I cannot guarantee that my comparison code is 100% correct; it is based on the (limited) documentation available on their website, but it should be a good approximation of how much code must be written, including how much SQL is written by hand. Assuming that we have a filter for the users table stored in the users variable, the code would look like this:
// Success and failure are callback functions invoked asynchronously with the results
// they forward the results of mysql.query exactly as given
users.select({id : 1}).exec(success, failure)
users.select({id : [1, 2, 3, 4, 5]}).exec(success, failure)
// Comparison code snippet using Node.js Database Drivers' db-mysql
conn.query().select('*')
.from('users')
.where('`id` = ?', [1])
.execute(callback);
this.query().select('*')
.from('users')
.where('`id` IN ?', [[1, 2, 3, 4, 5]])
.execute(callback)
At this point, you should probably not be impressed, but it should already be apparent that db-filters provides a more effective way of creating queries. db-mysql requires an extra method call, query(); requires specifying the * field manually, which is SQL-specific and could instead be inferred from the absence of other field definitions; and requires partial SQL in the call to where(). Although it doesn't matter in this example, requiring you to write out the contents of the WHERE clause defeats the point of abstracting the backend, because you will be tempted to use MySQL extensions, and suddenly your code isn't portable anyway. With the filtering approach, you write no SQL, you write less code, you write simpler code, and the result is completely portable to any database engine supported by the library. With that in mind, let's look at a more complicated query, to see whether these advantages hold up. This time I need a query that retrieves all of the posts on a specific page in a thread, along with the user information for the authors of those posts, ideally using a join. First, the SQL we expect to see:
SELECT `p`.*, `u`.* FROM posts AS p LEFT JOIN users AS u ON `p`.`userId` = `u`.`id` WHERE `p`.`threadId` = 1 ORDER BY `p`.`posted` ASC LIMIT 60, 20
This is generated with another very straightforward Node.js snippet. It is important to note that there are no SQL literals or directly-embedded strings in this code, only function calls, table/field names, and parameters. Comparable solutions often require ordering clauses such as posted ASC or ON clauses such as `p`.`userId` = `u`.`id`, which seem fairly innocuous but still serve to break the abstraction by requiring hand-written pieces of SQL.
// Using db-filters with users and posts as predefined filters
var page = 3;
var tId = 1;
var postsPerPage = 20;
posts.select({threadId : tId}, 'p')
.order(db.$asc('posted'))
.limit(page*postsPerPage, postsPerPage)
.left_join(users, 'u')
.on(['userId', 'id'])
.exec(success, failure);
// What db-mysql code might look like
this.query().select(['`p`.*', '`u`.*'])
.where('`p`.`threadId` = ?', [tId])
.join({type : 'LEFT', table : 'users', alias : 'u', conditions : '`p`.`userId` = `u`.`id`'})
.order({'`p`.`posted`' : true})
.limit(page*postsPerPage, postsPerPage)
.execute(callback)
This is where the difference really becomes apparent. db-mysql does not know how to combine fields with tables, so all table names/aliases must be prepended to field names manually wherever they appear in the query. The syntax is also very verbose, requiring 33% more typing despite a weaker abstraction. By contrast, the db-filters code is much more readable, contains exactly zero backticks (`), and can in principle emit a valid query for any RDBMS, because once again it contains no keywords or operators. These differences only become more significant as we consider different situations. Let's look at one last SELECT query, which will demonstrate how function abstraction and special fields work in a filter-based approach:
SELECT * FROM users WHERE `name` = 'greg' AND `password` = MD5(CONCAT(`salt`, 'pass')) LIMIT 1
This might be part of a simple salted, hashed login system. It isn't exactly a robust login solution (use bcrypt instead, if you're building one at home), but I'm not presenting a debate; I'm demonstrating how to use the library, and this was the easiest demonstration I could come up with for calling SQL functions.
var name = 'greg';
var pass = 'pass';
users.select({name : name, password : db.$md5(db.$concat(db.$field('salt'), pass))})
.limit(1)
.exec(success, failure);
// And again in db-mysql
this.query().select('*')
.from('users')
.where('`name` = ? AND `password` = MD5(CONCAT(`salt`, ?))', [name, pass])
.limit(1)
.execute(callback);
Using db-mysql, we are forced to write out the entire body of the WHERE clause ourselves, including directly calling MySQL functions. If we had been targeting PostgreSQL instead, string concatenation would be written using ||, which I believe is not valid MySQL syntax. Or, if we were using SHA1 hashing instead of MD5, our solution would not be portable to PostgreSQL, which does not provide a SHA1() function, although its digest() function can be used for SHA1 hashing. But by going through the abstraction layer with db.$md5, db.$sha1, or db.$concat, the query-building engine that implements these functions for each targeted database can produce the appropriate syntax, emitting SHA1('text') on MySQL but digest('text', 'sha1') on PostgreSQL.
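One way to picture how a query-building engine can emit different syntax per backend is a table of per-engine renderers. This is purely an illustrative sketch (the engines and renderSha1 names are hypothetical, not db-filters internals):

```javascript
// Hypothetical sketch: each target engine supplies its own renderer
// for each abstract function, so an abstract sha1() or concat() call
// can emit different SQL depending on the backend.
var engines = {
  mysql : {
    sha1   : function (arg)  { return 'SHA1(' + arg + ')'; },
    concat : function (args) { return 'CONCAT(' + args.join(', ') + ')'; }
  },
  postgresql : {
    sha1   : function (arg)  { return "digest(" + arg + ", 'sha1')"; },
    concat : function (args) { return args.join(' || '); }
  }
};

function renderSha1(engine, arg) {
  return engines[engine].sha1(arg);
}

// renderSha1('mysql', "'text'")      -> "SHA1('text')"
// renderSha1('postgresql', "'text'") -> "digest('text', 'sha1')"
```

The application code only ever names the abstract operation; which spelling ends up in the query is the engine table's decision.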
You may find yourself saying "well, it's nice that you've abstracted the
entire function list, but typing out all of that db.$md5 and db.$concat stuff
is quite tedious in its own right, so I'm not sure I am willing to do that
everywhere that I want to include code like this." That is where the final key
feature of db-filters becomes relevant: special fields. We can update our
filter definition to include a map of special handlers in addition to column
names by supplying a third argument to the constructor:
// Earlier column definitions omitted
var columns = { ... };
var special = {
    'salt_password' : function(key, value, terms, options) {
        value = db.$eq(db.$md5(db.$concat(db.$field('salt'), value)));
        terms.push(value.get('password', this, options));
    }
};
var users = new db('users', columns, special);
// Now, use the salt_password special key to invoke the handler, which can automatically apply rules
users.select({name : name, salt_password : pass})
.limit(1)
.exec(success, failure);
This generates identical SQL to the "long method," but it simplifies the calling code. The special handler is defined once, using somewhat uglier syntax, but it can then be reused many times, providing consistency in the queries that use that condition, without sacrificing any abstraction. This may seem contrived, as password logins generally happen in one place, but if you require a user to supply a password to change his settings as well as when logging in, then you can define your password transformation once and use it in both places. By pushing terms onto the array that is used to create WHERE clauses, special handlers can emit zero or more terms, which allows them to perform ORM-style decoding as well, or any kind of complicated parameter preparation.
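To make that mechanism concrete outside of db-filters, here is a standalone sketch of the idea (all names hypothetical, escaping omitted for brevity): special keys are consulted before plain columns, and each handler decides how many terms to emit.

```javascript
// Hypothetical sketch: special keys are looked up before plain
// columns, and each handler may push zero or more WHERE terms.
function decodeWithSpecials(filter, special) {
  var terms = [];
  Object.keys(filter).forEach(function (key) {
    if (special[key]) {
      special[key](filter[key], terms); // handler controls what is emitted
    } else {
      terms.push('`' + key + '` = ' + filter[key]);
    }
  });
  return terms.join(' AND ');
}

var special = {
  // ORM-style decoding: one 'user' object becomes multiple terms
  user : function (value, terms) {
    terms.push('`userId` = ' + value.id);
    if (value.active !== undefined) {
      terms.push('`active` = ' + (value.active ? 1 : 0));
    }
  }
};

// decodeWithSpecials({threadId : 1, user : {id : 5, active : true}}, special)
// -> "`threadId` = 1 AND `userId` = 5 AND `active` = 1"
```

Because a handler receives the terms array itself, it can contribute nothing at all, a single condition, or a whole group of conditions, which is what makes the hook general.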
My last two examples will be to demonstrate how db-filters handles INSERT
and UPDATE queries, along with a comparison to db-mysql. In contrast to other
libraries, db-filters uses the same filter decoding logic to specify parameters
for insert and update queries (and delete), allowing you to leverage all of the
function abstraction and the use of special key handlers in these queries as
well, automatically.
-- Create a new post in our thread
INSERT INTO posts SET `threadId` = 1, `userId` = 1, `posted` = NOW(), `post` = 'Sample post'
-- Update the user's post count
UPDATE users SET `postCount` = `postCount` + 1 WHERE `id` = 1
The code also introduces a useful feature of the helper functions: if one of a function's arguments is missing, it is assumed to be the column that keys the value. This compresses the syntax for the most common situation, computing a field as a function of itself.
var user = {id : 1};
var post_text = 'Sample post';
// With db-filters
posts.insert({threadId : tId, userId : user.id, posted : db.$now(), post : post_text})
.exec(success, failure);
users.update({postCount : db.$add(1)}, {id : user.id})
.exec(success, failure);
// With db-mysql
this.query().insert('posts',
['threadId', 'userId', 'posted', 'post'],
[tId, user.id, {value : 'NOW', escape : false}, post_text])
.execute(callback);
this.query().update('users')
.set({postCount : {value : '`postCount` + 1', escape : false}})
.where('id = ?', [user.id])
.execute(callback);
By now I sound like a broken record, but again, with the approach used in db-filters, SQL is completely abstracted, whereas db-mysql requires writing raw SQL, in addition to the tedious syntax used to mark literals that should not be escaped. The db-filters code could be improved further by adding a special key for 'user' that understands how to decode the id property from the user object, which would simplify the database "client" code that creates and updates posts.
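The argument-defaulting behavior behind db.$add can be sketched standalone (hypothetical $add, not the library's actual implementation):

```javascript
// Hypothetical sketch: an abstract function value remembers its
// arguments and, when rendered against the keying column, substitutes
// that column for any missing argument.
function $add(a, b) {
  return {
    render : function (column) {
      // With one argument, the keyed column fills the missing slot,
      // so {postCount : $add(1)} means postCount = postCount + 1
      var left  = (b === undefined) ? '`' + column + '`' : a;
      var right = (b === undefined) ? a : b;
      return '`' + column + '` = ' + left + ' + ' + right;
    }
  };
}

// $add(1).render('postCount') -> "`postCount` = `postCount` + 1"
```

Deferring rendering until the column name is known is what lets the shorthand stay terse at the call site.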
Final Thoughts and the Future
One thing I want to note is that it may seem like I'm panning db-mysql pretty hard with all of these comparisons, but that isn't really my goal.
Realistically, the purpose of db-mysql is to provide an abstraction layer for
the API to interact with the database, not to generate portable queries, and it
seems that it handles that task well. But, it is always necessary to draw a
comparison when advocating a different way to do things, and so it helps that
db-mysql also attempts to provide a query abstraction interface, something that
I don't think it should be doing (and something that node-mysql doesn't do). I
think separating the two tasks makes a lot of sense: one library to perform low
level communication, and another library to perform query generation, allowing
either to be replaced if a better alternative is developed.
Next, I'd like to briefly address performance. It might seem that a powerful abstraction layer would sacrifice a lot of performance, but a quick benchmark, running the example code in this post 10,000 times to generate 70,000 queries in total, took only about 3330 ms on average on my Amazon EC2 micro instance. That works out to less than 50 µs per query, which is far less than the time you will spend waiting for and interacting with the database afterwards. The performance thus seems "acceptable" to me, though I do think it could be faster. This won't be improved until some undetermined time in the future, as there are more important things to work on.
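The per-query figure follows directly from the benchmark numbers:

```javascript
// Sanity-checking the arithmetic: 70,000 queries in ~3330 ms
var totalMs = 3330;
var queries = 70000;
var perQueryMicroseconds = (totalMs / queries) * 1000; // ~47.6 microseconds
```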
Finally, db-filters still has a long way to go. At this point, it is possible to create a wide variety of queries with complex relationships, perform multi-way joins, and use most of the built-in MySQL functions, all without writing any actual SQL. But there are gaps in the function list, mostly in the date/time area, as these functions can be used in the most diverse ways, and there are limits to the structures that can be generated in WHERE clauses. For instance, you cannot yet use a WHERE clause to specify relationships between fields on two tables in a join—relationships can only appear in the ON parameter, which, in the current implementation, restricts them to equality. Perhaps most importantly, db-filters
currently only targets MySQL. In theory, the abstraction presented should allow
it to be used with any RDBMS, but I am currently only developing the one
translation engine, because I only use MySQL, so support for other engines is
limited to those that are compatible with MySQL's extensions to SQL. I'd really
like to see it ported to support a NoSQL solution with the same interface, but
that would be a tremendous undertaking and will definitely wait until the core
API and features are frozen.
If you've made it this far, I hope I've convinced you that db-filters is the
best solution for database interaction. It is a very specific solution that
attacks only the problem of query generation, and it is specifically designed
to address all of the problems that I've identified with existing solutions.
The project is available on github as
well as through npm. More complete API documentation, along with a function reference, is in the project README file, which I will (thankfully) omit here—I'm not trying to explain the details of using the library, only to demonstrate its capabilities and advantages. Please post any bugs, feature requests, missing functions, or other suggestions on the github page.