Feature #4012

Replacing TCP/IP (AF_INET) sockets with Unix Domain Sockets (AF_UNIX)

Added by Greg Shah about 5 years ago. Updated over 4 years ago.

Status: Review
Priority: Normal
Assignee: -
Target version: -
Start date:
Due date:
% Done: 100%
billable: No
vendor_id: GCD
version:
build.gradle.diff (577 Bytes) Hynek Cihlar, 03/26/2019 04:06 PM

directory.xml.diff (1.68 KB) Hynek Cihlar, 03/26/2019 04:07 PM


Related issues

Related to Database - Bug #4071: FWD server stops allowing new web client logins after abend New

History

#1 Updated by Greg Shah about 5 years ago

Copied from #3992, from Eric:

This may be worth a look: https://impossibl.github.io/pgjdbc-ng/

The guy behind this is Kevin Wooten, whom I've seen on the PostgreSQL mailing list for a long time. I recall he announced this a few years ago, and I had since forgotten about it. I didn't realize it supported Unix Domain Sockets until I googled around for alternative JDBC drivers tonight.

I would expect UDS would have less overhead than TCP, so it may be worth a try. I tried briefly getting it going tonight, but I didn't get the dependencies figured out and at this point, I'm too tired. It may need some config in pg_hba.conf. Mine is wide open for peer connections, but that's probably not right. Of course, directory.xml would at minimum need changes to the JDBC url and driver_class settings. Looks like there's a Netty dependency as well.
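
For context, a hedged sketch of what those directory.xml changes amount to at the JDBC level: pgjdbc-ng registers the driver class com.impossibl.postgres.jdbc.PGDriver and uses the "jdbc:pgsql:" URL scheme, in place of the stock org.postgresql.Driver / "jdbc:postgresql:" pair. The database name and credentials below are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class PgJdbcNgSmokeTest
    {
       public static void main(String[] args) throws Exception
       {
          // pgjdbc-ng ships its own driver class and URL scheme.
          Class.forName("com.impossibl.postgres.jdbc.PGDriver");
          try (Connection conn = DriverManager.getConnection(
                  "jdbc:pgsql://localhost:5432/fwd_db", "fwd_user", "secret"))
          {
             System.out.println("connected: " + !conn.isClosed());
          }
       }
    }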

Hynek or Constantin, could one of you give it a shot? I'm curious to know if the UDS implementation gives us any advantage. Since all the small queries in the converted code make FWD so chatty with the database, any improvement in throughput with the database should be noticeable.

#2 Updated by Greg Shah about 5 years ago

Hynek's results, copied from #3992:

Hynek Cihlar wrote:

Eric Faulhaber wrote:

Hynek or Constantin, could one of you give it a shot? I'm curious to know if the UDS implementation gives us any advantage. Since all the small queries in the converted code make FWD so chatty with the database, any improvement in throughput with the database should be noticeable.

I can look at it.

I integrated the driver in the project, I can run the app, but shortly after login, I get the following error when executing the query DECLARE cursor1c BINARY SCROLL CURSOR FOR select nextval('p2j_id_generator_sequence') from generate_series($1, $2).

code = "42725" 
column = null
constraint = null
datatype = null
detail = null
file = "parse_func.c" 
hint = "Could not choose a best candidate function. You might need to add explicit type casts." 
line = "506" 
message = "function generate_series(unknown, unknown) is not unique" 
position = "92" 
routine = "ParseFuncOrColumn" 
schema = null
severity = "ERROR" 
table = null
where = null

The corresponding source (in com.impossibl.postgres.jdbc.PGPreparedStatement.describeIfNeeded):

    StatementDescription cachedDescription = connection.getCachedStatementDescription(sqlText, () -> {

      PrepareResult result = connection.execute(timeout -> {
        PrepareResult handler = new PrepareResult();
        connection.getRequestExecutor().prepare(null, sqlText, EMPTY_TYPES, handler);
        handler.await(timeout, MILLISECONDS);
        return handler;
      });

      return new StatementDescription(result.getDescribedParameterTypes(connection), result.getDescribedResultFields());
    });

Impossibl doesn't pass the query parameter types when preparing the query, which causes the ambiguity error. I think this is a show stopper for us.
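
To illustrate the server's hint, one workaround of this kind (hypothetical, and not something adopted in FWD) would be to make the parameter types explicit in the SQL text, so the Parse step can resolve the generate_series overload even without parameter type information:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class ExplicitCastSketch
    {
       // Hedged sketch: the ::integer casts let the server pick a single
       // generate_series overload even when the driver sends no parameter
       // types during Parse. The sequence name is taken from the error above.
       static void fetchIds(Connection conn, int count) throws Exception
       {
          String sql = "select nextval('p2j_id_generator_sequence') "
                     + "from generate_series(?::integer, ?::integer)";
          try (PreparedStatement ps = conn.prepareStatement(sql))
          {
             ps.setInt(1, 1);
             ps.setInt(2, count);
             try (ResultSet rs = ps.executeQuery())
             {
                while (rs.next())
                {
                   System.out.println(rs.getLong(1));
                }
             }
          }
       }
    }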

Besides this, there is another issue: Impossibl will not handle our custom locale en_US@P2J_basic, because it parses the locale name incorrectly. As a workaround, to get further, I changed the cluster locale.

#3 Updated by Eric Faulhaber about 5 years ago

Hynek, what did your pg_hba.conf file look like to get as far as you did for this attempt?

#4 Updated by Hynek Cihlar about 5 years ago

Eric Faulhaber wrote:

Hynek, what did your pg_hba.conf file look like to get as far as you did for this attempt?

1. In pg_hba.conf I changed the line
local all all peer
to
local all all md5

2. I set the locale in postgresql.conf to en_US.utf8
3. I modified build.gradle and directory.xml (diffs attached)

#5 Updated by Greg Shah about 5 years ago

Early implementations of the Progress Database Server implemented connectivity using shared memory. This design, combined with the fact that at its core the Progress database is an index engine, meant that the 4GL approach of record-by-record retrieval performed well. It wasn't until much later that the 4GL provided set-oriented features (e.g. OPEN QUERY). Even FOR EACH, which seems like a set-oriented operation, is no such thing. FOR EACH accesses records one at a time, just like a looping FIND NEXT. In fact, if you edit the returned records in the same fields that are driving the index walk, you can cause an infinite loop. There is no result set in a FOR EACH.

The problem here is that most 4GL code (and most 4GL developers) already existed long before the set-oriented features arrived. Thus most 4GL code and developers are deeply coupled to record-oriented retrieval. On a positive note, shared memory is the fastest possible transport for connecting to a database that lives in a separate process space. This means the database performance of this kind of code is good.

Converting to Java and a SQL-based RDBMS such as PostgreSQL means that we have really robust tools for set-oriented data access. But it also means that we must do extra work to catch up with record-oriented retrieval over shared memory. Our persistence team has done some really remarkable work in this area. For example, the 4GL client evaluates WHERE clauses, and any BY clause that doesn't naturally match the sorting of the index being walked. But in FWD, both the filtering and the sorting can almost always be pushed over to the database. This greatly reduces the number of round trips to the database and allows the sorting to be done by the component of the system best suited to it.

Even with these improvements, it is still true that we generate many more round trips to the database than would be needed if the converted 4GL code were set-oriented. Although we still want to work on ways to reduce those round trips, it is clear that if the cost of each round trip can be greatly reduced, then performance will improve massively.

This Progress database advantage only exists to the extent that the 4GL code is running on the same system as the database. In fact, Progress's TCP/IP performance is quite poor (maybe it will get better in OE12 with their multi-threaded database server), so this is an inherent limit on scaling 4GL systems horizontally. Since running on the same system is the most common use case, it is important that we handle it well.

Unix Domain Sockets are significantly faster than TCP/IP sockets, with the caveat that they only connect processes on the same machine.

Although it appears we cannot use the Impossibl project (couldn't resist, sorry :)), the idea of using unix domain sockets is a very good one. In fact, I'm disappointed with myself for not investigating this sooner. I've been in the PostgreSQL config many times before and it clearly supports unix domain sockets as a transport. Better late than never.

https://stackoverflow.com/questions/14973942/tcp-loopback-connection-vs-unix-domain-socket-performance

This link provides some references for understanding the performance difference between TCP/IP sockets and Unix Domain Sockets. The short answer is that Unix Domain Sockets deliver somewhere between 2x and 7x the performance of TCP/IP sockets.

What is a Unix Domain Socket? The core idea is that it is a local IPC (interprocess communication) mechanism which provides the same API as TCP/IP sockets but which does not require or utilize any of the networking layers of the OS. It provides full bidirectional communication between two processes on the same system. Since it is a streaming IPC mechanism, it works just like TCP/IP sockets in that there is a connection and the data is reliably delivered in a known order. TCP/IP must provide this same capability over a network. This means there is a great deal of extra processing to read/write headers (multiple layers: 802.2 Ethernet + IP + TCP) and to implement addressing, routing, flow control, packet sequencing, resending of lost packets and so forth. And that is just the extra processing in the network stack of the OS. There is also the actual transmission of the data over the network, possibly through multiple routers/gateways, and then all the network stack processing on the remote side of the socket. Even when using localhost, that only eliminates the network transmission and maybe a bit of the network stack; TCP/IP sockets still incur most of the expensive network stack processing on both sides of the localhost connection. Unix Domain Sockets have none of that work.
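
To make the "same API" point concrete, below is a minimal sketch of a Java client connecting over AF_UNIX using the junixsocket library linked further down. The socket path is the conventional Debian/Ubuntu location for the PostgreSQL socket file and is an assumption; once connected, the object behaves like any java.net.Socket:

    import java.io.File;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.newsclub.net.unix.AFUNIXSocket;
    import org.newsclub.net.unix.AFUNIXSocketAddress;

    public class UdsClientSketch
    {
       public static void main(String[] args) throws Exception
       {
          // Conventional socket file location; the suffix is the cluster port.
          File path = new File("/var/run/postgresql/.s.PGSQL.5432");
          try (AFUNIXSocket sock = AFUNIXSocket.newInstance())
          {
             sock.connect(new AFUNIXSocketAddress(path));
             // The same connection-oriented, reliable, ordered stream API as
             // a TCP/IP socket -- with none of the network stack processing.
             OutputStream out = sock.getOutputStream();
             InputStream in = sock.getInputStream();
             // From here a client would speak the PostgreSQL wire protocol;
             // that part is omitted, since the JDBC driver does it for us.
          }
       }
    }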

To understand how to write C code for Unix Domain Sockets, read these:

https://troydhanson.github.io/network/Unix_domain_sockets.html
https://stackoverflow.com/questions/14919441/principle-of-unix-domain-socket-how-does-it-works
https://blog.myhro.info/2017/01/how-fast-are-unix-domain-sockets

If you want to understand the kernel implementation:

https://github.com/torvalds/linux/blob/master/net/unix/af_unix.c

You'll also need to understand SCMs (socket control messages), SKBs (socket buffers) and how SKBs handle data.

The bottom line is that the implementation of Unix Domain Sockets is quite small (1 kernel file) and can be read in about 30 minutes. There is some non-trivial processing, especially in the setup of the socket, but this is nothing compared to the work done in the network layer. The core idea here is that the application code (in user space) writes data to the AF_UNIX socket, which is a down-call into kernel space. The data is copied out of the user buffer and into an SKB (kernel socket buffer). It is enqueued in the socket using a linked list, and I believe it is read out of the other end of the AF_UNIX socket using that same SKB. The reading process calls down to the kernel to get the next packet; when one arrives, it is copied from the SKB into the user space buffer and the read returns.

PostgreSQL uses a streaming protocol. It is well designed for socket usage; it is not as well designed for shared memory. But the sockets API is the same for both TCP/IP sockets and Unix Domain Sockets, so it should not be a surprise that both transports are supported in PostgreSQL.

The problem here is that J2SE doesn't provide built-in support for Unix Domain Sockets, only TCP/IP sockets. Thus, the standard PostgreSQL JDBC Driver by default only supports TCP/IP sockets.

Interestingly, there are some implementations of AF_UNIX support in Java:

https://stackoverflow.com/questions/170600/unix-socket-implementation-for-java
https://netty.io/
https://github.com/fiken/junixsocket

This last one has a PostgresqlAFUNIXSocketFactory that can be used with the PostgreSQL JDBC driver. This seems like a good place to start.
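
For reference, a hedged sketch of this wiring in plain JDBC terms, using the stock PostgreSQL driver's socketFactory and socketFactoryArg connection properties (the same two settings configured via directory.xml in note 8 below; path, database name and credentials are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    public class UdsJdbcSketch
    {
       public static void main(String[] args) throws Exception
       {
          Properties props = new Properties();
          props.setProperty("user", "fwd_user");     // placeholder
          props.setProperty("password", "secret");   // placeholder
          // Ask the stock driver to create its sockets through junixsocket's
          // factory; the single string argument is the socket file path.
          props.setProperty("socketFactory",
             "org.newsclub.net.unix.socketfactory.PostgresqlAFUNIXSocketFactory");
          props.setProperty("socketFactoryArg",
             "/var/run/postgresql/.s.PGSQL.5432");
          // The host in the URL is not used for the connection once a custom
          // socket factory is in play, but the URL must still parse.
          try (Connection conn = DriverManager.getConnection(
                  "jdbc:postgresql://localhost/fwd_db", props))
          {
             System.out.println("connected over AF_UNIX: " + !conn.isClosed());
          }
       }
    }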

#6 Updated by Greg Shah about 5 years ago

Using shared memory, we could eliminate the 4 kernel space transitions (write() goes down into the kernel and returns; read() on the other side goes down into the kernel and returns) and the extra 2 copies into and out of the SKB. This would be a significant improvement in performance over AF_UNIX sockets. But we would have to implement our own "protocol" and synchronization facilities. We could use futexes, which are a (mostly) user-space synchronization primitive. Still, there would be some real work in mapping the streaming protocol of PostgreSQL to a shared memory approach (see https://www.joshbialkowski.com/posts/2018/linux_zero_copy/linux-zero-copy.html for some useful points). AND we would have to modify the PostgreSQL server itself to add this new transport. AF_UNIX sockets are already supported in the PostgreSQL server, which is a massive advantage. We may come back to shared memory in the future, but for now AF_UNIX is a very good start. There are some useful comments here:

https://news.ycombinator.com/item?id=248244

By the way, the person there arguing that AF_UNIX sockets are no faster than TCP/IP is simply incorrect.
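
On the shared memory idea, here is a toy sketch of the mechanism under discussion, using a memory-mapped file under /dev/shm (tmpfs) as the shared segment. Everything here is hypothetical: a real transport would need proper cross-process synchronization (e.g. futexes via native code) instead of the demonstration-only flag byte used to publish the message:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;

    public class ShmWriterToy
    {
       public static void main(String[] args) throws Exception
       {
          // /dev/shm is tmpfs on Linux, so the mapping is purely in memory.
          try (RandomAccessFile f = new RandomAccessFile("/dev/shm/fwd-demo", "rw");
               FileChannel ch = f.getChannel())
          {
             MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
             byte[] msg = "hello".getBytes(StandardCharsets.UTF_8);
             buf.position(1);          // byte 0 is reserved as the "ready" flag
             buf.putInt(msg.length);
             buf.put(msg);
             // Publish last. A reader mapping the same file spins on this byte,
             // then reads length + payload. After setup there is no kernel
             // transition per message, which is the advantage over AF_UNIX.
             buf.put(0, (byte) 1);
          }
       }
    }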

#7 Updated by Greg Shah about 5 years ago

  • Parent task deleted (#4011)

#8 Updated by Eric Faulhaber about 5 years ago

Greg Shah wrote:

https://github.com/fiken/junixsocket

This last one has a PostgresqlAFUNIXSocketFactory that can be used with the PostgreSQL JDBC driver. This seems like a good place to start.

I implemented this in 3750b rev 11552 and tested it with a customer application doing database-intensive, real work. I could not see any naked-eye improvement in performance across a wide range of repeated workflows. In fact, one workflow was consistently slower, by up to ~30%, with the UDS implementation enabled.

However, I don't think this was a causal relationship, and profiling confirmed this. Before profiling, in both cases I warmed up the JVM and database with 4 passes of the same workflow without profiling, until the time spent had stabilized. Then I did one pass each with trace-level profiling.

The results did not show any meaningful overall time difference between the two scenarios. They further showed that with almost exactly the same number of low-level calls to read from the database (132 w/ UDS vs. 130 w/ TCP/IP), doing the same work, the UDS implementation was nearly an order of magnitude faster than TCP/IP. This ignores the time spent by SSL decrypting data in the TCP/IP case, which, BTW, I didn't even realize I was using -- it must be the default in the JDBC driver.

While the UDS work was faster, we're talking about several dozen milliseconds of difference (in tracing profile time, which is 2 orders of magnitude slower than real time) out of an overall 645K milliseconds (in profile time) for the workflow. The difference represents an exceptionally small portion of the overall workflow, so it is not noticeable to the user and was in fact overwhelmed by other noise in "normal" use.

Perhaps there are other use cases where this difference is more meaningful.

To enable this, add the following entries to directory.xml, inside the hibernate/connection section for the target PostgreSQL database (directly below the connection URL is a good place):

                <node class="string" name="socketFactory">
                  <node-attribute name="value" value="org.newsclub.net.unix.socketfactory.PostgresqlAFUNIXSocketFactory"/>
                </node>
                <node class="string" name="socketFactoryArg">
                  <node-attribute name="value" value="/var/run/postgresql/.s.PGSQL.5433"/>
                </node>

The "5433" is the port on which my FWD cluster runs. The default PostgreSQL port is 5432. So, the number you use here depends on your cluster installation.

No other change is needed.

#9 Updated by Greg Shah about 5 years ago

No other change is needed.

Don't we also need to add code (jar(s) and a native library) to the system to make this work?

I also wonder if this works on Windows. I would expect it to work.

In a system with more TCP/IP contention (e.g. a system being used by multiple web users), this optimization may make more of a difference. I also wonder if it makes more of a difference in cases where a large amount of data is being transferred. Perhaps the customer's use cases don't have any queries that bring back large data?

#11 Updated by Greg Shah about 5 years ago

Don't we also need to add code (jar(s) and a native library) to the system to make this work?

I see you've made changes to build.gradle in 3750b rev 11552.

What about PGCONF changes? Are AF_UNIX sockets enabled by default?

#12 Updated by Eric Faulhaber about 5 years ago

Greg Shah wrote:

What about PGCONF changes? Are AF_UNIX sockets enabled by default?

They are, at least in the Ubuntu install. I'm not sure about other platforms. I did need to make one change in pg_hba.conf, which was to change the line

# "local" is for Unix domain socket connections only
local   all             all                                     peer

to

# "local" is for Unix domain socket connections only
local   all             all                                     md5

One can further tighten security by changing the first all to a specific database name and the second all to a specific database role.
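
For example, with hypothetical database and role names, the tightened line might read:

local   fwddb           fwdrole                                 md5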

#13 Updated by Eric Faulhaber about 5 years ago

  • Status changed from New to Review
  • % Done changed from 0 to 100

Greg, I think we can close this.

#14 Updated by Eric Faulhaber over 4 years ago

  • Related to Bug #4071: FWD server stops allowing new web client logins after abend added

#15 Updated by Eric Faulhaber over 4 years ago

I think #4071 is caused by the use of AF_UNIX sockets, at least with the implementation noted above. This is a best guess, based on the timing of when we started using AF_UNIX sockets and when #4071 started being a problem, and the fact that I haven't had the #4071 problem since disabling the AF_UNIX configuration. However, I have not definitively tracked the root cause of #4071 to this change.
