Python: Fix header field values character encoding.

On GitHub, @RomainMou reported an issue whereby HTTP header field values where being incorrectly reported as non-ascii by the Python .isacii() method. For example, using the following test application def application(environ, start_response): t = environ['HTTP_ASCIITEST'] t = "'" + t + "'" + " (" + str(len(t)) + ")" if t.isascii(): t = t + " [ascii]" else: t = t + " [non-ascii]" resp = t + "\n\n" start_response("200 OK", [("Content-Type", "text/plain")]) return (bytes(resp, 'latin1')) You would see the following $ curl -H "ASCIITEST: $" http://localhost:8080/ '$' (1) [non-ascii] '$' has an ASCII code of 0x24 (36). The initial idea was to adjust the second parameter to the PyUnicode_New() call from 255 to 127. This unfortunately had the opposite effect. $ curl -H "ASCIITEST: $" http://localhost:8080/ '$' (1) [ascii] Good. However... $ curl -H "ASCIITEST: £" http://localhost:8080/ '£' (2) [ascii] Not good. Let's take a closer look at this. '£' is not in basic ASCII, but is in extended ASCII with a value of 0xA3 (163). Its UTF-8 encoding is 0xC2 0xA3, hence the length of 2 bytes above. $ strace -s 256 -e sendto,recvfrom curl -H "ASCIITEST: £" http://localhost:8080/ sendto(5, "GET / HTTP/1.1\r\nHost: localhost:8080\r\nUser-Agent: curl/8.0.1\r\nAccept: */*\r\nASCIITEST: \302\243\r\n\r\n", 92, MSG_NOSIGNAL, NULL, 0) = 92 recvfrom(5, "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nServer: Unit/1.30.0\r\nDate: Mon, 22 May 2023 12:44:11 GMT\r\nTransfer-Encoding: chunked\r\n\r\n12\r\n'\302\243' (2) [ascii]\n\n\r\n0\r\n\r\n", 102400, 0, NULL, NULL) = 160 '£' (2) [ascii] So we can see curl sent it UTF-8 encoded '\302\243\' which is C octal escaped UTF-8 for 0xC2 0xA3, and we got the same back. But it should not be marked as ASCII. When doing PyUnicode_New(size, 127) it sets the buffer as ASCII. So we need to use another function and that function would appear to be PyUnicode_DecodeCharmap() Which creates an Unicode object with the correct ascii/non-ascii properties based on the character encoding. With this function we now get $ curl -H "ASCIITEST: $" http://localhost:8080/ '$' (1) [ascii] $ curl -H "ASCIITEST: £" http://localhost:8080/ '£' (2) [non-ascii] and for good measure $ curl -H "ASCIITEST: $ £" http://localhost:8080/ '$ £' (4) [non-ascii] $ curl -H "ASCIITEST: $" -H "ASCIITEST: £" http://localhost:8080/ '$, £' (5) [non-ascii] PyUnicode_DecodeCharmap() does require having the full string upfront so we need to build up the potentially comma separated header field values string before invoking this function. I did not want to touch the Python 2.7 code (which may or may not even be affected by this) so kept these changes completely isolated from that, hence a slight duplication with the for () loop. Python 2.7 was sunset on January 1st 2020[0], so this code will hopefully just disappear soon anyway. I also purposefully didn't touch other code that may well have similar issues (such as the HTTP header field names) if we ever get issue reports about them, we'll deal with them then. [0]: <https://www.python.org/doc/sunset-python-2/> Link: <https://docs.python.org/3/c-api/unicode.html> Closes: <https://github.com/nginx/unit/issues/868> Reported-by: RomainMou <https://github.com/RomainMou> Tested-by: RomainMou <https://github.com/RomainMou> Signed-off-by: Andrew Clayton <a.clayton@nginx.com>
author: Andrew Clayton <a.clayton@nginx.com> 2023-05-26 20:54:43 +0100
committer: Andrew Clayton <a.clayton@nginx.com> 2023-11-09 17:53:09 +0000
commit: 5cfad9cc0bb3809f802cf83d2739739fdfaab7a8 (patch)
tree: 76721197726fb4fd7ce397d0c8a70a80c65f1216 /src
parent: dd0c53a77dbd3c8b7e3496c5e15cef757346ef8b (diff)
download: unit-5cfad9cc0bb3809f802cf83d2739739fdfaab7a8.tar.gz
unit-5cfad9cc0bb3809f802cf83d2739739fdfaab7a8.tar.bz2
1 files changed, 28 insertions, 2 deletions
diff --git a/src/python/nxt_python_wsgi.c b/src/python/nxt_python_wsgi.c
index ce62534b..c621097e 100644
--- a/src/python/nxt_python_wsgi.c
+++ b/src/python/nxt_python_wsgi.c
@@ -866,10 +866,35 @@ nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
     src = nxt_unit_sptr_get(&f->value);
 
 #if PY_MAJOR_VERSION == 3
-    res = PyUnicode_New(vl, 255);
+    if (nxt_slow_path(n > 1)) {
+        char  *ptr;
+
+        p = nxt_unit_malloc(NULL, vl + 1);
+        if (nxt_slow_path(p == NULL)) {
+            return NULL;
+        }
+
+        ptr = p;
+        p = nxt_cpymem(p, src, f->value_length);
+
+        for (i = 1; i < n; i++) {
+            p = nxt_cpymem(p, ", ", 2);
+
+            src = nxt_unit_sptr_get(&f[i].value);
+            p = nxt_cpymem(p, src, f[i].value_length);
+        }
+        *p = '\0';
+
+        src = ptr;
+    }
+
+    res = PyUnicode_DecodeCharmap(src, vl, NULL, NULL);
+
+    if (nxt_slow_path(n > 1)) {
+        nxt_unit_free(NULL, src);
+    }
 #else
     res = PyString_FromStringAndSize(NULL, vl);
-#endif
 
     if (nxt_slow_path(res == NULL)) {
         return NULL;
@@ -885,6 +910,7 @@ nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
         src = nxt_unit_sptr_get(&f[i].value);
         p = nxt_cpymem(p, src, f[i].value_length);
     }
+#endif
 
     return res;
 }
author	Andrew Clayton <a.clayton@nginx.com>	2023-05-26 20:54:43 +0100
committer	Andrew Clayton <a.clayton@nginx.com>	2023-11-09 17:53:09 +0000
commit	5cfad9cc0bb3809f802cf83d2739739fdfaab7a8 (patch)
tree	76721197726fb4fd7ce397d0c8a70a80c65f1216 /src
parent	dd0c53a77dbd3c8b7e3496c5e15cef757346ef8b (diff)
download	unit-5cfad9cc0bb3809f802cf83d2739739fdfaab7a8.tar.gz unit-5cfad9cc0bb3809f802cf83d2739739fdfaab7a8.tar.bz2