Techblog #4

Lines of Code

09. March 2022

After the Log4Shell debacle in December (no, I don't want to provide a zillion links), one security aspect keeps coming up in discussions again: Lines of Code, i.e. the attack surface of services.

As a measurement, "Lines of Code" spans a wide numerical range: from small 1-line libraries (*cough* NodeJS is-promise *cough*) to millions of lines changed in every single Linux kernel release[1].

Let's do a small comparison.

$ apt-get source openjdk-17
...
Need to get 63.7 MB of source archives.
Get:1 http://deb.debian.org/debian testing/main openjdk-17 17.0.1+12-1 (dsc) [4250 B]
Get:2 http://deb.debian.org/debian testing/main openjdk-17 17.0.1+12-1 (tar) [63.5 MB]
Get:3 http://deb.debian.org/debian testing/main openjdk-17 17.0.1+12-1 (diff) [198 kB]
...
$ cd openjdk-17-17.0.1+12/
$ LC_ALL=C sloccount .
...
Totals grouped by language (dominant language first):
java:         5139027 (61.55%)
xml:          1291519 (15.47%)
cpp:          1090935 (13.07%)
asm:           404678  (4.85%)
ansic:         374796  (4.49%)
objc:           18535  (0.22%)
sh:             15924  (0.19%)
javascript:     10850  (0.13%)
python:          2182  (0.03%)
awk:              351  (0.00%)
sed:              172  (0.00%)
perl:             114  (0.00%)
jsp:               24  (0.00%)
csh:                3  (0.00%)

Total Physical Source Lines of Code (SLOC) = 8,349,110

Now, this is only the plain JDK - no Spring Boot, no other libraries or dependencies, and no actual application code. Some files may be misdetected, and the tree includes test files and other material that never runs in production, but we'll take that as a rough measure.

Let's compare that with, oh, some other programming language - let's choose Common Lisp, and specifically the most prominent Open Source implementation, SBCL:

$ apt-get source sbcl
...
Need to get 6767 kB of source archives.
Get:1 http://deb.debian.org/debian testing/main sbcl 2:2.1.11-1 (dsc) [2565 B]
Get:2 http://deb.debian.org/debian testing/main sbcl 2:2.1.11-1 (tar) [6688 kB]
Get:3 http://deb.debian.org/debian testing/main sbcl 2:2.1.11-1 (diff) [76.7 kB]
...
$ cd sbcl-2.1.11/
$ LC_ALL=C sloccount .
...
Totals grouped by language (dominant language first):
lisp:        444220 (91.74%)
ansic:        32577  (6.73%)
sh:            4847  (1.00%)
asm:           2532  (0.52%)
cpp:             27  (0.01%)
pascal:           5  (0.00%)

Total Physical Source Lines of Code (SLOC) = 484,208

That's 5.8 percent of the Java LOC - a bit more than one-twentieth; about one-seventeenth, to be more precise.
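As a cross-check, the ratio can be recomputed from the two totals in the listings above. This is only a small sketch; the helper name `sloc_share` is made up for this post and is not part of sloccount.

```shell
# Hypothetical helper: print how large "part" is relative to "whole",
# both as a percentage and as "about one in N".
sloc_share() {
  awk -v part="$1" -v whole="$2" \
    'BEGIN { printf "%.1f%% (about 1/%.0f)\n", 100 * part / whole, whole / part }'
}

# SBCL total vs. OpenJDK 17 total, as reported by sloccount above
sloc_share 484208 8349110
# prints: 5.8% (about 1/17)
```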

Now, I'm not a Java person, so I can't really comment on that ecosystem; I wouldn't even know which libraries are needed or recommended. Instead of adding up wrong numbers, let's take only the proof of concept (POC) that I know, the one referenced above (in a newer version), and get the numbers for the complete solution.

With Common Lisp being an interactive and introspectable system by design, the compiler and ASDF already record source locations of variables, constants, functions, structure/class definitions, libraries, and so on. I only need to add one special case: the overall SBCL source location, so that the whole SBCL implementation is counted in as well:

(let ((systems))
  ;; Get all loaded systems
  (asdf:map-systems (lambda (f) (push f systems)))
  ;; Fetch source directory paths
  (let ((paths (list* (namestring (translate-logical-pathname #P"SYS:SRC;"))
                      (loop for f in systems
                            for src-dir = (asdf:system-source-directory f)
                            for path = (when src-dir (namestring src-dir))
                            when path collect path)))
        (kept ()))
    ;; Reduce to the common base paths of ASDF subsystems
    (loop for p in (sort paths #'< :key #'length)
          unless (find-if (lambda (c)
                            (alexandria-2:starts-with-subseq c p))
                          kept)
          do (push p kept))
    ;; Run "sloccount" to get statistics
    (uiop:run-program
      (list* "env" "LC_ALL=C"
             "sloccount"
             ;"--details" ;; enable to get per-file LOCs
             kept)
      :output "/tmp/loc.txt"
      :error-output :string)))
NIL
"Warning: newline in string - file ...cxml-20200610-git/doc/index.xml, line 64
Warning: newline in string - file ...cxml-20200610-git/doc/index.xml, line 68
...
"
0

We get a few warnings, but the exit code is zero.
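The "reduce to the common base paths" step in the listing above keeps a directory only if no shorter, already kept directory is a prefix of it, so that no subtree gets counted twice. Here is the same idea sketched in shell, purely for illustration; the function name and the sample paths are made up and are not the real locations on my system.

```shell
# Hypothetical helper: read one directory path per line on stdin and print
# only those paths that do not start with a shorter, already kept path.
reduce_to_base_paths() {
  # Prefix every path with its length, sort shortest first, strip the
  # length column again, then filter by prefix.
  awk '{ print length($0) "\t" $0 }' |
    LC_ALL=C sort -n |
    cut -f2- |
    awk '{
      for (i = 1; i <= n; i++)
        if (index($0, kept[i]) == 1) next   # a kept path is a prefix: skip
      kept[++n] = $0
      print
    }'
}

# Made-up sample paths; the two subdirectories of /ql/cxml/ are dropped.
printf '%s\n' /src/sbcl/ /ql/cxml/xml/ /ql/cxml/ /ql/cxml/dom/ /ql/flexi-streams/ |
  reduce_to_base_paths
# prints:
# /ql/cxml/
# /src/sbcl/
# /ql/flexi-streams/
```

Like the Lisp version, this is a plain string-prefix test, so it relies on directory names ending in a slash.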

So, let's look at the results. The output file /tmp/loc.txt contains the usual suspects (like trivial-gray-streams, trivial-backtrace, and so on); 58 source locations in total. The LOC count shows:

Totals grouped by language (dominant language first):
lisp:          636448 (93.45%)
ansic:          34173  (5.02%)
xml:             5108  (0.75%)
asm:             2670  (0.39%)
sh:              1867  (0.27%)
perl:             324  (0.05%)
ruby:             321  (0.05%)
javascript:        51  (0.01%)
pascal:            42  (0.01%)
java:              31  (0.00%)
cpp:               27  (0.00%)
awk:               10  (0.00%)

Total Physical Source Lines of Code (SLOC) = 681,072

So the full POC of an HTTPS-enabled application is less than 700 thousand lines of code[2], or 8.2% of the Java 17 Development Kit alone (no libraries, no frameworks, no application code counted there!).
Turning that ratio around: with a Java solution there are (or rather would be, as Log4Shell shows that nobody does that) more than 12 times as many LOC to review.
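For anyone who wants to redo the comparison from a report file, the grand total can be pulled out of the sloccount output and set against the JDK number. A minimal sketch, assuming the total line looks exactly like the one in the listings above; `sloc_total` is a made-up helper name, not a sloccount feature.

```shell
# Hypothetical helper: read a sloccount report on stdin and print the
# grand total as a plain integer (thousands separators removed).
sloc_total() {
  sed -n 's/^Total Physical Source Lines of Code (SLOC)[[:space:]]*=[[:space:]]*\([0-9,]*\).*/\1/p' |
    tr -d ','
}

# Sample input: the total line as shown in the POC listing above.
poc=$(printf '%s\n' 'Total Physical Source Lines of Code (SLOC) = 681,072' | sloc_total)
jdk=8349110   # OpenJDK 17 total from the first sloccount run

awk -v poc="$poc" -v jdk="$jdk" \
  'BEGIN { printf "POC: %d SLOC, JDK is %.1f times larger\n", poc, jdk / poc }'
# prints: POC: 681072 SLOC, JDK is 12.3 times larger
```

With the real report, the first step would be `sloc_total < /tmp/loc.txt` instead of the sample line.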

Ain't that a good reason (one of many) to learn a new[3], more effective programming language?

 

[1] Linux 5.14 from Git has 20755459 LOC, according to sloccount; https://lwn.net/Articles/867540/ reports +861000 -321000 LOC.
[2] These 680 KLOC even include duplicated Unicode tables – e.g. SBCL has src/code/external-formats/enc-cn-tbl.lisp (44973 LOC), while flexi-streams-20210807-git provides a (probably semantically identical) enc-cn-tbl.lisp (48314 LOC) – and then there's an enc-jpn-tbl.lisp, too.

[3] Well, rather an old programming language, really.
