Running Awk in parallel to process 256M records

winrid · on Aug 29, 2021

Late night semi related rant while I can't sleep.

I worked at one place where we had this big distributed system for processing about 1M table rows (recalculate some stuff with latest code to see if latest code has regressions).

I joined a couple years after launch and it took months to get it working okay and have good visibility on it.

It took about eight hours to run, eventually got it down to three. The actual calculations only took like a second per object, so with 24 or so VMs you get the 8 hours. Sometimes it would take too long and the cron would seed the next batch of items in the queue without checking if it was empty, resulting in a process that never finished!

You're probably thinking, just add more nodes! Scale horizontally! We'll, we were on kubernetes. Except we weren't allowed to use k8s. We had to use this abstraction provided by devops AROUND k8s. This framework had some limitations.

Also, simply scaling horizontally would take down the production DB due to number of connections, instead of say using multithreading and reusing connections.

I had a solution that ran through all the data on my MacBook with GNU parallel in less than an hour, but I could never convince the architects to let me deploy it. :)

So, distributed stuff can be really nice. But if you're having trouble building the simple version done well, probably don't make it distributed yet. I might still have PTSD from "hey can you run the Thing on your laptop again? The Thing in prod won't finish for another 9 hours."

eggy · on Aug 29, 2021

AWK seems to be having a renaissance, and I wonder if it is only because Perl sort of lost favor for a while to Python and others, while Perl 6 being renamed Raku added further confusion. Using options like 'perl -pie' gives you a nice subsystem to perform your AWK-like operations, and you have a lot more in Perl to back it up if needed. I am seeing AWK pop up here and in other forums, but maybe I am just focused on it for now.

asicsp · on Aug 29, 2021

>AWK seems to be having a renaissance

I'd say grep, sed and awk are essential tools to know for cli text processing. I think more programmers are getting used to cli (git, wsl, etc) which would lead to these tools.

I reach for perl when I need its advanced regexp features, built-in functions, etc.

eggy · on Aug 29, 2021

I agree especially since I started programming in 1978 and awk had just been created a year earlier. I did not learn of it until the late 80s with gawk, and sed much later. I forget when I first touched grep. I do remember using perl quite a bit for text munging in the early 90s. I currently use Powershell a lot more because of its convenience in Windows, and all of the assemblies you can call on, and I still use awk, grep, and others through WSL2. My favorite language, J, is fun for the same tasks [0,1].

  [0]  jsoftware.com
  [1]  https://dev.to/bakerjd99/jetl-j-extract-transform-and-load-9cm

asicsp · on Aug 29, 2021

Previous discussion: https://news.ycombinator.com/item?id=23394024

See also "Command-line Tools can be 235x Faster than your Hadoop Cluster": https://news.ycombinator.com/item?id=22188877

nafizh · on Aug 29, 2021

You can definitely use awk (I use it myself). But lets not pretend it’s readable for anyone after the original writer. It has a single purpose and that is to get the text munging task in front of you done as quickly as possible.

dundarious · on Aug 29, 2021

I disagree. The scripts in the article are very readable, and my personal experience has been that short awk scripts are as messy/unreadable as short scripts in any other language — or as clean/readable. I don’t even think this would be different for longer scripts.

kazinator · on Aug 29, 2021

Awk code, especially throwaway one-liners, can be quite readable. And if not so absolutely, then better than the alternatives.

Awk is not actually very good for text munging, by the way; it's for munging semi-numerical text in a record and field format.

It has hidden pitfalls, like objects being treated as numeric strings, depending on which source they came from.

If a and b came from a positional field then a == b is true if a is "5.00" and b is "0.5E1". But "5.00" == "0.5E1" is false. If you want to ensure a string comparison, and it's not clear in the code what the provenance is of the arguments, you have to do something like a "" == b: catenate an empty string to force string mode. Likewise in some situations you have do a + 0 == b to force numeric.

The text processing in Awk fairly poor, and the core of it is just a handful of library functions that have nicer implementations in other languages.

Here is GNU Awk code I wrote in 2014 for parsing a regex: not a problem of parsing records with semi-numerical fields. It's a ecursive descent job just to validate the syntax, not to translate it to anything:

  #!/usr/bin/gawk -f
  
  function empty(s)
  {
    return s == ""
  }
  
  function eat_char(s)
  {
    return substr(s, 2)
  }
  
  function eat_chars(s, n)
  {
    return substr(s, n + 1)
  }
  
  function matches(s, pfx)
  {
    return substr(s, 1, length(pfx)) == pfx 
  }
  
  function match_and_eat(s, pfx)
  {
    if (matches(s, pfx))
      return eat_chars(s, length(pfx))
    return s
  }
  
  function eat_rchar(c)
  {
    if (c ~ /^\\./)
      return eat_chars(c, 2)
  
    if (c == "$")
      return c
  
    if (c !~ /^[\[\*\+\?{}\(\)|]/)
      return eat_char(c)
  
    return c
  }
  
  function eat_bchar(c)
  {
    if (c ~ /^\\]|\\-|\\\\/)
      return eat_chars(c, 2)
  
    if (c !~ /^[\-\[]/)
      return eat_char(c)
  
    return c
  }
  
  function eat_class(c)
  {
    c = match_and_eat(c, "[:alnum:]")
    c = match_and_eat(c, "[:alpha:]")
    c = match_and_eat(c, "[:blank:]")
    c = match_and_eat(c, "[:cntrl:]")
    c = match_and_eat(c, "[:digit:]")
    c = match_and_eat(c, "[:graph:]")
    c = match_and_eat(c, "[:lower:]")
    c = match_and_eat(c, "[:print:]")
    c = match_and_eat(c, "[:punct:]")
    c = match_and_eat(c, "[:space:]")
    c = match_and_eat(c, "[:upper:]")
    return match_and_eat(c, "[:xdigit:]")
  }
  
  function eat_bracket_exp(e,
                           #local
                           f, o)
  {
    o = e
    e = eat_char(e)
  
    for (;;) {
      if (matches(e, "]")) {
        return eat_char(e)
      }
  
      if (matches(e, "[")) {
        f = eat_class(e)
        if (f == e)
          return o
        e = f
        continue;
      }
  
      f = eat_bchar(e)
  
      if (f == e)
        return o
      e = f
  
      if (matches(e, "-")) {
        e = eat_char(e)
        f = eat_bchar(e)
        if (f == e)
          return o
        e = f
      }
    }
  }
  
  function eat_rep_notation(n)
  {
    n = eat_char(n)
  
    if (n !~ /^[0-9]/)
      return n
  
    while (n ~ /^[0-9]/)
      n = eat_char(n)
  
    if (matches(n, "}"))
      return eat_char(n)
  
    if (!matches(n, ","))
      return n
  
    n = eat_char(n)
  
    if (matches(n, "}"))
      return eat_char(n)
  
    if (n !~ /^[0-9]/)
      return n
  
    while (n ~ /^[0-9]/)
      n = eat_char(n)
  
    return match_and_eat(n, "}")
  }
  
  function eat_factor(f)
  {
    if (matches(f, "("))
      return match_and_eat(eat_regex(eat_char(f)), ")")
  
    if (matches(f, "["))
      return eat_bracket_exp(f)
  
    return eat_rchar(f)
  }
  
  function eat_term(t,
                    #local
                    s)
  {
    s = eat_factor(t)
  
    if (empty(s) || s == t)
      return s
  
    t = s
  
    if (t ~ /^[?+*]/)
      return eat_char(t)
  
    if (matches(t, "{"))
      return eat_rep_notation(t)
  
    return t
  }
  
  function eat_regex(r,
                     #locals
                     s)
  {
    if (empty(r))
      return r
  
    s = eat_term(r)
  
    if (empty(s) || s == r)
      return s
  
    r = s;
  
    if (matches(r, "|"))
      r = eat_char(r)
  
    return eat_regex(r)
  }
  
  
  function is_regex(r)
  {
    if (matches(r, "^"))
      r = eat_char(r)
  
    if (empty(r))
      return 1
  
    r = eat_regex(r)
  
    if (r == "$")
      r = ""
  
    return empty(r);
  }
  
  
  #{
  #  printf("eat_rep_notation(%s) = %s\n", $0, eat_rep_notation($0));
  #}
  
  {
    # for (i = 0; i < 10000; i++)
    #  is_regex($0)
     printf("is_regex(%s)\n", is_regex($0) ? "yes" : "no")
  }

winrid · on Aug 30, 2021

I'm surprised at how close to JS it appears.

I wonder how much better off you'd be just writing it in Node.