Tay Ray Chuan home archive

semantic grep

Fri, 20 Jun 2014 22:51:44 +0800 | Filed under static analysis

Some time back, I had written a small patch for the rspec project; you can see it here. Today, I don't quite remember why I wrote it; perhaps I was trying to get started contributing to rspec.

As you can see, it was really a very trivial patch. All I did was to replace blocks that invokes an instance method, of 0-arity, with the '&:' syntax. For example,

arr.map {|x| x.do_this() }

gets replaced by

arr.map &:do_this

As a result, some backward-compatibility code could be removed.

However, it got me thinking: could we look for other such structures? The usual approach would be to craft a regular expression, then run it through grep:

\{\|\w+\| \w\.\w \}

(I don't even know if that works.) Then, we need to extend it to handle do..end style blocks. Then, we need to handle non-parentheses method calls. Then, we need to handle optional spaces...you get the picture: the final expression would look like an unintelligble, gibberish mess.

What if we could grep at the syntax-structure level? That way, we wouldn't need be concerned with tittle-tattle like whitespace. If you took a compilers/programming language class, then you would be familiar with the ASTs/parse trees. For Ruby, I was aware of sexp_processor, having followed zenspider; but what about searching through an sexp? Googling "ruby sexp grep" led me to sexp_path. (Path like how XPath is for XML, geddit?)

For the longest time, I had no progress on this, as I had some trouble wrapping my head around sexp_path. This was not the fault of sexp_path; it had a fair bit of documentation. I realised that, for us to do such a search, we would need:

  1. a way to specify the query (what we want to search), which would necessarily require an understanding of
  2. what we are matching against.

This seems obvious. I realised I wasn't really sure about (2): what we are matching against. How could I grasp what to construct (1) if I didn't know what I was matching?

So, let's dig into sexp_path. In sexp_path, paths/queries are matched against the output of ruby_parser: S-expressions that are represented as Ruby arrays. Consider the expression I was trying to find in the rspec patch:

config_string.split(/\n+/).map {|l| Shellwords.shellwords(l) }.flatten

Running this through ruby_parser:

s(:call,
 s(:iter,
  s(:call,
   s(:call, s(:call, nil, :config_string), :split, s(:lit, /\n+/)),
   :map),
  s(:args, :l),
  s(:call, s(:const, :Shellwords), :shellwords, s(:lvar, :l))),
 :flatten)

As you can see, the first element of the array is the "type" of the expression - eg. a function call are represented by a :call, while blocks (be it do...end or {}) are represented by an :iter.

Aha! Now sexp_path's documentation makes sense. The query we need (1) seems obvious:

s(:iter,
 _,
 s(:args, atom % "p1"),
 s(:call, s(:lvar, atom % "p2"), _))

Actually, the captures p1 and p2 should be declared equal, but there is no way to express this (as far as I understand sexp_path) like how we can do assertions in regular expressions (positive/negative lookahead/lookbehind).

Let's run this on rspec-core:

$ find . -name '*.rb' | xargs ruby `gem contents sexp_path | grep examples/sexp_grep` \
's(:iter, _, s(:args, atom % "p1"), s(:call, s(:lvar, atom % "p2"), _))'

PS. Suggestions are welcome for cleaner ways to get a file from an installed gem.

PPS. We need to find then pipe to xargs because globs aren't handled by sexp_path's sexp_grep.

This throws up a couple of results. One of them is lib/rspec/core/world.rb, line 93:

FlatMap.flat_map(groups) {|g| g.descendants}.

This is indeed what we are looking for.

What about false negatives - that is, things that sexp_path falsely regarded as non-matches? Unfortunately, it is difficult to check for this. We just have to put our faith in sexp_path and our query, till the day we find an instance which was missing among the reported matches...

What about general tools that could be used for any language? Searching for "ast grep", led me to "semantic grep". I had a look at some of these; below are my observations of them.

Coccinelle (French for ladybug, in case you're wondering, like I did) provides a way to "patch" a source tree. If you've worked with a non-trivial project before, where an abstraction (data structure, class, etc.) is defined in one place and used in many other places, a grep just doesn't cut it. For a dynamic language, tests would be helpful, but not conclusive, unless you have good test coverage. The project page has some examples that patch the Linux kernel - pretty interesting. In terms of grep, I could not find one, but I'm pretty sure it's possible - after all, if they could patch, they must have a way to first determine the suitability of the patch (ie. a match).

Facebook's pfff provides semantic grep via the sgrep tool. Documentation can be found here. This seems more powerful than sexp_path, because your query doesn't have to be an sexp (as in the case of sexp_path), but can be pseudo-source language. However, while it is language-agnostic, only PHP is well-supported, and it is not clear how support for another language could be implemented.

My search of the above with regard to Ruby had turned up coypond. In this case, the path expressions seem less powerful, as they are able to match instance members only (eg. Thumbnail#initialize).

While reading up on pfff, I encountered this on the wiki page regarding the motivations of sgrep (source):

Another solution would be to use a compiler frontend and write a visitor on the abstract syntax tree that recognizes the complex pattern. Unfortunately this is also tedious to write as a compiler frontend is usually a large software and the abstract syntax tree is a complex structure.

("also tedious" was in reference to a method described prior to the quote.)

This led me to wonder if LLVM, which I knew to have a good separation of the frontend parsing and its analysis and optimization, had such an ability. Googling did not turn up anything. I'm pretty sure interfaces to visit ASTs are already available in llvm; all that would remain would be an interface to accept patterns/queries, much like how sexp_path and sgrep do, and traverse the AST for us. This would make for an interesting project.

blog comments powered by Disqus