UOA11-04: Semmle: a powerful query language for analysing large data sources

Submitting Institution

University of Oxford

Unit of Assessment

Computer Science and Informatics

Summary Impact Type

Technological

Research Subject Area(s)

Information and Computing Sciences: Computation Theory and Mathematics, Computer Software
Psychology and Cognitive Sciences: Cognitive Sciences


Download original

PDF

Summary of the impact

Semmle is a successful spin-out company set up by members of the UoA, based on their research on program analysis. Semmle markets an industrial-strength product allowing organisations with large software systems to understand and manage their code bases. This business intelligence platform started to be sold to prominent customers in 2008, including [text removed for publication] NASA. NASA used it to help ensure the safe landing of the Curiosity Mars Rover.

Underpinning research

From the late 90s, the Programming Tools team in the UoA researched ways to implement program analyses in a declarative style. Initially the emphasis was on functional programming, and the first major achievement was the invention of an efficient higher-order matching algorithm by Ganesh Sittampalam and Oege de Moor in 1998. This was an example of selecting a pattern language at the optimum point between expressivity (a bit more than 2nd order but not all of 3rd order) and efficiency. Also, the implementation heavily relied on ideas from logic programming [1]. The next major step was the development of a language for implementing optimising compiler transformations based on a combination of rewriting and side conditions phrased in temporal logic, by David Lacey and Oege de Moor in 2001 [2, 3]. A salient feature was the ability to bind free variables while matching the temporal logic formulae. The link to restricted forms of logic programming became even more apparent at this point, and joint work by Sittampalam, Lacey, De Moor and others explored this connection to express particular kinds of recursive analyses as logic programs that query control flow graphs, with publications in 2002 and 2003. Again the key was to strike the right balance between expressivity and efficiency, by not using the full power of a Turing-complete language.

At that time "aspect-oriented programming" was coming into prominence, and it appeared a perfect area of application for the technologies developed in the Programming Tools team. In a landmark paper in 2003, Damien Sereni and Oege de Moor showed that the earlier work with Sittampalam could achieve dramatic speedups in the execution of aspect-oriented programs. This was the beginning of the work on aspect-orientated program optimisation.

In 2004 a new optimising compiler for the most popular aspect-oriented programming language, AspectJ, was implemented by the Programming Tools team, demonstrating that the theoretical speedups could be achieved on an industrial scale [4]. In the meantime a number of other research groups had started using the Datalog programming language, a much restricted logic programming language that is a subset of PROLOG, for the purpose of program analysis. In the UoA, Elnar Hajiyev, together with Mathieu Verbaere and Oege de Moor, provided a Datalog implementation for experimentation using traditional database technology instead of the binary decision diagrams used by others, building on previous work by Thomas Reps at Wisconsin (who also used it for program analysis applications) and others [5].

It became apparent that no-one properly understood the semantics of the event patterns (called "pointcuts") in AspectJ, and therefore a clear semantics of its pattern language was required. The Programming Tools team addressed this question by translating that pattern language into Datalog, and giving a simple implementation through that translation [6].

References to the research

The three asterisked outputs best indicate the quality of the underpinning research.

[1] Oege de Moor, Ganesh Sittampalam: Generic Program Transformation. Advanced Functional Programming 1998: 116-149. DOI: 10.1007/10704973_3
Here we give a new, practical algorithm for a restricted form of higher-order matching, and show how it is used in implementing the well-known fusion optimisations.

 

*[2] Oege de Moor, David Lacey, Eric Van Wyk: Universal Regular Path Queries. Higher-Order and Symbolic Computation 16(1-2): 15-35 (2003). DOI: 10.1023/A:1023063919574
Here we present an extension of logic programming to easily express such analyses as patterns on paths in a control flow graph.

 
 
 

[3] David Lacey, Oege de Moor: Imperative Program Transformation by Rewriting. CC 2001: 52-68. DOI: 10.1007/3-540-45306-7_5
It is generally thought that well-known compiler optimisations are hard to implement by rewriting, but here we show that with side conditions written in temporal logic, the formulations are elegant and easy to reason about.

 

[4] Pavel Avgustinov, Aske Simon Christensen, Laurie J. Hendren, Sascha Kuzins, Jennifer Lhoták, Ondrej Lhoták, Oege de Moor, Damien Sereni, Ganesh Sittampalam, Julian Tibble: Optimising aspectJ. PLDI 2005: 117-128. DOI: 10.1145/1065010.1065026
In this paper, for the first time the overheads of aspects are considered, and we present analyses and optimisations for eliminating those overheads.

 

*[5] Elnar Hajiyev, Mathieu Verbaere, Oege de Moor: codeQuest: Scalable Source Code Queries with Datalog. ECOOP 2006: 2-27. DOI: 10.1007/11785477_2
Here we present an implementation of Datalog that executes on top of a standard SQL database, and benchmark its performance on some common source code querying tasks.

 

*[6] Pavel Avgustinov, Elnar Hajiyev, Neil Ongkingco, Oege de Moor, Damien Sereni, Julian Tibble, Mathieu Verbaere: Semantics of static pointcuts in aspectJ. POPL 2007: 11-23. DOI: 10.1145/1190216.1190221
The pattern language for pointcuts in AspectJ (the popular aspect-oriented extension of AspectJ) did not have a published semantics: here that void is filled, by a translation to Datalog.

 
 
 

Details of the impact

Path to Impact

Semmle was founded in December 2006 by Oege de Moor through ISIS Innovation to create from scratch the novel technology that realises the potential of the research set out in Section 2 in the context of Datalog, widening the scope of application to business intelligence rather than just program analysis. Further research was done in Semmle and six patents were filed by Semmle to protect these advances after the creation of the company.Semmle's first two years were dedicated to developing its technology and products. Its first major customer was signed up in 2008 [text removed for publication].

Direct Economic Impact

[text removed for publication]. Since November 2011, Semmle has been backed by a private investor group in the San Francisco Bay Area [text removed for publication] Many of Semmle's major clients are located in the US [text removed for publication] [A].

Large software engineering organisations struggle with a lack of visibility of how outsourced development teams are performing. To create such visibility it is necessary to analyse many different data sources, including the code itself, version history, the bug database, and test results. Much of this data are graphs and hierarchies, and thus writing the appropriate analysis to get visibility is hard. The research described here has (through further innovations at Semmle) led to a query language that, for the first time, opens up all the relevant data. Semmle's technology is routinely used in software production and management at numerous clients around the globe and has had significant benefits for them, as described below.

NASA Jet Propulsion Laboratory (JPL) has highly rigorous quality requirements and makes use of all the available leading commercial products in this area to help to ensure that its software is free of errors. Writing custom checkers is key in this environment. Together with Gerard Holzmann at JPL, Semmle has used its query language to implement his well-known "Power of Ten" rules for C programming. Together with Klaus Havelund, Semmle has implemented a new Java coding standard for use at JPL, and Semmle technology was used as one of a suite of static analysis tools to help secure the safe landing of the Curiosity Rover on Mars in August 2012, a $2.5 billion space programme. As Curiosity travelled towards Mars, computer scientists at JPL continuously tested and adjusted its software. In February 2012 a JPL engineer discovered a previously undetected code defect in the critical Entry, Descent and Landing (EDL) software. The defect had no adverse effect on the functioning of the software, but it highlighted the possibility that similar issues could exist in the code with more severe consequences. Any failure in the EDL system could lead to a catastrophic failure of the entire mission. The standard set of rigorous static analysis checks performed on every build of the code could not detect the flaw, and JPL contacted Semmle directly for help. Semmle created a custom analysis that, when run on the code that controlled the spacecraft, quickly identified the known issue and a few related cases, giving the engineers time to fix the code and helping to secure a safe landing [B]. In May 2012 Gerald J. Holzmann, one of the two leaders of NASA JPL's Laboratory for Reliable Software, published a ranked assessment of static source code analysis tools on his website placing Semmle at the top with the comment: `The tool gives accurate results and, once the database is built for a new application, queries are resolved very fast. Highly recommended.' [C]

Certipost is an important player in the European market for electronic communication and document processing, used by 85,000 companies and 520,000 individuals. Originally Certipost attempted to improve software architectures with open source tooling, but this was found ineffective because it was too generic: custom analyses were required. In particular Certipost wanted to check that the state of the code was in accordance with the design diagrams; using Semmle's query language, this was achieved with a minimum of effort. In 2012 Certipost published a paper describing the way in which Semmle technology was customised with rules encoding Certipost architecture, thus enabling them to keep code at a high level of quality and in sync with the defined architecture [D]. The head of architecture and analysis at Certipost has commented: `Certipost can now cross-check code and architecture against each other and detect gaps quickly and easily. The benefit is a reduced cost of keeping these in sync, keeping key quality attributes at desired levels. Semmle has become an indispensable solution for Certipost.'

[text removed for publication]Murex is the 2nd largest independent software vendor in France. It has a large code base, which other products had difficulty processing. Semmle's efficient queries were able to process it without problems, and again custom queries that linked code and other artefacts (in this case XML descriptions of screens) were imperative. Semmle's query language also proved valuable in implementing complex, large-scale refactorings. It is estimated that Semmle has saved them a 30% reduction in developer time on structural improvement and a 10% reduction in corrective maintenance on products that use Semmle. Murex have publically stated the benefits of using Semmle technology; their COO states: "The insight provided by Semmle has been critical to the success of several improvement projects at Murex. We find the ability to formulate custom analyses a key advantage. Semmle business intelligence is essential for a clean and clear understanding of complex software engineering data." [E]

EMC Corporation is an American multinational corporation that sells data storage products and services used to build web-based computing systems. Its big data division uses Semmle to impose rigorous coding standards, especially in the mission-critical query optimiser. EMC trialled products from vendors that are much longer established, but opted for Semmle because of the greater accuracy, which is enabled by the special query language. The Director of Software Engineering at Greenplum, a subsidiary of EMC, states: `Semmle has changed my view of software engineering. Semmle ensures a tight adherence to our standard of software excellence. New team members get help in avoiding common beginners' mistakes, and experienced developers can spread their knowledge of good practice.' [F]

Sources to corroborate the impact

[A] Information about Semmle can be corroborated by Oege de Moor, its founder and CEO; information about clients is also corroborated on the Semmle website at
http://semmle.com/customers/

[B] http://semmle.com/?case-study=nasa-jet-propulsion-laboratory-jpl
This article from the SEMMLE website confirms the use of Semmle technology to help secure the safe landing of Curiosity on Mars. Also confirmed at http://gigaom.com/2012/08/20/nasa-scrubbed- mars-rover-code-clean-over-and-over/

[C] Corroboration of Gerald J. Holzmann's positive views on Semmle are found on his website at
http://spinroot.com/static/ and on Semmle's website at the end of the NASA JPL case study at
http://semmle.com/?case-study=nasa-jet-propulsion-laboratory-jpl

[D] De Schutter, K. Automated architectural reviews with Semmle. 28th IEEE International Conference on Software Maintenance (ICSM), 2012. DOI: 10.1109/ICSM.2012.6405320.
A key paper presented by Certipost at ICSM, the premiere international event in software maintenance and evolution, confirming the use of Semmle's custom analyses to avoid the common problem of architectural drift. Corroboration of the benefits to Certipost of Semmle's technology is also contained in a case study on the Semmle website at http://semmle.com/?case-study=certipost

[E] Jean-Pierre Dacher, COO of Murex and former VP of Engineering at SAP Business Objects, confirms the benefits to Murex of Semmle's technology (including the reductions in developer time and corrective maintenance) in a case study on the Semmle website at http://semmle.com/?case- study=murex

[F] Florian Waas, Director of Software Engineering, Greenplum, confirms the benefits to EMC of Semmle's technology in a case study on the Semmle website at http://semmle.com/?case- study=greenplum-a-division-of-emc