2023 SimulationDrivenAutomatedEndtoE

From GM-RKB

Subject Headings: End-to-End Testing, Software-System Simulation.

Notes

Cited By

Quotes

Abstract

This is the first work to report on inferential testing at scale in industry. Specifically, it reports the experience of automated testing of integrity systems at Meta. We built an internal tool called ALPACAS for automated inference of end-to-end integrity tests. Integrity tests are designed to keep users safe online by checking that interventions take place when harmful behaviour occurs on a platform. ALPACAS infers not only the test input, but also the oracle, by observing production interventions to prevent harmful behaviour. This approach allows Meta to automate the process of generating integrity tests for its platforms, such as Facebook and Instagram, which consist of hundreds of millions of lines of production code. We outline the design and deployment of ALPACAS, and report results for its coverage, number of tests produced at each stage of the test inference process, and their pass rates. Specifically, we demonstrate that using ALPACAS significantly improves coverage from a manual test design for the particular aspect of integrity end-to-end testing it was applied to. Further, from a pool of 3 million data points, ALPACAS automatically yields 39 production-ready end-to-end integrity tests. We also report that the ALPACAS-inferred test suite enjoys exceptionally low flakiness for end-to-end testing with its average in-production pass rate of 99.84%.

VI. RELATED WORK

There has been a great deal of work on test automation in the last few decades, starting with early pioneering work on automated search-based software testing [19] and symbolic execution [20], both of which can be traced back to 1976 [21], [22], [23], and which have remained active topics of research ever since.

An ever-present problem throughout these five decades of research has been the lack of a test oracle [24], first formally identified and fully described by Weyuker in 1982 [25]. The Oracle Problem is concerned with determining, automatically, whether a test has passed: while we can automate the process of generating test inputs, the problem of determining which outputs should be deemed correct for a given input has remained challenging. When there is a specification of the intended behaviour, the oracle problem is less pernicious, because it is transformed into the problem of generating oracles from specifications [26], [27]. This approach can be adopted even when the specification is written in a (structured) natural language, rather than a formal specification language [28].
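As an illustration of generating an oracle from a specification (a minimal sketch with hypothetical names, not an example from the paper), the specification can be rendered as an executable predicate that decides pass/fail automatically:

```python
def sort_spec(inp, out):
    """Specification-derived oracle: the output must be the sorted
    permutation of the input."""
    return out == sorted(inp)

def system_under_test(xs):
    # Stand-in for the implementation being tested.
    return sorted(xs)

def run_test(inp):
    """Input generation can be automated separately; the specification
    supplies the oracle that decides whether the test passed."""
    return sort_spec(inp, system_under_test(inp))

print(run_test([3, 1, 2]))  # True
```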

However, in many cases there is no specification, and it remains impractical to construct one. Furthermore, for some systems it has proved to be impossible to fully specify the expected behaviour. Simulations are one such category of systems: we use simulations, in general, to discover, inter alia, outcomes that may be unexpected. As such, simulation-based systems denote the epitome of Weyuker’s ‘untestable’ programs [25]: not only is the specification typically not known, but it is inherently unknowable. If we could accurately predict the real behaviour we would not need a simulation, so we cannot expect to have a fully automated oracle for a simulation-based system.

For regression testing, the output from the previous version of the system can be used as a partial oracle for subsequent executions [29]. However, this leaves open the problem of determining which aspects of the previous version's behaviour are salient for the regression test problem in hand. It also clearly cannot cater for situations in which the change in question seeks to fix previously buggy behaviour. Finally, regression testing is only one part of the overall test obligations that rest on software engineers' shoulders. Specifically, the oracle problem remains particularly challenging for testing novel functionality, for which previous behaviour is, by definition, unavailable and thus regression testing is inapplicable.
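A regression-testing partial oracle of this kind can be sketched as a differential check against the previous version (illustrative code, not drawn from the paper); note that, as the text observes, a reported divergence may equally be an intended fix rather than a bug:

```python
def previous_version(x):
    # Behaviour of the last shipped version serves as the partial oracle.
    return x * x

def new_version(x):
    # Change under test.
    return x ** 2

def regression_divergences(inputs):
    """Return the inputs on which the new version disagrees with the old.
    Divergence is only a signal: it may be a bug, or an intended fix."""
    return [x for x in inputs if previous_version(x) != new_version(x)]

print(regression_divergences(range(10)))  # [] -> behaviour preserved
```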

Opportunities for extracting test information from production logs have been known for some time [30]. Full test automation (test design including inputs and expected output behaviours) is possible using the so-called implicit oracle [24], which captures known buggy behaviours like crashing and running out of memory. Such behaviours can be deemed to be incorrect behaviour, irrespective of the input used to test the system under test. At Meta Platforms, we have previously used an implicit oracle in our work on simulation-based testing using Sapienz [6] and FAUSTA [31] for example. The implicit oracle is a powerful way to overcome the oracle problem for testing problems associated with reliability, both client-side application reliability (in the case of Sapienz) and server-side reliability (in the case of FAUSTA).
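The implicit oracle can be sketched as follows (a toy illustration, with an unhandled exception standing in for crashes or memory exhaustion):

```python
def implicit_oracle(sut, test_input):
    """Implicit oracle: any crash is deemed incorrect behaviour,
    irrespective of the input used to test the system under test."""
    try:
        sut(test_input)
        return "pass"
    except Exception:
        return "fail"

def fragile_sut(x):
    # Hypothetical system under test that crashes on zero.
    return 10 // x

print(implicit_oracle(fragile_sut, 5))  # pass
print(implicit_oracle(fragile_sut, 0))  # fail
```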

However, the aim of moving beyond the implicit oracle poses considerable challenges; the automated construction of oracles requires knowing the intended behaviour of a software system as well as its actual behaviour. Technologies for automatically searching test input spaces are now relatively mature and widely deployed. Nevertheless, the problem of automatically determining acceptable outputs remains one of the final obstacles to fully automatic test case design. As a result, there has been much work on approaches to automated partial oracle discovery, synthesis and inference [32], [33].

Work has also been conducted on automated inference of test oracles from documentation [34], [35], exceptions [36], and mutation testing [37], [38]. Techniques aiming to alleviate some of the human burden have focused either on reducing human effort to define oracles [39], [40] or on providing better automated support to help humans improve oracles [41]. Another approach to reducing human oracle effort is simply to reduce the number of test inputs automatically generated, such that the human effort of checking the corresponding outputs is also reduced [42].

Metamorphic testing [43] and other ‘pseudo’ oracle approaches [44] seek to reduce oracle effort (manual or otherwise) by capturing only properties or aspects of oracles rather than the complete oracle. Metamorphic testing further reduces the cognitive burden on the human test designer, who merely has to define metamorphic relations rather than full test oracles [43]. Likely metamorphic relations can also be inferred, thereby more fully automating the metamorphic testing approach [45].
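A classic metamorphic-testing illustration (not from the paper) uses the relation sin(π − x) = sin(x): the tester need not know the correct value of sin(x), only that related inputs must produce related outputs:

```python
import math

def metamorphic_check(f, x, tol=1e-9):
    """Metamorphic relation for sine: f(pi - x) must equal f(x).
    No oracle for the 'true' value of f(x) is required."""
    return abs(f(math.pi - x) - f(x)) <= tol

# Follow-up inputs are derived from source inputs; any violation of the
# relation reveals a bug without a complete oracle.
print(all(metamorphic_check(math.sin, x / 10) for x in range(20)))  # True
```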

Much of this existing literature on test inference, promising though it is, has hitherto been evaluated on relatively small-scale non-industrial systems and none of the previous literature reports on results from industrial deployment at scale in the field. By contrast, ALPACAS has been deployed for integrity testing, at Meta, on systems of hundreds of millions of lines of server-side code. To the best of our knowledge, the present paper, therefore, represents the first report of an industrial deployment of automated oracle inference at scale.

A. Relation to Capture–Replay, Regression and Combinatorial testing

ALPACAS could be thought of as closely related to capture–replay test techniques [46] (especially those that rely on execution logs [47]) because it constructs tests from observations. The observations could be thought of as being 'captured', and the test is 'replaying' these captured observations.

Like capture–replay techniques, ALPACAS is also a form of regression testing. However, compared to more general regression test generation methods [29], [48], ALPACAS needs neither explicit supervision on the specific aspects to be tested, nor explicitly specified oracles. ALPACAS circumvents both by leveraging human-generated production logs as a proxy for the oracle behaviour.
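The idea of using logged production interventions as an oracle proxy can be caricatured as follows (a toy sketch with invented field names; this is not ALPACAS's actual pipeline or log schema):

```python
# Hypothetical production log: observed content and the intervention taken.
logs = [
    {"content": "spam-like post", "action": "remove"},
    {"content": "benign post", "action": "allow"},
]

def infer_tests(records):
    """Each observed (input, intervention) pair becomes a test whose
    expected outcome is the logged production intervention."""
    return [(r["content"], r["action"]) for r in records]

def replay(sut, tests):
    # 'Replay' the captured observations against the system under test.
    return [sut(inp) == expected for inp, expected in tests]

def integrity_sut(content):
    # Stand-in integrity system under test.
    return "remove" if "spam" in content else "allow"

print(replay(integrity_sut, infer_tests(logs)))  # [True, True]
```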

Similarly, other works that rely on an implicit oracle [24] are biased by the implied conclusions (from crashes or buggy behaviours) or by assumptions (about the workings of the underpinning systems). Such approaches are typically unable to adapt quickly in highly dynamic environments such as those at Meta Platforms. Other oracle inference processes, such as explicit human specification, are known not to scale well [42].

Combinatorial testing [49], [50] is complementary to ALPACAS, because the (c, r, d, a) values inferred from the human production logs could be used to explore additional interactions. For example, we could seek to explore pairwise interactions of suitably selected (c, r, d, a) values, using combinatorial testing to minimise the number of tests involved. Thus, a total interaction testing procedure could be used as an extension to ALPACAS.
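A greedy sketch of pairwise (2-way) combinatorial test selection is shown below (illustrative code; the factor names echo the (c, r, d, a) values but are otherwise invented):

```python
from itertools import combinations, product

def pairwise_suite(params):
    """Greedily pick tests until every pair of values across every pair
    of factors is covered at least once (a 2-way covering-array sketch)."""
    factors = list(params)

    def pairs(row):
        return {(f1, row[f1], f2, row[f2])
                for f1, f2 in combinations(factors, 2)}

    candidates = [dict(zip(factors, vs))
                  for vs in product(*(params[f] for f in factors))]
    needed = set().union(*(pairs(c) for c in candidates))
    suite = []
    while needed:
        # Pick the candidate covering the most still-uncovered pairs.
        best = max(candidates, key=lambda c: len(pairs(c) & needed))
        suite.append(best)
        needed -= pairs(best)
    return suite

# Three binary factors: pairwise coverage needs 4 tests, not all 8.
suite = pairwise_suite({"c": [0, 1], "r": [0, 1], "d": [0, 1]})
print(len(suite))  # 4
```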

References

Mark Harman, Shreshth Tuli, Kinga Bojarczuk, Natalija Gucevska, Xiao-Yu Wang, and Graham Wright (2023). "Simulation-Driven Automated End-to-End Test and Oracle Inference." doi:10.48550/arXiv.2302.02374