Generalization

Generalizing data flow information can make it possible to analyze large data sets without losing accuracy. This section describes the process of generalizing information at each generalization tier. As a matter of course, each generalization tier uses data flow information obtained from its preceding more specific generalization tier. In this way, the basic block tier generalizes information obtained at the instruction tier, the procedure tier generalizes information obtained at the basic block tier, and so on. The algorithms used to generalize information at each generalization tier can have a direct impact on the accuracy of the information that can be obtained when used during data flow analysis. The subject of accuracy will be addressed for each specific tier.

To obtain generalized data flow information, a set of target executable image files, or modules, must be defined. The target modules serve to define the context from which data flow information will be obtained and generalized. The general process used to accomplish this involves visiting each procedure within each module. For each procedure, data flow information is collected at the instruction tier and is then generalized to each less-specific tier. To facilitate the reachability algorithm, it is assumed that as the data flow information is collected, it is persisted in a form such that can be accessed on demand. The process described in this paper assumes a normalized database is used to contain the data flow information found at each generalization tier. In this manner, the upper limit associated with the number of target modules is tied to the amount of available persistent storage with respect to the amount required by a given data flow problem.

Before proceeding, it is important to point out that while this paper describes explicit algorithms for generalizing at each tier, it is entirely possible to substitute alternative algorithms. This serves to illustrate that the concept of generalizing information along generalization tiers is sufficiently abstract enough to support representing alternate forms of data flow and control flow information. By using different algorithms, it is possible to convey different forms of data flow relationships which vary in terms of precision and accuracy.

Subsections