Will indexing the fields that I typically sort and use in selection criteria improve the processing speed of my scripts?
There is no simple yes or no answer to this question; however, understanding how DataEase processes a procedure can help you optimize your scripts. This Tech Tip tells you when and how DataEase uses a field's associated index file so you can make an informed decision as to when an index will speed up (or perhaps slow down) the processing time required to run each of your DQL Procedures.
DataEase uses a three phase process to execute a procedure that contains a Processing command (e.g., list records, modify records):
Record Selection phase - used only if there are selection criteria that includes an indexed field.
Qualification phase - used when there are selection criteria and one or more fields are not indexed.
Processing phase - used to process the data.
In a Processing procedure (one that includes a Processing command), the Processing phase is always used. Record selection and/or qualification are only used if the script includes a with statement. The availability and size of the indices in the selection criteria determine when and how DataEase processes a simple script.
Lets begin by taking a sample script and looking at how DataEase uses any available indexes.
for FAMILY MEMBERS
with STATE = "CA" and AGE STATUS = "ADULT" ;
FIRST NAME ;
Depending on which fields are indexed and their placement in the script, DataEase may or may not use an associated index file. Using just the selection criteria from the above example, lets take a look at each possibility:
with STATE = "CA" and AGE STATUS = "ADULT" ;
If both the STATE and AGE STATUS fields are indexed, DataEase normally uses both indices to generate an array of record numbers that meet each criterion. The arrays are then compared (in memory) for matching record numbers. Since this example uses an and operator, the intersection of the list is then processed by the list records portion of the script, as illustrated below:
STATE = "CA" AGE STATUS = "adult" 115392121157 Record Numbers for Processing 113688121157 11121157
In the second scenario, we use the identical with clause except that the and operator is replaced with the or operator:
with STATE = "CA" or AGE STATUS = "ADULT" ;
Once again, if both fields are indexed, DataEase uses both indices to generate two arrays of record numbers that meet each criterion. However, since we're using or instead of and, the union of the two lists is processed as shown on the next page.
STATE = "CA" AGE STATUS = "adult" 115392121157 Record Numbers for Processing 113688121157 1136538892121157
In the third scenario, using the and operator again, suppose the STATE field is indexed, but the AGE STATUS field is not:
with STATE = "CA" and AGE STATUS = "ADULT" ;
As before, DataEase uses the available index to generate an array of record numbers that meet the STATE = "CA" criterion. However, since the AGE STATUS field is not indexed, DataEase must qualify each of the records in the first array against the second criterion, as illustrated below:
STATE = "CA" and AGE STATUS = "adult"
115392121157
Each of the records selected in the array is qualified against the second criterion. Records that meet the second criterion are processed, but those that don't are discarded.
In the last scenario, suppose again that we are using the or operator and that the STATE field is indexed, the AGE STATUS field is not:
with STATE = "CA" or AGE STATUS = "ADULT" ;
In this scenario, DataEase does not use the index on the STATE field. Since every record in the FAMILY MEMBERS table must be qualified for the second criterion, there is no benefit in building an array of the matching STATE records. It is much faster for DataEase to qualify each record for both criteria simultaneously. Those records that qualify are passed on to the Processing phase of the procedure.
The order in which you list selection criteria can also have a dramatic impact on performance. If DataEase is going to use an index in selecting records, it automatically opens the index for the first field listed after the with command. For example, in the following with statement:
with STATE = "CA" and AGE STATUS = "adult"
if both fields are indexed, DataEase automatically opens the STATE field's index and builds the array of matching record numbers. If the second criteria's array is determined to be at least four times greater than the first (e.g., STATE = "CA" selects 1,000 records, while AGE STATUS = "ADULT" selects 6,000 records), DataEase does not take the time to build the second array. It is faster to simply qualify the first 1,000 records for the second criterion. If, however, the order of the fields is reversed (e.g., AGE STATUS = "ADULT" and STATE = "CA"), DataEase automatically opens and uses the index on the AGE STATUS field, forcing the comparison of the two arrays in memory and potentially slowing down performance. Therefore, you should always list the fields in your selection criteria in order from most discriminating to least discriminating.
Now let's look at the original script and consider the use of indexes when you sort the procedure's output.
for FAMILY MEMBERS
with STATE = "CA" and AGE STATUS = "ADULT" ;
FIRST NAME ;
In this example, we list the LAST NAME field in order. If the LAST NAME field is indexed, DataEase only uses the index if an index was not used in the Selection phase of processing (i.e., if neither the STATE field nor the AGE STATUS fields are indexed). If DataEase does use an index in selecting records, the index on the sorted field is not used.
When using an index for sorting the records, DataEase uses only the index on the primary sort level. For example, in the script shown above, if the FIRST NAME field was also indexed and listed in order, DataEase would use the index on FIRST NAME and ignore the index on LAST NAME.
Always index the Match (key) fields in a relationship. Both the Primary and foreign keys should be indexed.
Indexes are most effective in improving DQL processing speed when they select a relatively small number of records. Therefore, indexing a choice field with two or three choices, or a Yes/No field (where either selection is likely to select a large percentage of the records in the file) can be a particularly ineffective use of an index.