suprtool home page

Many-to-Many Links with Suprlink

By Neil Armstrong

About Suprlink

Over the years, many customers have come to us asking for the ability to do many-to-many links, instead of just one-to-one and many-to-one. François Desrochers, wrote a script to use Suprtool to achieve this, but it involved multiple passes over the same file. While it was an ingenious solution, we still were not happy with the performance.

In the office, Bob was not looking very happy. After having spent the past year focused on the management of Robelle and only occasionally getting a chance to program with Qedit, I could tell what Bob needed: some programming time.

I assigned him the task of getting many-to-many working over the Christmas holidays. Bob dug in, and the result is that Suprlink can now join files together that have multiple key value matches in each file.

Suprlink has traditionally been able to link an Input file containing many records with the same key to a Link file that has a single record per key value.

The Join command, will now link two files with many occurrences of the key value in the input file and many occurrences in the "Linking" file.

The syntax of the Join command is exactly the same as for the Link command, so a sample task would look as follows:

   +input ordhist
   +join  orders
   +output custord
   +xeq

The above task will link multiple records of the file ordhist, to the multiple records of the file in orders. This assumes that the files are sorted by a common key. In SQL terms this is known as an Inner Join.

An Outer Join, one where the keys do not necessarily have a match can be achieved by adding the "optional" keyword to the Join command:

   +input ordhist
   +join  orders optional
   +output joined
   +xeq

In SQL parlance, once again you can achieve both a Left Outer Join and Right Outer Join by reversing the order of the files, between the input and the join commands.

To give you an example of how the Join operation would work consider the following data. First we have an inventory file with multiple records for the same product-no. This data is stored in the file dinv:

   50512001  {Rest of inventory data}
   50512001  {Rest of inventory data}
   50512003  {Rest of inventory data}

The next file will have sales records, once again with multiple key values, this data is stored in the file dsales:

   50512001  {Rest of sales data}
   50512001  {Rest of sales data}

Assume you did the following task, with both files sorted by the product-no:

  +in dinv
  +join dsales
  +out invsales
  +xeq

The resulting file would have four records, with the multiple matching dinv and dsales records. The record layout would have the dinv information first followed by the dsales information.

If you add the optional keyword on the join command the resulting file would have 5 records. The matching 4 records from dinv and dsales as well as the dinv record that did not match with the numeric fields set to zero and the byte fields set to spaces.

Only one Join operation is allowed per task.

By default, Suprlink will join files base on the primary sorted key in the self-describing file. You can specify a secondary key for the files to be joined on in a similar manner to how the Link command did:

  +in orders
  +join dsales by order-no product-no
  +out ordsales
  +xeq

You can combine the Join command with several Link commands if you like. Remember, the Join operation is done first (to determine how many output records there will be). Then the Link commands are done to fill out additional columns in those records, or to drop the record if the Link is not found (and Optional is not specified).

Our performance measurements have shown that the Join command is a minimum of 36 times faster than the command file solution. This new feature is available as of Suprtool version 4.6.03.


Old Methods of Doing Many-to-Many Links

By François Desrochers

Suprlink is a Suprtool companion program that allows linking a master file to 1 or more files (up to 7) based on a common key. Although Suprlink is extremely fast, it has some constraints. One of the most common questions we get about Suprlink is:

We have 2 files that we want to link together. Each unique key value in both files can have multiple records. Suprlink only finds the first record in the Link file. How can we get Suprlink to link all the other records?
Unfortunately, Suprlink is not designed to do this. However, with a little bit of ingenuity and knowledge, this can be done with a command file on MPE or a shell script on HP-UX. For simplicity and clarity, we will simply call them scripts for the rest of this article.

The basic idea is to satisfy Suprlink's constraint. The scripts are doing so by running Suprlink multiple times on a modified set of records in each iteration.

We had this task assigned to 2 programmers and they came up with 2 different solutions.

The Single Record Approach

Since Suprlink expects to have one record per unique key value in the link file, this approach ensures the constraint is always true by extracting records from the input file one by one. The HP-UX shell script is multilnk.ux.txt and the MPE command file is multilnk.mpe.txt.

Since the script performs a link task for each record in the input file, you should use the smallest file as the input file. This way you will minimize the overall number of tasks. If you run into problems, you should list the content of multilog. This file contains some of the execution messages.

The Duplicate Keys Approach

The second approach takes advantage of Suprlink's internal logic. It reads all the records from the input file and links them with the first matching record in the link file. This means that, if you get rid of the first record of each series, the next Suprlink task is going to link with the second record in the series. The script simply repeats the link task until all the records have been used.

The HP-UX shell script is linkcmd.ux.txt and the MPE command file is linkcmd.mpe.txt.

In order to get rid of processed records, the script uses Suprtool's Duplicate Only command. This command extracts records with a duplicate key value. For example, if the link file contains:

   12345Record1
   12345Record2
   12345Record3
   54321Record1
   54321Record2
After the first Duplicate Only task, the file will contain:
   12345Record2
   12345Record3
   54321Record2
This means the script performs the same number of tasks as the largest group of records with a unique key value. In this example, the script will perform 3 Suprlink tasks as the largest key value group "12345" has 3 records. For best performance, you have to know the data. Suprtool's Duplicate None Keys Count might come in handy to get a record count of each record group. If you run into problems, you should list the content of mtomlog. This file contains some of the execution messages.

Script Usage

All scripts accept 3 parameters:

The scripts perform basic argument validation before proceeding with the link tasks. The order in which you specify the input and link file determines the performance but also the field order in the output file. You might have to process the file one more time to shuffle the fields if needed.

Also, note that the scripts work with one input file and one link file. That is different from regular Suprlink tasks which allows up to 7 link files. If you need to link more than 2 files, you should execute the script with 2 files then execute it again using the output file from the previous execution.