Duplicates in Panel Data

Before estimating panel data regressions, Stata requires the data to be declared as panel data. We can do so with:

  

     tsset id year

Where id is the panel (firm) identifier and year is the time identifier. Due to data entry errors, a data set sometimes contains multiple records for the same year within a given id. In such cases, Stata returns the error:

   

     repeated time values within panel

        r(451);
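To make the error concrete, here is a minimal sketch that reproduces it on a made-up data set. The variable x and all values are hypothetical and only serve as an illustration. It also runs isid, a built-in Stata command that checks whether id and year jointly identify each observation.

```stata
* Hypothetical toy data: firm 1 has two records for 2001
clear
input id year x
1 2000 10
1 2001 11
1 2001 12
2 2000 20
2 2001 21
end

* capture lets the do-file continue past the error; _rc stores the return code
capture tsset id year
local rc = _rc
display "tsset return code: `rc'"

* isid fails when id and year do not uniquely identify the observations
capture isid id year
display "isid return code: " _rc
```

Using capture is convenient in a do-file because the script can test the return code and react to it instead of stopping at the first error.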

 

Searching for the duplicate values manually is often tedious. I have written a small program that identifies such repeated values and can drop the duplicates if the user so desires. Download it here and copy it to 

   

    C:\ado\personal

To use it, you need to type in the Stata command line:

   

    dup id year

It will generate a variable dup_obs, which takes the value 0 if an observation has no duplicates and the values 1, 2, 3, and so on depending on the number of duplicated values. If you wish to drop duplicates (be careful not to delete good observations), type:

    dup id year, drop
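If the program is not installed, a similar result can probably be obtained with Stata's built-in duplicates command. The sketch below is an alternative under that assumption, not the code behind dup, and the toy data (including the variable x) is hypothetical.

```stata
* Hypothetical toy data: firm 1 has two records for 2001
clear
input id year x
1 2000 10
1 2001 11
1 2001 12
2 2000 20
2 2001 21
end

* Summarize and list the duplicated id-year pairs before deleting anything
duplicates report id year
duplicates list id year

* force is required when a variable list is given;
* only the first record in each id-year group is kept
duplicates drop id year, force

* The panel declaration should now go through
tsset id year
```

Note that the two 2001 records for firm 1 carry different values of x, so dropping one of them discards information. This is why it is worth inspecting the duplicates before deleting them.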

If you wish to see the duplicate observations, type:

    browse if dup_obs > 0
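A counter similar in spirit to dup_obs can be sketched with by-group processing. This is an assumption about how such a variable might be built, not the program's actual code, so the numbering may differ from what dup produces. The name dup_seq and the toy data are hypothetical.

```stata
* Hypothetical toy data: firm 1 has two records for 2001
clear
input id year x
1 2000 10
1 2001 11
1 2001 12
2 2000 20
2 2001 21
end

* Within each id-year group: 0 if the record is unique,
* otherwise 1, 2, 3, ... for each copy in the group
sort id year
quietly by id year: generate dup_seq = cond(_N == 1, 0, _n)

* list prints to the Results window, which suits do-files and logs better than browse
list if dup_seq > 0, sepby(id)
```

The built-in command duplicates tag id year, generate(newvar) is a related option: it stores, for every observation, the number of other observations that share the same id and year.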
