
ECONOMETRICS BOOK


    Greene 50240

    gree50240 FM

    July 10 2002

    12 51

    FIFTH EDITION

    ECONOMETRIC ANALYSIS

    William H. Greene
    New York University

    Upper Saddle River, New Jersey 07458


    CIP data to come

    Executive Editor: Rod Banister
    Editor-in-Chief: P. J. Boardman
    Managing Editor: Gladys Soto
    Assistant Editor: Marie McHale
    Editorial Assistant: Lisa Amato
    Senior Media Project Manager: Victoria Anderson
    Executive Marketing Manager: Kathleen McLellan
    Marketing Assistant: Christopher Bath
    Managing Editor (Production): Cynthia Regan
    Production Editor: Michael Reynolds
    Production Assistant: Dianne Falcone
    Permissions Supervisor: Suzanne Grappi
    Associate Director, Manufacturing: Vinnie Scelta
    Cover Designer: Kiwi Design
    Cover Photo: Anthony Bannister/Corbis
    Composition: Interactive Composition Corporation
    Printer/Binder: Courier/Westford
    Cover Printer: Coral Graphics

    Credits and acknowledgments borrowed from other sources and reproduced, with permission, in this textbook appear on appropriate page within text (or on page XX).

    Copyright 2003, 2000, 1997, 1993 by Pearson Education, Inc., Upper Saddle River, New Jersey 07458. All rights reserved. Printed in the United States of America. This publication is protected by Copyright and permission should be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permission(s), write to: Rights and Permissions Department.

    Pearson Education LTD.
    Pearson Education Australia PTY, Limited
    Pearson Education Singapore, Pte. Ltd
    Pearson Education North Asia Ltd
    Pearson Education, Canada, Ltd
    Pearson Educación de Mexico, S.A. de C.V.
    Pearson Education, Japan
    Pearson Education Malaysia, Pte. Ltd

    10 9 8 7 6 5 4 3 2 1
    ISBN 0-13-066189-9


    For Margaret and Richard Greene


    BRIEF CONTENTS

    Chapter 1 Introduction
    Chapter 2 The Classical Multiple Linear Regression Model
    Chapter 3 Least Squares
    Chapter 4 Finite Sample Properties of the Least Squares Estimator
    Chapter 5 Large Sample Properties of the Least Squares and Instrumental Variables Estimators
    Chapter 6 Inference and Prediction
    Chapter 7 Functional Form and Structural Change
    Chapter 8 Specification Analysis and Model Selection
    Chapter 9 Nonlinear Regression Models
    Chapter 10 Nonspherical Disturbances: The Generalized Regression Model
    Chapter 11 Heteroscedasticity
    Chapter 12 Serial Correlation
    Chapter 13 Models for Panel Data
    Chapter 14 Systems of Regression Equations
    Chapter 15 Simultaneous Equations Models
    Chapter 16 Estimation Frameworks in Econometrics
    Chapter 17 Maximum Likelihood Estimation
    Chapter 18 The Generalized Method of Moments
    Chapter 19 Models with Lagged Variables
    Chapter 20 Time Series Models
    Chapter 21 Models for Discrete Choice
    Chapter 22 Limited Dependent Variable and Duration Models
    Appendix A Matrix Algebra
    Appendix B Probability and Distribution Theory
    Appendix C Estimation and Inference
    Appendix D Large Sample Distribution Theory
    Appendix E Computation and Optimization
    Appendix F Data Sets Used in Applications
    Appendix G Statistical Tables
    References
    Author Index
    Subject Index

    CONTENTS

    CHAPTER 1 Introduction
        1.1 Econometrics
        1.2 Econometric Modeling
        1.3 Data and Methodology
        1.4 Plan of the Book

    CHAPTER 2 The Classical Multiple Linear Regression Model
        2.1 Introduction
        2.2 The Linear Regression Model
        2.3 Assumptions of the Classical Linear Regression Model
            2.3.1 Linearity of the Regression Model
            2.3.2 Full Rank
            2.3.3 Regression
            2.3.4 Spherical Disturbances
            2.3.5 Data Generating Process for the Regressors
            2.3.6 Normality
        2.4 Summary and Conclusions

    CHAPTER 3 Least Squares
        3.1 Introduction
        3.2 Least Squares Regression
            3.2.1 The Least Squares Coefficient Vector
            3.2.2 Application: An Investment Equation
            3.2.3 Algebraic Aspects of The Least Squares Solution
            3.2.4 Projection
        3.3 Partitioned Regression and Partial Regression
        3.4 Partial Regression and Partial Correlation Coefficients
        3.5 Goodness of Fit and the Analysis of Variance
            3.5.1 The Adjusted R-Squared and a Measure of Fit
            3.5.2 R-Squared and the Constant Term in the Model
            3.5.3 Comparing Models
        3.6 Summary and Conclusions

    CHAPTER 4 Finite Sample Properties of the Least Squares Estimator
        4.1 Introduction
        4.2 Motivating Least Squares
            4.2.1 The Population Orthogonality Conditions
            4.2.2 Minimum Mean Squared Error Predictor
            4.2.3 Minimum Variance Linear Unbiased Estimation
        4.3 Unbiased Estimation
        4.4 The Variance of the Least Squares Estimator and the Gauss-Markov Theorem
        4.5 The Implications of Stochastic Regressors
        4.6 Estimating the Variance of the Least Squares Estimator
        4.7 The Normality Assumption and Basic Statistical Inference
            4.7.1 Testing a Hypothesis About a Coefficient
            4.7.2 Confidence Intervals for Parameters
            4.7.3 Confidence Interval for a Linear Combination of Coefficients: The Oaxaca Decomposition
            4.7.4 Testing the Significance of the Regression
            4.7.5 Marginal Distributions of the Test Statistics
        4.8 Finite Sample Properties of Least Squares
        4.9 Data Problems
            4.9.1 Multicollinearity
            4.9.2 Missing Observations
            4.9.3 Regression Diagnostics and Influential Data Points
        4.10 Summary and Conclusions

    CHAPTER 5 Large Sample Properties of the Least Squares and Instrumental Variables Estimators
        5.1 Introduction
        5.2 Asymptotic Properties of the Least Squares Estimator
            5.2.1 Consistency of the Least Squares Estimator of β
            5.2.2 Asymptotic Normality of the Least Squares Estimator
            5.2.3 Consistency of s² and the Estimator of Asy. Var[b]
            5.2.4 Asymptotic Distribution of a Function of b: The Delta Method
            5.2.5 Asymptotic Efficiency
        5.3 More General Cases
            5.3.1 Heterogeneity in the Distributions of x_i
            5.3.2 Dependent Observations
        5.4 Instrumental Variable and Two Stage Least Squares Estimation
        5.5 Hausman's Specification Test and an Application to Instrumental Variable Estimation

        5.6 Measurement Error
            5.6.1 Least Squares Attenuation
            5.6.2 Instrumental Variables Estimation
            5.6.3 Proxy Variables
            5.6.4 Application: Income and Education and a Study of Twins
        5.7 Summary and Conclusions

    CHAPTER 6 Inference and Prediction
        6.1 Introduction
        6.2 Restrictions and Nested Models
        6.3 Two Approaches to Testing Hypotheses
            6.3.1 The F Statistic and the Least Squares Discrepancy
            6.3.2 The Restricted Least Squares Estimator
            6.3.3 The Loss of Fit from Restricted Least Squares
        6.4 Nonnormal Disturbances and Large Sample Tests
        6.5 Testing Nonlinear Restrictions
        6.6 Prediction
        6.7 Summary and Conclusions

    CHAPTER 7 Functional Form and Structural Change
        7.1 Introduction
        7.2 Using Binary Variables
            7.2.1 Binary Variables in Regression
            7.2.2 Several Categories
            7.2.3 Several Groupings
            7.2.4 Threshold Effects and Categorical Variables
            7.2.5 Spline Regression
        7.3 Nonlinearity in the Variables
            7.3.1 Functional Forms
            7.3.2 Identifying Nonlinearity
            7.3.3 Intrinsic Linearity and Identification
        7.4 Modeling and Testing for a Structural Break
            7.4.1 Different Parameter Vectors
            7.4.2 Insufficient Observations
            7.4.3 Change in a Subset of Coefficients
            7.4.4 Tests of Structural Break with Unequal Variances
        7.5 Tests of Model Stability
            7.5.1 Hansen's Test
            7.5.2 Recursive Residuals and the CUSUMS Test
            7.5.3 Predictive Test
            7.5.4 Unknown Timing of the Structural Break
        7.6 Summary and Conclusions

    CHAPTER 8 Specification Analysis and Model Selection
        8.1 Introduction
        8.2 Specification Analysis and Model Building
            8.2.1 Bias Caused by Omission of Relevant Variables
            8.2.2 Pretest Estimation
            8.2.3 Inclusion of Irrelevant Variables
            8.2.4 Model Building: A General to Simple Strategy
        8.3 Choosing Between Nonnested Models
            8.3.1 Testing Nonnested Hypotheses
            8.3.2 An Encompassing Model
            8.3.3 Comprehensive Approach: The J Test
            8.3.4 The Cox Test
        8.4 Model Selection Criteria
        8.5 Summary and Conclusions

    CHAPTER 9 Nonlinear Regression Models
        9.1 Introduction
        9.2 Nonlinear Regression Models
            9.2.1 Assumptions of the Nonlinear Regression Model
            9.2.2 The Orthogonality Condition and the Sum of Squares
            9.2.3 The Linearized Regression
            9.2.4 Large Sample Properties of the Nonlinear Least Squares Estimator
            9.2.5 Computing the Nonlinear Least Squares Estimator
        9.3 Applications
            9.3.1 A Nonlinear Consumption Function
            9.3.2 The Box-Cox Transformation
        9.4 Hypothesis Testing and Parametric Restrictions
            9.4.1 Significance Tests for Restrictions: F and Wald Statistics
            9.4.2 Tests Based on the LM Statistic
            9.4.3 A Specification Test for Nonlinear Regressions: The P_E Test
        9.5 Alternative Estimators for Nonlinear Regression Models
            9.5.1 Nonlinear Instrumental Variables Estimation
            9.5.2 Two Step Nonlinear Least Squares Estimation
            9.5.3 Two Step Estimation of a Credit Scoring Model
        9.6 Summary and Conclusions

    CHAPTER 10 Nonspherical Disturbances: The Generalized Regression Model
        10.1 Introduction
        10.2 Least Squares and Instrumental Variables Estimation
            10.2.1 Finite Sample Properties of Ordinary Least Squares
            10.2.2 Asymptotic Properties of Least Squares
            10.2.3 Asymptotic Properties of Nonlinear Least Squares

            10.2.4 Asymptotic Properties of the Instrumental Variables Estimator
        10.3 Robust Estimation of Asymptotic Covariance Matrices
        10.4 Generalized Method of Moments Estimation
        10.5 Efficient Estimation by Generalized Least Squares
            10.5.1 Generalized Least Squares (GLS)
            10.5.2 Feasible Generalized Least Squares
        10.6 Maximum Likelihood Estimation
        10.7 Summary and Conclusions

    CHAPTER 11 Heteroscedasticity
        11.1 Introduction
        11.2 Ordinary Least Squares Estimation
            11.2.1 Inefficiency of Least Squares
            11.2.2 The Estimated Covariance Matrix of b
            11.2.3 Estimating the Appropriate Covariance Matrix for Ordinary Least Squares
        11.3 GMM Estimation of the Heteroscedastic Regression Model
        11.4 Testing for Heteroscedasticity
            11.4.1 White's General Test
            11.4.2 The Goldfeld-Quandt Test
            11.4.3 The Breusch-Pagan/Godfrey LM Test
        11.5 Weighted Least Squares When Ω is Known
        11.6 Estimation When Ω Contains Unknown Parameters
            11.6.1 Two Step Estimation
            11.6.2 Maximum Likelihood Estimation
            11.6.3 Model Based Tests for Heteroscedasticity
        11.7 Applications
            11.7.1 Multiplicative Heteroscedasticity
            11.7.2 Groupwise Heteroscedasticity
        11.8 Autoregressive Conditional Heteroscedasticity
            11.8.1 The ARCH(1) Model
            11.8.2 ARCH(q), ARCH in Mean and Generalized ARCH Models
            11.8.3 Maximum Likelihood Estimation of the GARCH Model
            11.8.4 Testing for GARCH Effects
            11.8.5 Pseudo Maximum Likelihood Estimation
        11.9 Summary and Conclusions

    CHAPTER 12 Serial Correlation
        12.1 Introduction
        12.2 The Analysis of Time Series Data
        12.3 Disturbance Processes

            12.3.1 Characteristics of Disturbance Processes
            12.3.2 AR(1) Disturbances
        12.4 Some Asymptotic Results for Analyzing Time Series Data
            12.4.1 Convergence of Moments: The Ergodic Theorem
            12.4.2 Convergence to Normality: A Central Limit Theorem
        12.5 Least Squares Estimation
            12.5.1 Asymptotic Properties of Least Squares
            12.5.2 Estimating the Variance of the Least Squares Estimator
        12.6 GMM Estimation
        12.7 Testing for Autocorrelation
            12.7.1 Lagrange Multiplier Test
            12.7.2 Box and Pierce's Test and Ljung's Refinement
            12.7.3 The Durbin-Watson Test
            12.7.4 Testing in the Presence of a Lagged Dependent Variables
            12.7.5 Summary of Testing Procedures
        12.8 Efficient Estimation When Ω Is Known
        12.9 Estimation When Ω Is Unknown
            12.9.1 AR(1) Disturbances
            12.9.2 AR(2) Disturbances
            12.9.3 Application: Estimation of a Model with Autocorrelation
            12.9.4 Estimation with a Lagged Dependent Variable
        12.10 Common Factors
        12.11 Forecasting in the Presence of Autocorrelation
        12.12 Summary and Conclusions

    CHAPTER 13 Models for Panel Data
        13.1 Introduction
        13.2 Panel Data Models
        13.3 Fixed Effects
            13.3.1 Testing the Significance of the Group Effects
            13.3.2 The Within and Between Groups Estimators
            13.3.3 Fixed Time and Group Effects
            13.3.4 Unbalanced Panels and Fixed Effects
        13.4 Random Effects
            13.4.1 Generalized Least Squares
            13.4.2 Feasible Generalized Least Squares When Σ Is Unknown
            13.4.3 Testing for Random Effects
            13.4.4 Hausman's Specification Test for the Random Effects Model
        13.5 Instrumental Variables Estimation of the Random Effects Model
        13.6 GMM Estimation of Dynamic Panel Data Models
        13.7 Nonspherical Disturbances and Robust Covariance Estimation
            13.7.1 Robust Estimation of the Fixed Effects Model

            13.7.2 Heteroscedasticity in the Random Effects Model
            13.7.3 Autocorrelation in Panel Data Models
        13.8 Random Coefficients Models
        13.9 Covariance Structures for Pooled Time Series Cross Sectional Data
            13.9.1 Generalized Least Squares Estimation
            13.9.2 Feasible GLS Estimation
            13.9.3 Heteroscedasticity and the Classical Model
            13.9.4 Specification Tests
            13.9.5 Autocorrelation
            13.9.6 Maximum Likelihood Estimation
            13.9.7 Application to Grunfeld's Investment Data
            13.9.8 Summary
        13.10 Summary and Conclusions

    CHAPTER 14 Systems of Regression Equations
        14.1 Introduction
        14.2 The Seemingly Unrelated Regressions Model
            14.2.1 Generalized Least Squares
            14.2.2 Seemingly Unrelated Regressions with Identical Regressors
            14.2.3 Feasible Generalized Least Squares
            14.2.4 Maximum Likelihood Estimation
            14.2.5 An Application from Financial Econometrics: The Capital Asset Pricing Model
            14.2.6 Maximum Likelihood Estimation of the Seemingly Unrelated Regressions Model with a Block of Zeros in the Coefficient Matrix
            14.2.7 Autocorrelation and Heteroscedasticity
        14.3 Systems of Demand Equations: Singular Systems
            14.3.1 Cobb-Douglas Cost Function
            14.3.2 Flexible Functional Forms: The Translog Cost Function
        14.4 Nonlinear Systems and GMM Estimation
            14.4.1 GLS Estimation
            14.4.2 Maximum Likelihood Estimation
            14.4.3 GMM Estimation
        14.5 Summary and Conclusions

    CHAPTER 15 Simultaneous Equations Models
        15.1 Introduction
        15.2 Fundamental Issues in Simultaneous Equations Models
            15.2.1 Illustrative Systems of Equations
            15.2.2 Endogeneity and Causality
            15.2.3 A General Notation for Linear Simultaneous Equations Models
        15.3 The Problem of Identification

            15.3.1 The Rank and Order Conditions for Identification
            15.3.2 Identification Through Other Nonsample Information
            15.3.3 Identification Through Covariance Restrictions: The Fully Recursive Model
        15.4 Methods of Estimation
        15.5 Single Equation: Limited Information Estimation Methods
            15.5.1 Ordinary Least Squares
            15.5.2 Estimation by Instrumental Variables
            15.5.3 Two Stage Least Squares
            15.5.4 GMM Estimation
            15.5.5 Limited Information Maximum Likelihood and the k Class of Estimators
            15.5.6 Two Stage Least Squares in Models That Are Nonlinear in Variables
        15.6 System Methods of Estimation
            15.6.1 Three Stage Least Squares
            15.6.2 Full Information Maximum Likelihood
            15.6.3 GMM Estimation
            15.6.4 Recursive Systems and Exactly Identified Equations
        15.7 Comparison of Methods: Klein's Model I
        15.8 Specification Tests
        15.9 Properties of Dynamic Models
            15.9.1 Dynamic Models and Their Multipliers
            15.9.2 Stability
            15.9.3 Adjustment to Equilibrium
        15.10 Summary and Conclusions

    CHAPTER 16 Estimation Frameworks in Econometrics
        16.1 Introduction
        16.2 Parametric Estimation and Inference
            16.2.1 Classical Likelihood Based Estimation
            16.2.2 Bayesian Estimation
                16.2.2.a Bayesian Analysis of the Classical Regression Model
                16.2.2.b Point Estimation
                16.2.2.c Interval Estimation
                16.2.2.d Estimation with an Informative Prior Density
                16.2.2.e Hypothesis Testing
            16.2.3 Using Bayes Theorem in a Classical Estimation Problem: The Latent Class Model
            16.2.4 Hierarchical Bayes Estimation of a Random Parameters Model by Markov Chain Monte Carlo Simulation
        16.3 Semiparametric Estimation
            16.3.1 GMM Estimation in Econometrics
            16.3.2 Least Absolute Deviations Estimation

            16.3.3 Partially Linear Regression
            16.3.4 Kernel Density Methods
        16.4 Nonparametric Estimation
            16.4.1 Kernel Density Estimation
            16.4.2 Nonparametric Regression
        16.5 Properties of Estimators
            16.5.1 Statistical Properties of Estimators
            16.5.2 Extremum Estimators
            16.5.3 Assumptions for Asymptotic Properties of Extremum Estimators
            16.5.4 Asymptotic Properties of Estimators
            16.5.5 Testing Hypotheses
        16.6 Summary and Conclusions

    CHAPTER 17 Maximum Likelihood Estimation
        17.1 Introduction
        17.2 The Likelihood Function and Identification of the Parameters
        17.3 Efficient Estimation: The Principle of Maximum Likelihood
        17.4 Properties of Maximum Likelihood Estimators
            17.4.1 Regularity Conditions
            17.4.2 Properties of Regular Densities
            17.4.3 The Likelihood Equation
            17.4.4 The Information Matrix Equality
            17.4.5 Asymptotic Properties of the Maximum Likelihood Estimator
                17.4.5.a Consistency
                17.4.5.b Asymptotic Normality
                17.4.5.c Asymptotic Efficiency
                17.4.5.d Invariance
                17.4.5.e Conclusion
            17.4.6 Estimating the Asymptotic Variance of the Maximum Likelihood Estimator
            17.4.7 Conditional Likelihoods and Econometric Models
        17.5 Three Asymptotically Equivalent Test Procedures
            17.5.1 The Likelihood Ratio Test
            17.5.2 The Wald Test
            17.5.3 The Lagrange Multiplier Test
            17.5.4 An Application of the Likelihood Based Test Procedures
        17.6 Applications of Maximum Likelihood Estimation
            17.6.1 The Normal Linear Regression Model
            17.6.2 Maximum Likelihood Estimation of Nonlinear Regression Models
            17.6.3 Nonnormal Disturbances: The Stochastic Frontier Model
            17.6.4 Conditional Moment Tests of Specification

        17.7 Two Step Maximum Likelihood Estimation
        17.8 Maximum Simulated Likelihood Estimation
        17.9 Pseudo Maximum Likelihood Estimation and Robust Asymptotic Covariance Matrices
        17.10 Summary and Conclusions

    CHAPTER 18 The Generalized Method of Moments
        18.1 Introduction
        18.2 Consistent Estimation: The Method of Moments
            18.2.1 Random Sampling and Estimating the Parameters of Distributions
            18.2.2 Asymptotic Properties of the Method of Moments Estimator
            18.2.3 Summary: The Method of Moments
        18.3 The Generalized Method of Moments (GMM) Estimator
            18.3.1 Estimation Based on Orthogonality Conditions
            18.3.2 Generalizing the Method of Moments
            18.3.3 Properties of the GMM Estimator
            18.3.4 GMM Estimation of Some Specific Econometric Models
        18.4 Testing Hypotheses in the GMM Framework
            18.4.1 Testing the Validity of the Moment Restrictions
            18.4.2 GMM Counterparts to the Wald, LM, and LR Tests
        18.5 Application: GMM Estimation of a Dynamic Panel Data Model of Local Government Expenditures
        18.6 Summary and Conclusions

    CHAPTER 19 Models with Lagged Variables
        19.1 Introduction
        19.2 Dynamic Regression Models
            19.2.1 Lagged Effects in a Dynamic Model
            19.2.2 The Lag and Difference Operators
            19.2.3 Specification Search for the Lag Length
        19.3 Simple Distributed Lag Models
            19.3.1 Finite Distributed Lag Models
            19.3.2 An Infinite Lag Model: The Geometric Lag Model
        19.4 Autoregressive Distributed Lag Models
            19.4.1 Estimation of the ARDL Model
            19.4.2 Computation of the Lag Weights in the ARDL Model
            19.4.3 Stability of a Dynamic Equation
            19.4.4 Forecasting
        19.5 Methodological Issues in the Analysis of Dynamic Models
            19.5.1 An Error Correction Model
            19.5.2 Autocorrelation

            19.5.3 Specification Analysis
            19.5.4 Common Factor Restrictions
        19.6 Vector Autoregressions
            19.6.1 Model Forms
            19.6.2 Estimation
            19.6.3 Testing Procedures
            19.6.4 Exogeneity
            19.6.5 Testing for Granger Causality
            19.6.6 Impulse Response Functions
            19.6.7 Structural VARs
            19.6.8 Application: Policy Analysis with a VAR
            19.6.9 VARs in Microeconomics
        19.7 Summary and Conclusions

    CHAPTER 20 Time Series Models
        20.1 Introduction
        20.2 Stationary Stochastic Processes
            20.2.1 Autoregressive Moving Average Processes
            20.2.2 Stationarity and Invertibility
            20.2.3 Autocorrelations of a Stationary Stochastic Process
            20.2.4 Partial Autocorrelations of a Stationary Stochastic Process
            20.2.5 Modeling Univariate Time Series
            20.2.6 Estimation of the Parameters of a Univariate Time Series
            20.2.7 The Frequency Domain
                20.2.7.a Theoretical Results
                20.2.7.b Empirical Counterparts
        20.3 Nonstationary Processes and Unit Roots
            20.3.1 Integrated Processes and Differencing
            20.3.2 Random Walks, Trends, and Spurious Regressions
            20.3.3 Tests for Unit Roots in Economic Data
            20.3.4 The Dickey-Fuller Tests
            20.3.5 Long Memory Models
        20.4 Cointegration
            20.4.1 Common Trends
            20.4.2 Error Correction and VAR Representations
            20.4.3 Testing for Cointegration
            20.4.4 Estimating Cointegration Relationships
            20.4.5 Application: German Money Demand
                20.4.5.a Cointegration Analysis and a Long Run Theoretical Model
                20.4.5.b Testing for Model Instability
        20.5 Summary and Conclusions

    CHAPTER 21 Models for Discrete Choice
        21.1 Introduction
        21.2 Discrete Choice Models
        21.3 Models for Binary Choice
            21.3.1 The Regression Approach
            21.3.2 Latent Regression: Index Function Models
            21.3.3 Random Utility Models
        21.4 Estimation and Inference in Binary Choice Models
            21.4.1 Robust Covariance Matrix Estimation
            21.4.2 Marginal Effects
            21.4.3 Hypothesis Tests
            21.4.4 Specification Tests for Binary Choice Models
                21.4.4.a Omitted Variables
                21.4.4.b Heteroscedasticity
                21.4.4.c A Specification Test for Nonnested Models: Testing for the Distribution
            21.4.5 Measuring Goodness of Fit
            21.4.6 Analysis of Proportions Data
        21.5 Extensions of the Binary Choice Model
            21.5.1 Random and Fixed Effects Models for Panel Data
                21.5.1.a Random Effects Models
                21.5.1.b Fixed Effects Models
            21.5.2 Semiparametric Analysis
            21.5.3 The Maximum Score Estimator (MSCORE)
            21.5.4 Semiparametric Estimation
            21.5.5 A Kernel Estimator for a Nonparametric Regression Function
            21.5.6 Dynamic Binary Choice Models
        21.6 Bivariate and Multivariate Probit Models
            21.6.1 Maximum Likelihood Estimation
            21.6.2 Testing for Zero Correlation
            21.6.3 Marginal Effects
            21.6.4 Sample Selection
            21.6.5 A Multivariate Probit Model
            21.6.6 Application: Gender Economics Courses in Liberal Arts Colleges
        21.7 Logit Models for Multiple Choices
            21.7.1 The Multinomial Logit Model
            21.7.2 The Conditional Logit Model
            21.7.3 The Independence from Irrelevant Alternatives
            21.7.4 Nested Logit Models
            21.7.5 A Heteroscedastic Logit Model
            21.7.6 Multinomial Models Based on the Normal Distribution
            21.7.7 A Random Parameters Model

            21.7.8 Application: Conditional Logit Model for Travel Mode Choice
        21.8 Ordered Data
        21.9 Models for Count Data
            21.9.1 Measuring Goodness of Fit
            21.9.2 Testing for Overdispersion
            21.9.3 Heterogeneity and the Negative Binomial Regression Model
            21.9.4 Application: The Poisson Regression Model
            21.9.5 Poisson Models for Panel Data
            21.9.6 Hurdle and Zero-Altered Poisson Models
        21.10 Summary and Conclusions

    CHAPTER 22 Limited Dependent Variable and Duration Models
        22.1 Introduction
        22.2 Truncation
            22.2.1 Truncated Distributions
            22.2.2 Moments of Truncated Distributions
            22.2.3 The Truncated Regression Model
        22.3 Censored Data
            22.3.1 The Censored Normal Distribution
            22.3.2 The Censored Regression (Tobit) Model
            22.3.3 Estimation
            22.3.4 Some Issues in Specification
                22.3.4.a Heteroscedasticity
                22.3.4.b Misspecification of Prob[y* < 0]
                22.3.4.c Nonnormality
                22.3.4.d Conditional Moment Tests
            22.3.5 Censoring and Truncation in Models for Counts
            22.3.6 Application: Censoring in the Tobit and Poisson Regression Models
        22.4 The Sample Selection Model
            22.4.1 Incidental Truncation in a Bivariate Distribution
            22.4.2 Regression in a Model of Selection
            22.4.3 Estimation
            22.4.4 Treatment Effects
            22.4.5 The Normality Assumption
            22.4.6 Selection in Qualitative Response Models
        22.5 Models for Duration Data
            22.5.1 Duration Data
            22.5.2 A Regression-Like Approach: Parametric Models of Duration
                22.5.2.a Theoretical Background
                22.5.2.b Models of the Hazard Function
                22.5.2.c Maximum Likelihood Estimation

                22.5.2.d Exogenous Variables
                22.5.2.e Heterogeneity
            22.5.3 Other Approaches
        22.6 Summary and Conclusions

    APPENDIX A Matrix Algebra
        A.1 Terminology
        A.2 Algebraic Manipulation of Matrices
            A.2.1 Equality of Matrices
            A.2.2 Transposition
            A.2.3 Matrix Addition
            A.2.4 Vector Multiplication
            A.2.5 A Notation for Rows and Columns of a Matrix
            A.2.6 Matrix Multiplication and Scalar Multiplication
            A.2.7 Sums of Values
            A.2.8 A Useful Idempotent Matrix
        A.3 Geometry of Matrices
            A.3.1 Vector Spaces
            A.3.2 Linear Combinations of Vectors and Basis Vectors
            A.3.3 Linear Dependence
            A.3.4 Subspaces
            A.3.5 Rank of a Matrix
            A.3.6 Determinant of a Matrix
            A.3.7 A Least Squares Problem
        A.4 Solution of a System of Linear Equations
            A.4.1 Systems of Linear Equations
            A.4.2 Inverse Matrices
            A.4.3 Nonhomogeneous Systems of Equations
            A.4.4 Solving the Least Squares Problem
        A.5 Partitioned Matrices
            A.5.1 Addition and Multiplication of Partitioned Matrices
            A.5.2 Determinants of Partitioned Matrices
            A.5.3 Inverses of Partitioned Matrices
            A.5.4 Deviations from Means
            A.5.5 Kronecker Products
        A.6 Characteristic Roots and Vectors
            A.6.1 The Characteristic Equation
            A.6.2 Characteristic Vectors
            A.6.3 General Results for Characteristic Roots and Vectors
            A.6.4 Diagonalization and Spectral Decomposition of a Matrix
            A.6.5 Rank of a Matrix
            A.6.6 Condition Number of a Matrix
            A.6.7 Trace of a Matrix
            A.6.8 Determinant of a Matrix
            A.6.9 Powers of a Matrix

            A.6.10 Idempotent Matrices
            A.6.11 Factoring a Matrix
            A.6.12 The Generalized Inverse of a Matrix
        A.7 Quadratic Forms and Definite Matrices
            A.7.1 Nonnegative Definite Matrices
            A.7.2 Idempotent Quadratic Forms
            A.7.3 Comparing Matrices
        A.8 Calculus and Matrix Algebra
            A.8.1 Differentiation and the Taylor Series
            A.8.2 Optimization
            A.8.3 Constrained Optimization
            A.8.4 Transformations

    APPENDIX B Probability and Distribution Theory
        B.1 Introduction
        B.2 Random Variables
            B.2.1 Probability Distributions
            B.2.2 Cumulative Distribution Function
        B.3 Expectations of a Random Variable
        B.4 Some Specific Probability Distributions
            B.4.1 The Normal Distribution
            B.4.2 The Chi-Squared, t, and F Distributions
            B.4.3 Distributions With Large Degrees of Freedom
            B.4.4 Size Distributions: The Lognormal Distribution
            B.4.5 The Gamma and Exponential Distributions
            B.4.6 The Beta Distribution
            B.4.7 The Logistic Distribution
            B.4.8 Discrete Random Variables
        B.5 The Distribution of a Function of a Random Variable
        B.6 Representations of a Probability Distribution
        B.7 Joint Distributions
            B.7.1 Marginal Distributions
            B.7.2 Expectations in a Joint Distribution
            B.7.3 Covariance and Correlation
            B.7.4 Distribution of a Function of Bivariate Random Variables
        B.8 Conditioning in a Bivariate Distribution
            B.8.1 Regression: The Conditional Mean
            B.8.2 Conditional Variance
            B.8.3 Relationships Among Marginal and Conditional Moments
            B.8.4 The Analysis of Variance
        B.9 The Bivariate Normal Distribution
        B.10 Multivariate Distributions
            B.10.1 Moments

        B.10.2 Sets of Linear Functions
        B.10.3 Nonlinear Functions
    B.11 The Multivariate Normal Distribution
        B.11.1 Marginal and Conditional Normal Distributions
        B.11.2 The Classical Normal Linear Regression Model
        B.11.3 Linear Functions of a Normal Vector
        B.11.4 Quadratic Forms in a Standard Normal Vector
        B.11.5 The F Distribution
        B.11.6 A Full Rank Quadratic Form
        B.11.7 Independence of a Linear and a Quadratic Form

    APPENDIX C Estimation and Inference
    C.1 Introduction
    C.2 Samples and Random Sampling
    C.3 Descriptive Statistics
    C.4 Statistics as Estimators: Sampling Distributions
    C.5 Point Estimation of Parameters
        C.5.1 Estimation in a Finite Sample
        C.5.2 Efficient Unbiased Estimation
    C.6 Interval Estimation
    C.7 Hypothesis Testing
        C.7.1 Classical Testing Procedures
        C.7.2 Tests Based on Confidence Intervals
        C.7.3 Specification Tests

    APPENDIX D Large Sample Distribution Theory
    D.1 Introduction
    D.2 Large Sample Distribution Theory
        D.2.1 Convergence in Probability
        D.2.2 Other Forms of Convergence and Laws of Large Numbers
        D.2.3 Convergence of Functions
        D.2.4 Convergence to a Random Variable
        D.2.5 Convergence in Distribution: Limiting Distributions
        D.2.6 Central Limit Theorems
        D.2.7 The Delta Method
    D.3 Asymptotic Distributions
        D.3.1 Asymptotic Distribution of a Nonlinear Function
        D.3.2 Asymptotic Expectations
    D.4 Sequences and the Order of a Sequence

    APPENDIX E Computation and Optimization
    E.1 Introduction
    E.2 Data Input and Generation
        E.2.1 Generating Pseudo-Random Numbers


        E.2.2 Sampling from a Standard Uniform Population
        E.2.3 Sampling from Continuous Distributions
        E.2.4 Sampling from a Multivariate Normal Population
        E.2.5 Sampling from a Discrete Population
        E.2.6 The Gibbs Sampler
    E.3 Monte Carlo Studies
    E.4 Bootstrapping and the Jackknife
    E.5 Computation in Econometrics
        E.5.1 Computing Integrals
        E.5.2 The Standard Normal Cumulative Distribution Function
        E.5.3 The Gamma and Related Functions
        E.5.4 Approximating Integrals by Quadrature
        E.5.5 Monte Carlo Integration
        E.5.6 Multivariate Normal Probabilities and Simulated Moments
        E.5.7 Computing Derivatives
    E.6 Optimization
        E.6.1 Algorithms
        E.6.2 Gradient Methods
        E.6.3 Aspects of Maximum Likelihood Estimation
        E.6.4 Optimization with Constraints
        E.6.5 Some Practical Considerations
        E.6.6 Examples

    APPENDIX F Data Sets Used in Applications
    APPENDIX G Statistical Tables
    References
    Author Index
    Subject Index


    PREFACE

    1 THE FIFTH EDITION OF ECONOMETRIC ANALYSIS

    Econometric Analysis is intended for a one-year graduate course in econometrics for social scientists. The prerequisites for this course should include calculus, mathematical statistics, and an introduction to econometrics at the level of, say, Gujarati's Basic Econometrics (McGraw-Hill, 1995) or Wooldridge's Introductory Econometrics: A Modern Approach (South-Western, 2000). Self-contained (for our purposes) summaries of the matrix algebra, mathematical statistics, and statistical theory used later in the book are given in Appendices A through D. Appendix E contains a description of numerical methods that will be useful to practicing econometricians. The formal presentation of econometrics begins with discussion of a fundamental pillar, the linear multiple regression model, in Chapters 2 through 8. Chapters 9 through 15 present familiar extensions of the single linear equation model, including nonlinear regression, panel data models, the generalized regression model, and systems of equations. The linear model is usually not the sole technique used in most of the contemporary literature. In view of this, the (expanding) second half of this book is devoted to topics that will extend the linear regression model in many directions. Chapters 16 through 18 present the techniques and underlying theory of estimation in econometrics, including GMM and maximum likelihood estimation methods and simulation-based techniques. We end in the last four chapters, 19 through 22, with discussions of current topics in applied econometrics, including time series analysis and the analysis of discrete choice and limited dependent variable models.

    This book has two objectives. The first is to introduce students to applied econometrics, including basic techniques in regression analysis and some of the rich variety of models that are used when the linear model proves inadequate or inappropriate. The second is to present students with sufficient theoretical background that they will recognize new variants of the models learned about here as merely natural extensions that fit within a common body of principles. Thus, I have spent what might seem to be a large amount of effort explaining the mechanics of GMM estimation, nonlinear least squares, and maximum likelihood estimation and GARCH models. To meet the second objective, this book also contains a fair amount of theoretical material, such as that on maximum likelihood estimation and on asymptotic results for regression models. Modern software has made complicated modeling very easy to do, and an understanding of the underlying theory is important.

    I had several purposes in undertaking this revision. As in the past, readers continue to send me interesting ideas for my next edition. It is impossible to use them all, of

    course. Because the five volumes of the Handbook of Econometrics and two of the Handbook of Applied Econometrics already run to over 4,000 pages, it is also unnecessary. Nonetheless, this revision is appropriate for several reasons. First, there are new and interesting developments in the field, particularly in the areas of microeconometrics, panel data, models for discrete choice and, of course, in time series, which continues its rapid development. Second, I have taken the opportunity to continue fine-tuning the text as the experience and shared wisdom of my readers accumulates in my files. For this revision, that adjustment has entailed a substantial rearrangement of the material; the main purpose of that was to allow me to add the new material in a more compact and orderly way than I could have with the table of contents in the 4th edition. The literature in econometrics has continued to evolve, and my third objective is to grow with it. This purpose is inherently difficult to accomplish in a textbook. Most of the literature is written by professionals for other professionals, and this textbook is written for students who are in the early stages of their training. But I do hope to provide a bridge to that literature, both theoretical and applied.

    This book is a broad survey of the field of econometrics. This field grows continually, and such an effort becomes increasingly difficult. (A partial list of journals devoted at least in part, if not completely, to econometrics now includes the Journal of Applied Econometrics, Journal of Econometrics, Econometric Theory, Econometric Reviews, Journal of Business and Economic Statistics, Empirical Economics, and Econometrica.) Still, my view has always been that the serious student of the field must start somewhere, and one can successfully seek that objective in a single textbook. This text attempts to survey, at an entry level, enough of the fields in econometrics that a student can comfortably move from here to practice or more advanced study in one or more specialized areas. At the same time, I have tried to present the material in sufficient generality that the reader is also able to appreciate the important common foundation of all these fields and to use the tools that they all employ.

    There are now quite a few recently published texts in econometrics. Several have gathered in compact, elegant treatises the increasingly advanced and advancing theoretical background of econometrics. Others, such as this book, focus more attention on applications of econometrics. One feature that distinguishes this work from its predecessors is its greater emphasis on nonlinear models. [Davidson and MacKinnon (1993) is a noteworthy, but more advanced, exception.] Computer software now in wide use has made estimation of nonlinear models as routine as estimation of linear ones, and the recent literature reflects that progression. My purpose is to provide a textbook treatment that is in line with current practice. The book concludes with four lengthy chapters on time series analysis, discrete choice models, and limited dependent variable models. These nonlinear models are now the staples of the applied econometrics literature. This book also contains a fair amount of material that will extend beyond many first courses in econometrics, including, perhaps, the aforementioned chapters on limited dependent variables, the section in Chapter 22 on duration models, and some of the discussions of time series and panel data models. Once again, I have included these in the hope of providing a bridge to the professional literature in these areas.

    I have had one overriding purpose that has motivated all five editions of this work. For the vast majority of readers of books such as this, whose ambition is to use, not develop, econometrics, I believe that it is simply not sufficient to recite the theory of estimation, hypothesis testing, and econometric analysis. Understanding the often subtle


    background theory is extremely important. But, at the end of the day, my purpose in writing this work, and for my continuing efforts to update it in this now fifth edition, is to show readers how to do econometric analysis. I unabashedly accept the unflattering assessment of a correspondent who once likened this book to a user's guide to econometrics.

    2 SOFTWARE AND DATA

    There are many computer programs that are widely used for the computations described in this book. All were written by econometricians or statisticians, and in general, all are regularly updated to incorporate new developments in applied econometrics. A sampling of the most widely used packages and Internet home pages where you can find information about them are

    E-Views    www.eviews.com        QMS, Irvine, Calif.
    Gauss      www.aptech.com        Aptech Systems, Kent, Wash.
    LIMDEP     www.limdep.com        Econometric Software, Plainview, N.Y.
    RATS       www.estima.com        Estima, Evanston, Ill.
    SAS        www.sas.com           SAS, Cary, N.C.
    Shazam     shazam.econ.ubc.ca    Ken White, UBC, Vancouver, B.C.
    Stata      www.stata.com         Stata, College Station, Tex.
    TSP        www.tspintl.com       TSP International, Stanford, Calif.

    Programs vary in size, complexity, cost, the amount of programming required of the user, and so on. Journals such as The American Statistician, The Journal of Applied Econometrics, and The Journal of Economic Surveys regularly publish reviews of individual packages and comparative surveys of packages, usually with reference to particular functionality such as panel data analysis or forecasting.

    With only a few exceptions, the computations described in this book can be carried out with any of these packages. We hesitate to link this text to any of them in particular. We have placed for general access a customized version of LIMDEP, which was also written by the author, on the website for this text, http://www.stern.nyu.edu/~wgreene/Text/econometricanalysis.htm. LIMDEP programs used for many of the computations are posted on the site as well.

    The data sets used in the examples are also on the website. Throughout the text, these data sets are referred to as Table Fn.m, for example Table F4.1. The F refers to Appendix F at the back of the text, which contains descriptions of the data sets. The actual data are posted on the website with the other supplementary materials for the text. (The data sets are also replicated in the system format of most of the commonly used econometrics computer programs, including, in addition to LIMDEP, SAS, TSP, SPSS, E-Views, and Stata, so that you can easily import them into whatever program you might be using.)

    I should also note, there are now thousands of interesting websites containing software, data sets, papers, and commentary on econometrics. It would be hopeless to attempt any kind of a survey here. But I do note one which is particularly agreeably structured and well targeted for readers of this book, the data archive for the


    Journal of Applied Econometrics. This journal publishes many papers that are precisely at the right level for readers of this text. They have archived all the nonconfidential data sets used in their publications since 1994. This useful archive can be found at http://qed.econ.queensu.ca/jae/.

    3 ACKNOWLEDGEMENTS

    It is a pleasure to express my appreciation to those who have influenced this work. I am grateful to Arthur Goldberger and Arnold Zellner for their encouragement, guidance, and always interesting correspondence. Dennis Aigner and Laurits Christensen were also influential in shaping my views on econometrics. Some collaborators to the earlier editions whose contributions remain in this one include Aline Quester, David Hensher, and Donald Waldman. The number of students and colleagues whose suggestions have helped to produce what you find here is far too large to allow me to thank them all individually. I would like to acknowledge the many reviewers of my work whose careful reading has vastly improved the book: Badi Baltagi, University of Houston; Neal Beck, University of California at San Diego; Diane Belleville, Columbia University; Anil Bera, University of Illinois; John Burkett, University of Rhode Island; Leonard Carlson, Emory University; Frank Chaloupka, City University of New York; Chris Cornwell, University of Georgia; Mitali Das, Columbia University; Craig Depken II, University of Texas at Arlington; Edward Dwyer, Clemson University; Michael Ellis, Wesleyan University; Martin Evans, New York University; Ed Greenberg, Washington University at St. Louis; Miguel Herce, University of North Carolina; K. Rao Kadiyala, Purdue University; Tong Li, Indiana University; Lubomir Litov, New York University; William Lott, University of Connecticut; Edward Mathis, Villanova University; Mary McGarvey, University of Nebraska-Lincoln; Ed Melnick, New York University; Thad Mirer, State University of New York at Albany; Paul Ruud, University of California at Berkeley; Sherrie Rhine, Chicago Federal Reserve Board; Terry G. Seaks, University of North Carolina at Greensboro; Donald Snyder, California State University at Los Angeles; Steven Stern, University of Virginia; Houston Stokes, University of Illinois at Chicago; Dimitrios Thomakos, Florida International University; Paul Wachtel, New York University; Mark Watson, Harvard University; and Kenneth West, University of Wisconsin. My numerous discussions with B. D. McCullough have improved Appendix E and at the same time increased my appreciation for numerical analysis. I am especially grateful to Jan Kiviet of the University of Amsterdam, who subjected my third edition to a microscopic examination and provided literally scores of suggestions, virtually all of which appear herein. Chapters 19 and 20 have also benefited from previous reviews by Frank Diebold, B. D. McCullough, Mary McGarvey, and Nagesh Revankar. I would also like to thank Rod Banister, Gladys Soto, Cindy Regan, Mike Reynolds, Marie McHale, Lisa Amato, and Torie Anderson at Prentice Hall for their contributions to the completion of this book. As always, I owe the greatest debt to my wife, Lynne, and to my daughters, Lesley, Allison, Elizabeth, and Julianna.

    William H. Greene

    Greene 50240

    book

    May 24 2002

    10 36

    1 INTRODUCTION

    1.1 ECONOMETRICS

    In the first issue of Econometrica, the Econometric Society stated that

    "its main object shall be to promote studies that aim at a unification of the theoretical-quantitative and the empirical-quantitative approach to economic problems and that are penetrated by constructive and rigorous thinking similar to that which has come to dominate the natural sciences.

    "But there are several aspects of the quantitative approach to economics, and no single one of these aspects, taken by itself, should be confounded with econometrics. Thus, econometrics is by no means the same as economic statistics. Nor is it identical with what we call general economic theory, although a considerable portion of this theory has a definitely quantitative character. Nor should econometrics be taken as synonomous [sic] with the application of mathematics to economics. Experience has shown that each of these three viewpoints, that of statistics, economic theory, and mathematics, is a necessary, but not by itself a sufficient, condition for a real understanding of the quantitative relations in modern economic life. It is the unification of all three that is powerful. And it is this unification that constitutes econometrics."

    Frisch (1933) and his society responded to an unprecedented accumulation of statistical information. They saw a need to establish a body of principles that could organize what would otherwise become a bewildering mass of data. Neither the pillars nor the objectives of econometrics have changed in the years since this editorial appeared. Econometrics is the field of economics that concerns itself with the application of mathematical statistics and the tools of statistical inference to the empirical measurement of relationships postulated by economic theory.

    1.2 ECONOMETRIC MODELING

    Econometric analysis will usually begin with a statement of a theoretical proposition. Consider, for example, a canonical application:

    Example 1.1 Keynes's Consumption Function

    From Keynes's (1936) General Theory of Employment, Interest and Money:

    We shall therefore define what we shall call the propensity to consume as the functional relationship f between X, a given level of income, and C, the expenditure on consumption out of the level of income, so that $C = f(X)$.

    The amount that the community spends on consumption depends (i) partly on the amount of its income, (ii) partly on other objective attendant circumstances, and


    (iii) partly on the subjective needs and the psychological propensities and habits of the individuals composing it.

    The fundamental psychological law upon which we are entitled to depend with great confidence, both a priori from our knowledge of human nature and from the detailed facts of experience, is that men are disposed, as a rule and on the average, to increase their consumption as their income increases, but not by as much as the increase in their income.¹ That is, $dC/dX$ is positive and less than unity.

    But, apart from short period changes in the level of income, it is also obvious that a higher absolute level of income will tend as a rule to widen the gap between income and consumption. These reasons will lead, as a rule, to a greater proportion of income being saved as real income increases.

    The theory asserts a relationship between consumption and income, $C = f(X)$, and claims in the third paragraph that the marginal propensity to consume (MPC), $dC/dX$, is between 0 and 1. The final paragraph asserts that the average propensity to consume (APC), $C/X$, falls as income rises, or $d(C/X)/dX = (X\,dC/dX - C)/X^2 = (\mathrm{MPC} - \mathrm{APC})/X < 0$. It follows that $\mathrm{MPC} < \mathrm{APC}$. The most common formulation of the consumption function is a linear relationship, $C = \alpha + \beta X$, that satisfies Keynes's laws if $\beta$ lies between zero and one and if $\alpha$ is greater than zero.

    These theoretical propositions provide the basis for an econometric study. Given an appropriate data set, we could investigate whether the theory appears to be consistent with the observed facts. For example, we could see whether the linear specification appears to be a satisfactory description of the relationship between consumption and income, and, if so, whether $\alpha$ is positive and $\beta$ is between zero and one. Some issues that might be studied are (1) whether this relationship is stable through time or whether the parameters of the relationship change from one generation to the next (a change in the average propensity to save, $1 - \mathrm{APC}$, might represent a fundamental change in the behavior of consumers in the economy); (2) whether there are systematic differences in the relationship across different countries, and, if so, what explains these differences; and (3) whether there are other factors that would improve the ability of the model to explain the relationship between consumption and income. For example, Figure 1.1 presents aggregate consumption and personal income in constant dollars for the U.S. for the 10 years of 1970-1979. (See Appendix Table F1.1.) Apparently, at least superficially, the data (the facts) are consistent with the theory. The relationship appears to be linear, albeit only approximately; the intercept of a line that lies close to most of the points is positive, and the slope is less than one, although not by much.
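
    The checks described in Example 1.1 can be sketched numerically. The following is a minimal illustration only, under stated assumptions: the data are simulated (they are not the values in Appendix Table F1.1), the parameter values `alpha_true` and `beta_true` and the noise level are hypothetical choices, and the line is fitted by ordinary least squares, a technique the text develops only in later chapters.

```python
import numpy as np

# Hypothetical, simulated data. These are NOT the observations in Table F1.1.
rng = np.random.default_rng(0)
alpha_true, beta_true = 50.0, 0.9      # assumed values, chosen for illustration
X = np.linspace(750.0, 1000.0, 10)     # income, 10 "years"
C = alpha_true + beta_true * X + rng.normal(0.0, 2.0, size=X.size)  # consumption

# Fit the linear consumption function C = alpha + beta * X by least squares.
Z = np.column_stack([np.ones_like(X), X])
coef, *_ = np.linalg.lstsq(Z, C, rcond=None)
alpha_hat, beta_hat = coef

# Keynes's propositions for the fitted line:
mpc = beta_hat                          # dC/dX is constant in the linear model
apc = (alpha_hat + beta_hat * X) / X    # fitted C/X at each income level

print("intercept positive:  ", alpha_hat > 0)
print("0 < MPC < 1:         ", 0 < mpc < 1)
print("APC falls with X:    ", bool(np.all(np.diff(apc) < 0)))
print("MPC < APC everywhere:", bool(np.all(mpc < apc)))
```

    In the linear case the fitted APC is $\alpha/X + \beta$, so it declines in income exactly when the intercept is positive, which is why the last two checks agree with the sign of the estimated intercept.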

    ¹ Modern economists are rarely this confident about their theories. More contemporary applications generally begin from first principles and behavioral axioms, rather than simple observation.

    Economic theories such as Keynes's are typically crisp and unambiguous. Models of demand, production, and aggregate consumption all specify precise, deterministic relationships. Dependent and independent variables are identified, a functional form is specified, and in most cases, at least a qualitative statement is made about the directions of effects that occur when independent variables in the model change. Of course, the model is only a simplification of reality. It will include the salient features of the relationship of interest, but will leave unaccounted for influences that might well be present but are regarded as unimportant. No model could hope to encompass the myriad essentially random aspects of economic life. It is thus also necessary to incorporate stochastic elements. As a consequence, observations on a dependent variable will display variation attributable not only to differences in variables that are explicitly accounted for, but also to the randomness of human behavior and the interaction of countless minor influences that are not. It is understood that the introduction of a random disturbance into a deterministic model is not intended merely to paper over its inadequacies. It is


    FIGURE 1.1 Consumption Data, 1970-1979. (Scatter plot of consumption C on the vertical axis against income X on the horizontal axis.)

    essential to examine the results of the study, in a sort of postmortem, to ensure that the allegedly random, unexplained factor is truly unexplainable. If it is not, the model is, in fact, inadequate. The stochastic element endows the model with its statistical properties. Observations on the variable(s) under study are thus taken to be the outcomes of a random process. With a sufficiently detailed stochastic structure and adequate data, the analysis will become a matter of deducing the properties of a probability distribution. The tools and methods of mathematical statistics will provide the operating principles.

    A model (or theory) can never truly be confirmed unless it is made so broad as to include every possibility. But it may be subjected to ever more rigorous scrutiny and, in the face of contradictory evidence, refuted. A deterministic theory will be invalidated by a single contradictory observation. The introduction of stochastic elements into the model changes it from an exact statement to a probabilistic description about expected outcomes and carries with it an important implication. Only a preponderance of contradictory evidence can convincingly invalidate the probabilistic model, and what constitutes a "preponderance of evidence" is a matter of interpretation. Thus, the probabilistic model is less precise but at the same time, more robust.²

    ² See Keuzenkamp and Magnus (1995) for a lengthy symposium on testing in econometrics.

    The process of econometric analysis departs from the specification of a theoretical relationship. We initially proceed on the optimistic assumption that we can obtain precise measurements on all the variables in a correctly specified model. If the ideal conditions are met at every step, the subsequent analysis will probably be routine. Unfortunately, they rarely are. Some of the difficulties one can expect to encounter are the following:




    • The data may be badly measured or may correspond only vaguely to the variables in the model. "The interest rate" is one example.
    • Some of the variables may be inherently unmeasurable. "Expectations" are a case in point.
    • The theory may make only a rough guess as to the correct functional form, if it makes any at all, and we may be forced to choose from an embarrassingly long menu of possibilities.
    • The assumed stochastic properties of the random terms in the model may be demonstrably violated, which may call into question the methods of estimation and inference procedures we have used.
    • Some relevant variables may be missing from the model.

    The ensuing steps of the analysis consist of coping with these problems and attempting to cull whatever information is likely to be present in such obviously imperfect data. The methodology is that of mathematical statistics and economic theory. The product is an econometric model.

    1.3 DATA AND METHODOLOGY

    The connection between underlying behavioral models and the modern practice of econometrics is increasingly strong. Practitioners rely heavily on the theoretical tools of microeconomics including utility maximization, profit maximization, and market equilibrium. Macroeconomic model builders rely on the interactions between economic agents and policy makers. The analyses are directed at subtle, difficult questions that often require intricate, complicated formulations. A few applications:



    • What are the likely effects on labor supply behavior of proposed negative income taxes? [Ashenfelter and Heckman (1974).]
    • Does a monetary policy regime that is strongly oriented toward controlling inflation impose a real cost in terms of lost output on the U.S. economy? [Cecchetti and Rich (2001).]
    • Did 2001's largest federal tax cut in U.S. history contribute to or dampen the concurrent recession? Or was it irrelevant? (Still to be analyzed.)
    • Does attending an elite college bring an expected payoff in lifetime expected income sufficient to justify the higher tuition? [Krueger and Dale (2001) and Krueger (2002).]
    • Does a voluntary training program produce tangible benefits? Can these benefits be accurately measured? [Angrist (2001).]

    Each of these analyses would depart from a formal model of the process underlying the observed data.

    The field of econometrics is large and rapidly growing. In one dimension, we can distinguish between theoretical and applied econometrics. Theorists develop new techniques and analyze the consequences of applying particular methods when the assumptions that justify them are not met. Applied econometricians are the users of these techniques and the analysts of data (real world and simulated). Of course, the distinction is far from clean; practitioners routinely develop new analytical tools for the purposes of


    the study that they are involved in. This book contains a heavy dose of theory, but it is directed toward applied econometrics. I have attempted to survey techniques, admittedly some quite elaborate and intricate, that have seen wide use in the field.

    Another loose distinction can be made between microeconometrics and macroeconometrics. The former is characterized largely by its analysis of cross section and panel data and by its focus on individual consumers, firms, and micro-level decision makers. Macroeconometrics is generally involved in the analysis of time series data, usually of broad aggregates such as price levels, the money supply, exchange rates, output, and so on. Once again, the boundaries are not sharp. The very large field of financial econometrics is concerned with long time series data and occasionally vast panel data sets, but with a very focused orientation toward models of individual behavior. The analysis of market returns and exchange rate behavior is neither macro- nor microeconometric in nature, or perhaps it is some of both. Another application that we will examine in this text concerns spending patterns of municipalities, which, again, rests somewhere between the two fields.

    Applied econometric methods will be used for estimation of important quantities, analysis of economic outcomes, markets or individual behavior, testing theories, and for forecasting. The last of these is an art and science in itself, and (fortunately) the subject of a vast library of sources. Though we will briefly discuss some aspects of forecasting, our interest in this text will be on estimation and analysis of models. The presentation, where there is a distinction to be made, will contain a blend of microeconometric and macroeconometric techniques and applications. The first 18 chapters of the book are largely devoted to results that form the platform of both areas. Chapters 19 and 20 focus on time series modeling, while Chapters 21 and 22 are devoted to methods more suited to cross sections and panels, and used more frequently in microeconometrics. Save for some brief applications, we will not be spending much time on financial econometrics. For those with an interest in this field, I would recommend the celebrated work by Campbell, Lo, and MacKinlay (1997). It is also necessary to distinguish between time series analysis (which is not our focus) and methods that primarily use time series data. The former is, like forecasting, a growth industry served by its own literature in many fields. While we will employ some of the techniques of time series analysis, we will spend relatively little time developing first principles.

    The techniques used in econometrics have been employed in a widening variety of fields, including political methodology, sociology [see, e.g., Long (1997)], health economics, medical research (how do we handle attrition from medical treatment studies?), environmental economics, transportation engineering, and numerous others. Practitioners in these fields and many more are all heavy users of the techniques described in this text.

    1.4 PLAN OF THE BOOK

    The remainder of this book is organized into five parts:

    1. Chapters 2 through 9 present the classical linear and nonlinear regression models. We will discuss specification, estimation, and statistical inference.
    2. Chapters 10 through 15 describe the generalized regression model, panel data applications, and systems of equations.
    3. Chapters 16 through 18 present general results on different methods of estimation including maximum likelihood, GMM, and simulation methods. Various estimation frameworks, including non- and semiparametric and Bayesian estimation, are presented in Chapters 16 and 18.
    4. Chapters 19 through 22 present topics in applied econometrics. Chapters 19 and 20 are devoted to topics in time series modeling, while Chapters 21 and 22 are about microeconometrics, discrete choice modeling, and limited dependent variables.
    5. Appendices A through D present background material on tools used in econometrics including matrix algebra, probability and distribution theory, estimation, and asymptotic distribution theory. Appendix E presents results on computation. Appendices A through D are chapter-length surveys of the tools used in econometrics. Since it is assumed that the reader has some previous training in each of these topics, these summaries are included primarily for those who desire a refresher or a convenient reference. We do not anticipate that these appendices can substitute for a course in any of these subjects. The intent of these chapters is to provide a reasonably concise summary of the results, nearly all of which are explicitly used elsewhere in the book.

    The data sets used in the numerical examples are described in Appendix F. The actual data sets and other supplementary materials can be downloaded from the website for the text, www.prenhall.com/greene.


    2 THE CLASSICAL MULTIPLE LINEAR REGRESSION MODEL

    2.1 INTRODUCTION

    An econometric study begins with a set of propositions about some aspect of the economy. The theory specifies a set of precise, deterministic relationships among variables. Familiar examples are demand equations, production functions, and macroeconomic models. The empirical investigation provides estimates of unknown parameters in the model, such as elasticities or the effects of monetary policy, and usually attempts to measure the validity of the theory against the behavior of observable data. Once suitably constructed, the model might then be used for prediction or analysis of behavior. This book will develop a large number of models and techniques used in this framework.

    The linear regression model is the single most useful tool in the econometrician's kit. Though to an increasing degree in the contemporary literature, it is often only the departure point for the full analysis, it remains the device used to begin almost all empirical research. This chapter will develop the model. The next several chapters will discuss more elaborate specifications and complications that arise in the application of techniques that are based on the simple models presented here.

    2.2 THE LINEAR REGRESSION MODEL

    The multiple linear regression model is used to study the relationship between a dependent variable and one or more independent variables. The generic form of the linear regression model is

    $$y = f(x_1, x_2, \ldots, x_K) + \varepsilon = x_1\beta_1 + x_2\beta_2 + \cdots + x_K\beta_K + \varepsilon$$    (2-1)

    where $y$ is the dependent or explained variable and $x_1, \ldots, x_K$ are the independent or explanatory variables. One's theory will specify $f(x_1, x_2, \ldots, x_K)$. This function is commonly called the population regression equation of $y$ on $x_1, \ldots, x_K$. In this setting, $y$ is the regressand and $x_k$, $k = 1, \ldots, K$, are the regressors or covariates. The underlying theory will specify the dependent and independent variables in the model. It is not always obvious which is appropriately defined as each of these. For example, a demand equation, $\text{quantity} = \beta_1 + \text{price} \times \beta_2 + \text{income} \times \beta_3 + \varepsilon$, and an inverse demand equation, $\text{price} = \gamma_1 + \text{quantity} \times \gamma_2 + \text{income} \times \gamma_3 + u$, are equally valid representations of a market. For modeling purposes, it will often prove useful to think in terms of "autonomous variation." One can conceive of movement of the independent
    variables outside the relationships defined by the model while movement of the dependent variable is considered in response to some independent or exogenous stimulus.¹

    The term $\varepsilon$ is a random disturbance, so named because it "disturbs" an otherwise stable relationship. The disturbance arises for several reasons, primarily because we cannot hope to capture every influence on an economic variable in a model, no matter how elaborate. The net effect, which can be positive or negative, of these omitted factors is captured in the disturbance. There are many other contributors to the disturbance in an empirical model. Probably the most significant is errors of measurement. It is easy to theorize about the relationships among precisely defined variables; it is quite another to obtain accurate measures of these variables. For example, the difficulty of obtaining reasonable measures of profits, interest rates, capital stocks, or, worse yet, flows of services from capital stocks is a recurrent theme in the empirical literature. At the extreme, there may be no observable counterpart to the theoretical variable. The literature on the permanent income model of consumption [e.g., Friedman (1957)] provides an interesting example.

    We assume that each observation in a sample $(y_i, x_{i1}, x_{i2}, \ldots, x_{iK})$, $i = 1, \ldots, n$, is generated by an underlying process described by

    $$y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{iK}\beta_K + \varepsilon_i.$$

    The observed value of $y_i$ is the sum of two parts, a deterministic part and the random part, $\varepsilon_i$. Our objective is to estimate the unknown parameters of the model, use the data to study the validity of the theoretical propositions, and perhaps use the model to predict the variable $y$. How we proceed from here depends crucially on what we assume about the stochastic process that has led to our observations of the data in hand.
    Example 2.1 Keynes's Consumption Function

    Example 1.1 discussed a model of consumption proposed by Keynes in his General Theory (1936). The theory that consumption, $C$, and income, $X$, are related certainly seems consistent with the observed "facts" in Figures 1.1 and 2.1. (These data are in Data Table F2.1.) Of course, the linear function is only approximate. Even ignoring the anomalous wartime years, consumption and income cannot be connected by any simple deterministic relationship. The linear model, $C = \alpha + \beta X$, is intended only to represent the salient features of this part of the economy. It is hopeless to attempt to capture every influence in the relationship. The next step is to incorporate the inherent randomness in its real-world counterpart. Thus, we write $C = f(X, \varepsilon)$, where $\varepsilon$ is a stochastic element. It is important not to view $\varepsilon$ as a catchall for the inadequacies of the model. The model including $\varepsilon$ appears adequate for the data not including the war years, but for 1942-1945, something systematic clearly seems to be missing. Consumption in these years could not rise to rates historically consistent with these levels of income because of wartime rationing. A model meant to describe consumption in this period would have to accommodate this influence.

    It remains to establish how the stochastic element will be incorporated in the equation. The most frequent approach is to assume that it is additive. Thus, we recast the equation in stochastic terms: $C = \alpha + \beta X + \varepsilon$. This equation is an empirical counterpart to Keynes's theoretical model. But, what of those anomalous years of rationing? If we were to ignore our intuition and attempt to "fit" a line to all these data (the next chapter will discuss at length how we should do that), we might arrive at the dotted line in the figure as our best guess. This line, however, is obviously being distorted by the rationing. A more appropriate
    ¹ By this definition, it would seem that in our demand relationship, only income would be an independent variable while both price and quantity would be dependent. That makes sense: in a market, price and quantity are determined at the same time, and do change only when something outside the market changes. We will return to this specific case in Chapter 15.

    FIGURE 2.1 Consumption Data, 1940-1950. [Scatter plot of consumption, C, against income, X, with each point labeled by its year.]

    specification for these data, one that accommodates both the stochastic nature of the data and the special circumstances of the years 1942-1945, might be one that shifts straight down in the war years, $C = \alpha + \beta X + d_{\text{waryears}}\,\delta_w + \varepsilon$, where the new variable, $d_{\text{waryears}}$, equals one in 1942-1945 and zero in other years and $\delta_w < 0$.
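The downward shift in the war years can be sketched in a few lines of Python. This is only an illustration of how the dummy variable enters the equation: the function names and the numerical values of alpha, beta, and delta_w are assumptions made up for the sketch, not estimates from Data Table F2.1.

```python
# Hypothetical illustration of the war-years dummy variable in Example 2.1.
# The coefficient values below are made up; they are NOT estimates from
# Data Table F2.1.

YEARS = list(range(1940, 1951))          # 1940, ..., 1950


def war_dummy(year):
    """Return 1 for the rationing years 1942-1945 and 0 otherwise."""
    return 1 if 1942 <= year <= 1945 else 0


def consumption(x, year, alpha, beta, delta_w, eps=0.0):
    """C = alpha + beta*X + d_waryears*delta_w + eps."""
    return alpha + beta * x + war_dummy(year) * delta_w + eps


if __name__ == "__main__":
    alpha, beta, delta_w = 50.0, 0.7, -40.0   # assumed values, delta_w < 0
    x = 300.0                                 # same income in both years
    peace = consumption(x, 1941, alpha, beta, delta_w)
    war = consumption(x, 1943, alpha, beta, delta_w)
    print([war_dummy(y) for y in YEARS])
    print(peace - war)   # the vertical gap between the two lines is -delta_w
```

Because the dummy multiplies only the intercept shift, the two lines are parallel: the slope beta is the same in war and peace years, and only the level differs.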

    One of the most useful aspects of the multiple regression model is its ability to identify the independent effects of a set of variables on a dependent variable. Example 2.2 describes a common application.
    Example 2.2 Earnings and Education

    A number of recent studies have analyzed the relationship between earnings and education. We would expect, on average, higher levels of education to be associated with higher incomes. The simple regression model

    $$\text{earnings} = \beta_1 + \beta_2\,\text{education} + \varepsilon,$$

    however, neglects the fact that most people have higher incomes when they are older than when they are young, regardless of their education. Thus, $\beta_2$ will overstate the marginal impact of education. If age and education are positively correlated, then the regression model will associate all the observed increases in income with increases in education. A better specification would account for the effect of age, as in

    $$\text{earnings} = \beta_1 + \beta_2\,\text{education} + \beta_3\,\text{age} + \varepsilon.$$

    It is often observed that income tends to rise less rapidly in the later earning years than in the early ones. To accommodate this possibility, we might extend the model to

    $$\text{earnings} = \beta_1 + \beta_2\,\text{education} + \beta_3\,\text{age} + \beta_4\,\text{age}^2 + \varepsilon.$$

    We would expect $\beta_3$ to be positive and $\beta_4$ to be negative. The crucial feature of this model is that it allows us to carry out a conceptual experiment that might not be observed in the actual data. In the example, we might like to (and could) compare the earnings of two individuals of the same age with different amounts of education even if the data set does not actually contain two such individuals. How education should be

    measured in this setting is a difficult problem. The study of the earnings of twins by Ashenfelter and Krueger (1994), which uses precisely this specification of the earnings equation, presents an interesting approach. We will examine this study in some detail in Section 5.6.4.

    A large literature has been devoted to an intriguing question on this subject. Education is not truly "independent" in this setting. Highly motivated individuals will choose to pursue more education (for example, by going to college or graduate school) than others. By the same token, highly motivated individuals may do things that, on average, lead them to have higher incomes. If so, does a positive $\beta_2$ that suggests an association between income and education really measure the effect of education on income, or does it reflect the effect of some underlying effect on both variables that we have not included in our regression model? We will revisit the issue in Section 22.4.
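The claim that omitting age overstates the education coefficient can be checked on simulated data. The sketch below is an illustration under stated assumptions, not a result from any study cited here: the data generating process, the coefficient values (2 for education, 0.5 for age), and the helper names solve, ols, and simulate are all invented for the example, and the least squares arithmetic anticipates Chapter 3.

```python
import random


def solve(a, c):
    """Solve the linear system a x = c by Gauss-Jordan elimination."""
    n = len(c)
    m = [row[:] + [c[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least squares coefficients from the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


def simulate(n=2000, seed=0):
    """Simulated data in which age is correlated with education and raises earnings."""
    rng = random.Random(seed)
    age = [rng.uniform(20, 60) for _ in range(n)]
    educ = [8 + 0.1 * a + rng.gauss(0, 1) for a in age]          # assumed process
    earn = [1 + 2 * e + 0.5 * a + rng.gauss(0, 1) for e, a in zip(educ, age)]
    short = ols([[1.0, e] for e in educ], earn)                  # age omitted
    long_ = ols([[1.0, e, a] for e, a in zip(educ, age)], earn)  # age included
    return short, long_


if __name__ == "__main__":
    short, long_ = simulate()
    print("education coefficient, age omitted: ", round(short[1], 2))
    print("education coefficient, age included:", round(long_[1], 2))
```

In this simulated design the regression that omits age attributes part of the age effect to education, so its education coefficient is well above the value of 2 used to generate the data, while the regression that includes age recovers a value close to 2.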

    2.3 ASSUMPTIONS OF THE CLASSICAL LINEAR REGRESSION MODEL

    The classical linear regression model consists of a set of assumptions about how a data set will be produced by an underlying "data-generating process." The theory will specify a deterministic relationship between the dependent variable and the independent variables. The assumptions that describe the form of the model and relationships among its parts and imply appropriate estimation and inference procedures are listed in Table 2.1.
    2.3.1 LINEARITY OF THE REGRESSION MODEL

    Let the column vector $\mathbf{x}_k$ be the $n$ observations on variable $x_k$, $k = 1, \ldots, K$, and assemble these data in an $n \times K$ data matrix $\mathbf{X}$. In most contexts, the first column of $\mathbf{X}$ is assumed to be a column of 1s so that $\beta_1$ is the constant term in the model. Let $\mathbf{y}$ be the $n$ observations, $y_1, \ldots, y_n$, and let $\boldsymbol{\varepsilon}$ be the column vector containing the $n$ disturbances.

    TABLE 2.1 Assumptions of the Classical Linear Regression Model

    A1. Linearity: $y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{iK}\beta_K + \varepsilon_i$. The model specifies a linear relationship between $y$ and $x_1, \ldots, x_K$.
    A2. Full rank: There is no exact linear relationship among any of the independent variables in the model. This assumption will be necessary for estimation of the parameters of the model.
    A3. Exogeneity of the independent variables: $E[\varepsilon_i \mid x_{j1}, x_{j2}, \ldots, x_{jK}] = 0$. This states that the expected value of the disturbance at observation $i$ in the sample is not a function of the independent variables observed at any observation, including this one. This means that the independent variables will not carry useful information for prediction of $\varepsilon_i$.
    A4. Homoscedasticity and nonautocorrelation: Each disturbance, $\varepsilon_i$, has the same finite variance, $\sigma^2$, and is uncorrelated with every other disturbance, $\varepsilon_j$. This assumption limits the generality of the model, and we will want to examine how to relax it in the chapters to follow.
    A5. Exogenously generated data: The data in $(x_{j1}, x_{j2}, \ldots, x_{jK})$ may be any mixture of constants and random variables. The process generating the data operates outside the assumptions of the model, that is, independently of the process that generates $\varepsilon_i$. Note that this extends A3. Analysis is done conditionally on the observed $\mathbf{X}$.
    A6. Normal distribution: The disturbances are normally distributed. Once again, this is a convenience that we will dispense with after some analysis of its implications.

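    The model in (2-1) as it applies to all $n$ observations can now be written

    $$\mathbf{y} = \mathbf{x}_1\beta_1 + \cdots + \mathbf{x}_K\beta_K + \boldsymbol{\varepsilon},$$    (2-2)

    or in the form of Assumption 1,

    ASSUMPTION: $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$.    (2-3)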

    A NOTATIONAL CONVENTION. Henceforth, to avoid a possibly confusing and cumbersome notation, we will use a boldface $\mathbf{x}$ to denote a column or a row of $\mathbf{X}$. Which applies will be clear from the context. In (2-2), $\mathbf{x}_k$ is the $k$th column of $\mathbf{X}$. Subscripts $j$ and $k$ will be used to denote columns (variables). It will often be convenient to refer to a single observation in (2-3), which we would write

    $$y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i.$$    (2-4)

    Subscripts $i$ and $t$ will generally be used to denote rows (observations) of $\mathbf{X}$. In (2-4), $\mathbf{x}_i$ is a column vector that is the transpose of the $i$th $1 \times K$ row of $\mathbf{X}$.

    Our primary interest is in estimation and inference about the parameter vector $\boldsymbol{\beta}$. Note that the simple regression model in Example 2.1 is a special case in which $\mathbf{X}$ has only two columns, the first of which is a column of 1s. The assumption of linearity of the regression model includes the additive disturbance. For the regression to be linear in the sense described here, it must be of the form in (2-1) either in the original variables or after some suitable transformation. For example, the model $y = A x^{\beta} e^{\varepsilon}$ is linear (after taking logs on both sides of the equation), whereas $y = A x^{\beta} + \varepsilon$ is not. The observed dependent variable is thus the sum of two components, a deterministic element $\alpha + \beta x$ and a random variable $\varepsilon$. It is worth emphasizing that neither of the two parts is directly observed because $\alpha$ and $\beta$ are unknown.

    The linearity assumption is not so narrow as it might first appear. In the regression context, linearity refers to the manner in which the parameters and the disturbance enter the equation, not necessarily to the relationship among the variables. For example, the equations $y = \alpha + \beta x + \varepsilon$, $y = \alpha + \beta \cos(x) + \varepsilon$, $y = \alpha + \beta / x + \varepsilon$, and $y = \alpha + \beta \ln x + \varepsilon$ are all linear in some function of $x$ by the definition we have used here. In the examples, only $x$ has been transformed, but $y$ could have been as well, as in $y = A x^{\beta} e^{\varepsilon}$, which is a linear relationship in the logs of $x$ and $y$: $\ln y = \alpha + \beta \ln x + \varepsilon$. The variety of functions is unlimited. This aspect of the model is used in a number of commonly used functional forms. For example, the loglinear model is

    $$\ln y = \beta_1 + \beta_2 \ln X_2 + \beta_3 \ln X_3 + \cdots + \beta_K \ln X_K + \varepsilon.$$

    This equation is also known as the constant elasticity form as in this equation, the elasticity of $y$ with respect to changes in $x$ is $\partial \ln y / \partial \ln x_k = \beta_k$, which does not vary

    with $x_k$. The loglinear form is often used in models of demand and production. Different values of $\beta$ produce widely varying functions.
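The constant elasticity property can be verified numerically. The following sketch, which assumes illustrative values A = 2 and beta = 0.7 and uses hypothetical helper names, computes the elasticity of the deterministic part of $y = A x^{\beta}$ by a central difference in $\ln x$ at several values of x.

```python
import math


def y_loglinear(x, a=2.0, beta=0.7):
    """Deterministic part of y = A * x**beta (A and beta are assumed values)."""
    return a * x ** beta


def elasticity(f, x, h=1e-6):
    """Numerical d ln f / d ln x by a central difference in ln x."""
    up = f(x * math.exp(h))
    down = f(x * math.exp(-h))
    return (math.log(up) - math.log(down)) / (2 * h)


if __name__ == "__main__":
    for x in (0.5, 1.0, 10.0, 250.0):
        # The printed elasticity is the same (beta) at every x.
        print(x, round(elasticity(y_loglinear, x), 6))
```

The same check applied to a function that is linear in levels, such as alpha + beta*x, would give an elasticity that changes with x, which is the contrast the text draws.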
    Example 2.3 The U.S. Gasoline Market

    Data on the U.S. gasoline market for the years 1960-1995 are given in Table F2.2 in Appendix F. We will use these data to obtain, among other things, estimates of the income, own price, and cross-price elasticities of demand in this market. These data also present an interesting question on the issue of holding "all other things constant," that was suggested in Example 2.2. In particular, consider a somewhat abbreviated model of per capita gasoline consumption:

    $$\ln(G/\text{pop}) = \beta_1 + \beta_2 \ln \text{income} + \beta_3 \ln \text{price}_G + \beta_4 \ln P_{\text{newcars}} + \beta_5 \ln P_{\text{usedcars}} + \varepsilon.$$

    This model will provide estimates of the income and price elasticities of demand for gasoline and an estimate of the elasticity of demand with respect to the prices of new and used cars. What should we expect for the sign of $\beta_4$? Cars and gasoline are complementary goods, so if the prices of new cars rise, ceteris paribus, gasoline consumption should fall. Or should it? If the prices of new cars rise, then consumers will buy fewer of them; they will keep their used cars longer and buy fewer new cars. If older cars use more gasoline than newer ones, then the rise in the prices of new cars would lead to higher gasoline consumption than otherwise, not lower. We can use the multiple regression model and the gasoline data to attempt to answer the question.

    A semilog model is often used to model growth rates:

    $$\ln y_t = \mathbf{x}_t'\boldsymbol{\beta} + \delta t + \varepsilon_t.$$

    In this model, the autonomous (at least not explained by the model itself) proportional, per period growth rate is $d \ln y / dt = \delta$. Other variations of the general form

    $$f(y_t) = g(\mathbf{x}_t'\boldsymbol{\beta} + \varepsilon_t)$$

    will allow a tremendous variety of functional forms, all of which fit into our definition of a linear model.

    The linear regression model is sometimes interpreted as an approximation to some unknown, underlying function. (See Section A.8.1 for discussion.) By this interpretation, however, the linear model, even with quadratic terms, is fairly limited in that such an approximation is likely to be useful only over a small range of variation of the independent variables. The translog model discussed in Example 2.4, in contrast, has proved far more effective as an approximating function.
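A small numerical sketch of the semilog growth interpretation follows. It keeps only a constant and the trend term (the $\mathbf{x}_t'\boldsymbol{\beta}$ part is reduced to an assumed constant alpha, and the disturbance is set to zero), and the values alpha = 1.0 and delta = 0.03 are illustrative assumptions.

```python
import math


def y_semilog(t, alpha=1.0, delta=0.03):
    """Deterministic part of ln y_t = alpha + delta*t (assumed values)."""
    return math.exp(alpha + delta * t)


def log_growth(t, alpha=1.0, delta=0.03):
    """Per-period change in ln y, which equals delta at every t."""
    return math.log(y_semilog(t + 1, alpha, delta)) - math.log(y_semilog(t, alpha, delta))


if __name__ == "__main__":
    for t in (0, 10, 50):
        print(t, round(log_growth(t), 6))
    # The exact proportional change per period is exp(delta) - 1,
    # slightly above delta itself.
    print(math.exp(0.03) - 1)
```

The difference in logs is constant over time, which is why delta is read as the per-period growth rate; for small delta the exact proportional change exp(delta) - 1 is close to delta.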
    Example 2.4 The Translog Model

    Modern studies of demand and production are usually done in the context of a flexible functional form. Flexible functional forms are used in econometrics because they allow analysts to model second-order effects such as elasticities of substitution, which are functions of the second derivatives of production, cost, or utility functions. The linear model restricts these to equal zero, whereas the loglinear model (e.g., the Cobb-Douglas model) restricts the interesting elasticities to the uninteresting values of $-1$ or $+1$. The most popular flexible functional form is the translog model, which is often interpreted as a second-order approximation to an unknown functional form. [See Berndt and Christensen (1973).] One way to derive it is as follows. We first write $y = g(x_1, \ldots, x_K)$. Then, $\ln y = \ln g(\ldots) = f(\ldots)$. Since by a trivial transformation $x_k = \exp(\ln x_k)$, we interpret the function as a function of the logarithms of the $x$'s. Thus, $\ln y = f(\ln x_1, \ldots, \ln x_K)$.

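    Now, expand this function in a second-order Taylor series around the point $\mathbf{x} = [1, 1, \ldots, 1]'$ so that at the expansion point, the log of each variable is a convenient zero. Then

    $$\ln y = f(\mathbf{0}) + \sum_{k=1}^{K} \left[\frac{\partial f(\cdot)}{\partial \ln x_k}\right]_{\ln \mathbf{x} = \mathbf{0}} \ln x_k + \frac{1}{2} \sum_{k=1}^{K} \sum_{l=1}^{K} \left[\frac{\partial^2 f(\cdot)}{\partial \ln x_k \, \partial \ln x_l}\right]_{\ln \mathbf{x} = \mathbf{0}} \ln x_k \ln x_l + \varepsilon.$$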

    The disturbance in this model is assumed to embody the familiar factors and the error of approximation to the unknown function. Since the function and its derivatives evaluated at the fixed value $\mathbf{0}$ are constants, we interpret them as the coefficients and write

    $$\ln y = \beta_0 + \sum_{k=1}^{K} \beta_k \ln x_k + \frac{1}{2} \sum_{k=1}^{K} \sum_{l=1}^{K} \gamma_{kl} \ln x_k \ln x_l + \varepsilon.$$

    This model is linear by our definition but can, in fact, mimic an impressive amount of curvature when it is used to approximate another function. An interesting feature of this formulation is that the loglinear model is a special case, $\gamma_{kl} = 0$. Also, there is an interesting test of the underlying theory possible because if the underlying function were assumed to be continuous and twice continuously differentiable, then by Young's theorem it must be true that $\gamma_{kl} = \gamma_{lk}$. We will see in Chapter 14 how this feature is studied in practice.
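The structure of the translog form, and the two special features just noted, can be sketched as follows. The function names and every coefficient value are assumptions chosen for illustration; the disturbance is omitted so that only the deterministic part is evaluated.

```python
import math


def translog(x, beta0, beta, gamma):
    """ln y = beta0 + sum_k beta_k ln x_k + 0.5 sum_k sum_l gamma_kl ln x_k ln x_l."""
    lx = [math.log(v) for v in x]
    k = len(lx)
    first = sum(beta[i] * lx[i] for i in range(k))
    second = 0.5 * sum(gamma[i][j] * lx[i] * lx[j]
                       for i in range(k) for j in range(k))
    return beta0 + first + second


def is_symmetric(gamma, tol=1e-12):
    """Check the restriction gamma_kl = gamma_lk implied by Young's theorem."""
    k = len(gamma)
    return all(abs(gamma[i][j] - gamma[j][i]) <= tol
               for i in range(k) for j in range(k))


if __name__ == "__main__":
    x = [2.0, 3.0]
    beta0, beta = 0.5, [0.6, 0.3]                 # assumed values
    gamma = [[0.10, -0.05], [-0.05, 0.20]]        # assumed, symmetric
    zero = [[0.0, 0.0], [0.0, 0.0]]
    print(translog(x, beta0, beta, gamma))
    # With every gamma_kl = 0 the translog collapses to the loglinear model.
    print(translog(x, beta0, beta, zero))
    # At the expansion point x = (1, 1) all logs are zero, leaving beta0.
    print(translog([1.0, 1.0], beta0, beta, gamma))
    print(is_symmetric(gamma))
```

Since the model is linear in beta0, the beta_k, and the gamma_kl, the regressors for estimation would simply be the logs and the products of logs computed inside translog.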

    Despite its great flexibility, the linear model does not include all the situations we encounter in practice. For a simple example, there is no transformation that will reduce $y = \alpha + 1/(\beta_1 + \beta_2 x) + \varepsilon$ to linearity. The methods we consider in this chapter are not appropriate for estimating the parameters of such a model. Relatively straightforward techniques have been developed for nonlinear models such as this, however. We shall treat them in detail in Chapter 9.
    2.3.2 FULL RANK

    Assumption 2 is that there are no exact linear relationships among the variables.

    ASSUMPTION: $\mathbf{X}$ is an $n \times K$ matrix with rank $K$.    (2-5)

    Hence, $\mathbf{X}$ has full column rank; the columns of $\mathbf{X}$ are linearly independent and there are at least $K$ observations. [See (A-42) and the surrounding text.] This assumption is known as an identification condition. To see the need for this assumption, consider an example.
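    Example 2.5 Short Rank

    Suppose that a cross-section model specifies

    $$C = \beta_1 + \beta_2\,\text{nonlabor income} + \beta_3\,\text{salary} + \beta_4\,\text{total income} + \varepsilon,$$

    where total income is exactly equal to salary plus nonlabor income. Clearly, there is an exact linear dependency in the model. Now let $\beta_2' = \beta_2 + a$, $\beta_3' = \beta_3 + a$, and $\beta_4' = \beta_4 - a$,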

    where $a$ is any number. Then the exact same value appears on the right-hand side of $C$ if we substitute $\beta_2'$, $\beta_3'$, and $\beta_4'$ for $\beta_2$, $\beta_3$, and $\beta_4$. Obviously, there is no way to estimate the parameters of this model.
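The identification failure in Example 2.5 can be seen directly by evaluating the right-hand side under the original and the shifted coefficients. In this sketch the household incomes and the coefficient values are made up, and rhs is a hypothetical helper name; the numbers are chosen to be exactly representable so the comparison is exact.

```python
def rhs(nonlabor, salary, b1, b2, b3, b4):
    """Deterministic part of C when total income = salary + nonlabor income."""
    total = salary + nonlabor           # the exact linear dependency
    return b1 + b2 * nonlabor + b3 * salary + b4 * total


if __name__ == "__main__":
    households = [(10.0, 40.0), (5.0, 60.0), (0.0, 25.0)]   # made-up incomes
    b1, b2, b3, b4 = 2.0, 0.5, 0.75, 0.25                    # assumed values
    for a in (0.0, 1.0, -3.5):
        fitted = [rhs(n, s, b1, b2 + a, b3 + a, b4 - a) for n, s in households]
        print(a, fitted)    # the same fitted values for every a
```

Every choice of a produces identical values of the deterministic part, so no amount of data on C, salary, and nonlabor income could distinguish one coefficient vector from another.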

    If there are fewer than $K$ observations, then $\mathbf{X}$ cannot have full rank. Hence, we make the (redundant) assumption that $n$ is at least as large as $K$.

    In a two-variable linear model with a constant term, the full rank assumption means that there must be variation in the regressor $x$. If there is no variation in $x$, then all our observations will lie on a vertical line. This situation does not invalidate the other assumptions of the model; presumably, it is a flaw in the data set. The possibility that this suggests is that we could have drawn a sample in which there was variation in $x$, but in this instance, we did not. Thus, the model still applies, but we cannot learn about it from the data set in hand.
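    2.3.3 REGRESSION

    The disturbance is assumed to have conditional expected value zero at every observation, which we write as

    $$E[\varepsilon_i \mid \mathbf{X}] = 0.$$    (2-6)

    For the full set of observations, we write Assumption 3 as:

    ASSUMPTION: $$E[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \begin{bmatrix} E[\varepsilon_1 \mid \mathbf{X}] \\ E[\varepsilon_2 \mid \mathbf{X}] \\ \vdots \\ E[\varepsilon_n \mid \mathbf{X}] \end{bmatrix} = \mathbf{0}.$$    (2-7)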

    There is a subtle point in this discussion that the observant reader might have noted. In (2-7), the left-hand side states, in principle, that the mean of each $\varepsilon_i$ conditioned on all observations $\mathbf{x}_i$ is zero. This conditional mean assumption states, in words, that no observations on $x$ convey information about the expected value of the disturbance. It is conceivable, for example in a time-series setting, that although $\mathbf{x}_i$ might provide no information about $E[\varepsilon_i \mid \cdot]$, $\mathbf{x}_j$ at some other observation, such as in the next time period, might. Our assumption at this point is that there is no information about $E[\varepsilon_i \mid \cdot]$ contained in any observation $\mathbf{x}_j$. Later, when we extend the model, we will study the implications of dropping this assumption. [See Wooldridge (1995).] We will also assume that the disturbances convey no information about each other. That is, $E[\varepsilon_i \mid \varepsilon_1, \ldots, \varepsilon_{i-1}, \varepsilon_{i+1}, \ldots, \varepsilon_n] = 0$. In sum, at this point, we have assumed that the disturbances are purely random draws from some population.

    The zero conditional mean implies that the unconditional mean is also zero, since

    $$E[\varepsilon_i] = E_{\mathbf{x}}\big[E[\varepsilon_i \mid \mathbf{X}]\big] = E_{\mathbf{x}}[0] = 0.$$

    Since, for each $\varepsilon_i$, $\text{Cov}\big[E[\varepsilon_i \mid \mathbf{X}], \mathbf{X}\big] = \text{Cov}[\varepsilon_i, \mathbf{X}]$, Assumption 3 implies that $\text{Cov}[\varepsilon_i, \mathbf{X}] = \mathbf{0}$ for all $i$. (Exercise: Is the converse true?)

    In most cases, the zero mean assumption is not restrictive. Consider a two-variable model and suppose that the mean of $\varepsilon$ is $\mu \neq 0$. Then $\alpha + \beta x + \varepsilon$ is the same as $(\alpha + \mu) + \beta x + (\varepsilon - \mu)$. Letting $\alpha' = \alpha + \mu$ and $\varepsilon' = \varepsilon - \mu$ produces the original model. For an application, see the discussion of frontier production functions in Section 17.6.3.

    But, if the original model does not contain a constant term, then assuming $E[\varepsilon_i] = 0$ could be substantive. If $E[\varepsilon_i]$ can be expressed as a linear function of $\mathbf{x}_i$, then, as before, a transformation of the model will produce disturbances with zero means. But, if not, then the nonzero mean of the disturbances will be a substantive part of the model structure. This does suggest that there is a potential problem in models without constant terms. As a general rule, regression models should not be specified without constant terms unless this is specifically dictated by the underlying theory.² Arguably, if we have reason to specify that the mean of the disturbance is something other than zero, we should build it into the systematic part of the regression, leaving in the disturbance only the unknown part of $\varepsilon$. Assumption 3 also implies that

    $$E[\mathbf{y} \mid \mathbf{X}] = \mathbf{X}\boldsymbol{\beta}.$$    (2-8)

    Assumptions 1 and 3 comprise the linear regression model. The regression of $\mathbf{y}$ on $\mathbf{X}$ is the conditional mean, $E[\mathbf{y} \mid \mathbf{X}]$, so that without Assumption 3, $\mathbf{X}\boldsymbol{\beta}$ is not the conditional mean function.

    The remaining assumptions will more completely specify the characteristics of the disturbances in the model and state the conditions under which the sample observations on $\mathbf{x}$ are obtained.
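On the exercise posed in this section (whether zero covariance implies a zero conditional mean), a small discrete example suggests one way to think about it. The three-point distribution for x and the disturbance $\varepsilon = x^2 - 2/3$ below are constructions invented for this sketch; exact rational arithmetic is used so that the zeros are exact.

```python
from fractions import Fraction

# x takes the values -1, 0, 1 with equal probability; eps is a function of x.
XS = [Fraction(-1), Fraction(0), Fraction(1)]
P = Fraction(1, 3)


def eps(x):
    """A disturbance that depends on x only through x**2, centered at zero."""
    return x * x - Fraction(2, 3)


def expect(g):
    """Expected value of g(x) under the three-point distribution."""
    return sum(P * g(x) for x in XS)


if __name__ == "__main__":
    mean_eps = expect(eps)
    cov = expect(lambda x: eps(x) * x) - mean_eps * expect(lambda x: x)
    print("E[eps]      =", mean_eps)
    print("Cov[eps, x] =", cov)
    print("E[eps | x]  =", {str(x): str(eps(x)) for x in XS})
```

In this construction the disturbance has mean zero and zero covariance with x, yet its conditional mean varies with x, so zero covariance alone does not deliver Assumption 3.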
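    2.3.4 SPHERICAL DISTURBANCES

    The fourth assumption concerns the variances and covariances of the disturbances:

    $$\text{Var}[\varepsilon_i \mid \mathbf{X}] = \sigma^2, \quad \text{for all } i = 1, \ldots, n,$$

    and

    $$\text{Cov}[\varepsilon_i, \varepsilon_j \mid \mathbf{X}] = 0, \quad \text{for all } i \neq j.$$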

    Constant variance is labeled homoscedasticity. Consider a model that describes the profits of firms in an industry as a function of, say, size. Even accounting for size, measured in dollar terms, the profits of large firms will exhibit greater variation than those of smaller firms. The homoscedasticity assumption would be inappropriate here. Also, survey data on household expenditure patterns often display marked heteroscedasticity, even after accounting for income and household size.

    Uncorrelatedness across observations is labeled generically nonautocorrelation. In Figure 2.1, there is some suggestion that the disturbances might not be truly independent across observations. Although the number of observations is limited, it does appear that, on average, each disturbance tends to be followed by one with the same sign. This "inertia" is precisely what is meant by autocorrelation, and it is assumed away at this point. Methods of handling autocorrelation in economic data occupy a large proportion of the literature and will be treated at length in Chapter 12. Note that nonautocorrelation does not imply that observations $y_i$ and $y_j$ are uncorrelated. The assumption is that deviations of observations from their expected values are uncorrelated.
    ² Models that describe first differences of variables might well be specified without constants. Consider $y_t - y_{t-1}$. If there is a constant term $\alpha$ on the right-hand side of the equation, then $y_t$ is a function of $\alpha t$, which is an explosive regressor. Models with linear time trends merit special treatment in the time-series literature. We will return to this issue in Chapter 19.

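    The two assumptions imply that

    $$E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}' \mid \mathbf{X}] = \begin{bmatrix} E[\varepsilon_1\varepsilon_1 \mid \mathbf{X}] & E[\varepsilon_1\varepsilon_2 \mid \mathbf{X}] & \cdots & E[\varepsilon_1\varepsilon_n \mid \mathbf{X}] \\ E[\varepsilon_2\varepsilon_1 \mid \mathbf{X}] & E[\varepsilon_2\varepsilon_2 \mid \mathbf{X}] & \cdots & E[\varepsilon_2\varepsilon_n \mid \mathbf{X}] \\ \vdots & \vdots & \ddots & \vdots \\ E[\varepsilon_n\varepsilon_1 \mid \mathbf{X}] & E[\varepsilon_n\varepsilon_2 \mid \mathbf{X}] & \cdots & E[\varepsilon_n\varepsilon_n \mid \mathbf{X}] \end{bmatrix} = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix},$$

    which we summarize in Assumption 4:

    ASSUMPTION: $E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}' \mid \mathbf{X}] = \sigma^2\mathbf{I}$.    (2-9)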

    By using the variance decomposition formula in (B-70), we find

    $$\text{Var}[\boldsymbol{\varepsilon}] = E\big[\text{Var}[\boldsymbol{\varepsilon} \mid \mathbf{X}]\big] + \text{Var}\big[E[\boldsymbol{\varepsilon} \mid \mathbf{X}]\big] = \sigma^2\mathbf{I}.$$

    Once again, we should emphasize that this assumption describes the information about the variances and covariances among the disturbances that is provided by the independent variables. For the present, we assume that there is none. We will also drop this assumption later when we enrich the regression model. We are also assuming that the disturbances themselves provide no information about the variances and covariances. Although a minor issue at this point, it will become crucial in our treatment of time-series applications. Models such as $\text{Var}[\varepsilon_t \mid \varepsilon_{t-1}] = \sigma^2 + \alpha\varepsilon_{t-1}^2$, a "GARCH" model (see Section 11.8), do not violate our conditional variance assumption, but do assume that $\text{Var}[\varepsilon_t \mid \varepsilon_{t-1}] \neq \text{Var}[\varepsilon_t]$.

    Disturbances that meet the twin assumptions of homoscedasticity and nonautocorrelation are sometimes called spherical disturbances.³
    2.3.5 DATA GENERATING PROCESS FOR THE REGRESSORS

    It is common to assume that $\mathbf{x}_i$ is nonstochastic, as it would be in an experimental situation. Here the analyst chooses the values of the regressors and then observes $y_i$. This process might apply, for example, in an agricultural experiment in which $y_i$ is yield and $\mathbf{x}_i$ is fertilizer concentration and water applied. The assumption of nonstochastic regressors at this point would be a mathematical convenience. With it, we could use the results of elementary statistics to obtain our results by treating the vector $\mathbf{x}_i$ simply as a known constant in the probability distribution of $y_i$. With this simplification, Assumptions A3 and A4 would be made unconditional and the counterparts would now simply state that the probability distribution of $\varepsilon_i$ involves none of the constants in $\mathbf{X}$.

    Social scientists are almost never able to analyze experimental data, and relatively few of their models are built around nonrandom regressors. Clearly, for example, in
    ³ The term will describe the multivariate normal distribution; see (B-95). If $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$ in the multivariate normal density, then the equation $f(\mathbf{x}) = c$ is the formula for a "ball" centered at $\boldsymbol{\mu}$ with radius $\sigma$ in $n$-dimensional space. The name spherical is used whether or not the normal distribution is assumed; sometimes the "spherical normal" distribution is assumed explicitly.

    any model of the macroeconomy, it would be difficult to defend such an asymmetric treatment of aggregate data. Realistically, we have to allow the data on $\mathbf{x}_i$ to be random the same as $y_i$, so an alternative formulation is to assume that $\mathbf{x}_i$ is a random vector and our formal assumption concerns the nature of the random process that produces $\mathbf{x}_i$. If $\mathbf{x}_i$ is taken to be a random vector, then Assumptions 1 through 4 become a statement about the joint distribution of $y_i$ and $\mathbf{x}_i$. The precise nature of the regressor and how we view the sampling process will be a major determinant of our derivation of the statistical properties of our estimators and test statistics. In the end, the crucial assumption is Assumption 3, the uncorrelatedness of $\mathbf{X}$ and $\boldsymbol{\varepsilon}$. Now, we do note that this alternative is not completely satisfactory either, since $\mathbf{X}$ may well contain nonstochastic elements, including a constant, a time trend, and dummy variables that mark specific episodes in time. This makes for an ambiguous conclusion, but there is a straightforward and economically useful way out of it. We will assume that $\mathbf{X}$ can be a mixture of constants and random variables, but the important assumption is that the ultimate source of the data in $\mathbf{X}$ is unrelated (statistically and economically) to the source of $\boldsymbol{\varepsilon}$.

    ASSUMPTION: $\mathbf{X}$ may be fixed or random, but it is generated by a mechanism that is unrelated to $\boldsymbol{\varepsilon}$.    (2-10)

    2.3.6 NORMALITY

    It is convenient to assume that the disturbances are normally distributed, with zero mean and constant variance. That is, we add normality of the distribution to Assumptions 3 and 4.

    ASSUMPTION: $\boldsymbol{\varepsilon} \mid \mathbf{X} \sim N[\mathbf{0}, \sigma^2\mathbf{I}]$.    (2-11)

    In view of our description of the source of $\boldsymbol{\varepsilon}$, the conditions of the central limit theorem will generally apply, at least approximately, and the normality assumption will be reasonable in most settings. A useful implication of Assumption 6 is that it implies that observations on $\varepsilon_i$ are statistically independent as well as uncorrelated. [See the third point in Section B.8, (B-97) and (B-99).] Normality is often viewed as an unnecessary and possibly inappropriate addition to the regression model. Except in those cases in which some alternative distribution is explicitly assumed, as in the stochastic frontier model discussed in Section 17.6.3, the normality assumption is probably quite reasonable.

    Normality is not necessary to obtain many of the results we use in multiple regression analysis, although it will enable us to obtain several exact statistical results. It does prove useful in constructing test statistics, as shown in Section 4.7. Later, it will be possible to relax this assumption and retain most of the statistical results we obtain here. (See Sections 5.3, 5.4, and 6.4.)

    2.4 SUMMARY AND CONCLUSIONS

    This chapter has framed the linear regression model, the basic platform for model building in econometrics. The assumptions of the classical regression model are summarized in Figure 2.2, which shows the two-variable case.

    FIGURE 2.2 The Classical Regression Model. [Plot of the regression line E(y|x) against x, with the normal density of y drawn around the line at the points x0, x1, and x2.]

    Key Terms and Concepts

    • Autocorrelation
    • Constant elasticity
    • Covariate
    • Dependent variable
    • Deterministic relationship
    • Disturbance
    • Exogeneity
    • Explained variable
    • Explanatory variable
    • Flexible functional form
    • Full rank
    • Heteroscedasticity
    • Homoscedasticity
    • Identification condition
    • Independent variable
    • Linear regression model
    • Loglinear model
    • Multiple linear regression model
    • Nonautocorrelation
    • Nonstochastic regressors
    • Normality
    • Normally distributed
    • Population regression equation
    • Regressand
    • Regression
    • Regressor
    • Second-order effects
    • Semilog
    • Spherical disturbances
    • Translog model

    Greene 50240

    book

    June 3 2002

    9 52

    3

    LEAST SQUARES

    3.1 INTRODUCTION

    Chapter 2 defined the linear regression model as a set of characteristics of the population that underlies an observed sample of data. There are a number of different approaches to estimation of the parameters of the model. For a variety of practical and theoretical reasons that we will explore as we progress through the next several chapters, the method of least squares has long been the most popular. Moreover, in most cases in which some other estimation method is found to be preferable, least squares remains the benchmark approach, and often, the preferred method ultimately amounts to a modification of least squares. In this chapter, we begin the analysis of this important set of results by presenting a useful set of algebraic tools.

    3.2 LEAST SQUARES REGRESSION

    The unknown parameters of the stochastic relation $y_i = x_i'\beta + \varepsilon_i$ are the objects of estimation. It is necessary to distinguish between population quantities, such as $\beta$ and $\varepsilon_i$, and sample estimates of them, denoted $b$ and $e_i$. The population regression is $E[y_i | x_i] = x_i'\beta$, whereas our estimate of $E[y_i | x_i]$ is denoted

    $$\hat{y}_i = x_i'b.$$

    The disturbance associated with the $i$th data point is

    $$\varepsilon_i = y_i - x_i'\beta.$$

    For any value of $b$, we shall estimate $\varepsilon_i$ with the residual

    $$e_i = y_i - x_i'b.$$

    From the definitions,

    $$y_i = x_i'\beta + \varepsilon_i = x_i'b + e_i.$$

    These equations are summarized for the two-variable regression in Figure 3.1. The population quantity $\beta$ is a vector of unknown parameters of the probability distribution of $y_i$ whose values we hope to estimate with our sample data, $(y_i, x_i)$, $i = 1, \ldots, n$. This is a problem of statistical inference. It is instructive, however, to begin by considering the purely algebraic problem of choosing a vector $b$ so that the fitted line $x_i'b$ is close to the data points. The measure of closeness constitutes a fitting criterion.

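    FIGURE 3.1 Population and Sample Regression. [Figure: $y$ plotted against $x$, showing the population regression $E(y) = \alpha + \beta x$ and the fitted line $\hat{y} = a + bx$, with the disturbance $\varepsilon$ and the residual $e$ marked for one data point.]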

    Although numerous candidates have been suggested, the one used most frequently is least squares.[1]

    3.2.1 THE LEAST SQUARES COEFFICIENT VECTOR

    The least squares coefficient vector minimizes the sum of squared residuals:

    $$\sum_{i=1}^{n} e_{i0}^2 = \sum_{i=1}^{n} (y_i - x_i'b_0)^2, \tag{3-1}$$

    where $b_0$ denotes the choice for the coefficient vector. In matrix terms, minimizing the sum of squares in (3-1) requires us to choose $b_0$ to

    $$\text{Minimize}_{b_0}\; S(b_0) = e_0'e_0 = (y - Xb_0)'(y - Xb_0). \tag{3-2}$$

    Expanding this gives

    $$e_0'e_0 = y'y - b_0'X'y - y'Xb_0 + b_0'X'Xb_0 \tag{3-3}$$

    or

    $$S(b_0) = y'y - 2y'Xb_0 + b_0'X'Xb_0.$$

    The necessary condition for a minimum is

    $$\frac{\partial S(b_0)}{\partial b_0} = -2X'y + 2X'Xb_0 = 0. \tag{3-4}$$

    [1] We shall have to establish that the practical approach of fitting the line as closely as possible to the data by least squares leads to estimates with good statistical properties. This makes intuitive sense and is, indeed, the case. We shall return to the statistical issues in Chapters 4 and 5.


    Let $b$ be the solution. Then $b$ satisfies the least squares normal equations,

    $$X'Xb = X'y. \tag{3-5}$$

    If the inverse of $X'X$ exists, which follows from the full rank assumption (Assumption A2 in Section 2.3), then the solution is

    $$b = (X'X)^{-1}X'y. \tag{3-6}$$

    For this solution to minimize the sum of squares,

    $$\frac{\partial^2 S(b)}{\partial b\,\partial b'} = 2X'X$$

    must be a positive definite matrix. Let $q = c'X'Xc$ for some arbitrary nonzero vector $c$. Then

    $$q = v'v = \sum_{i=1}^{n} v_i^2, \quad \text{where } v = Xc.$$

    Unless every element of $v$ is zero, $q$ is positive. But if $v$ could be zero, then $v$ would be a linear combination of the columns of $X$ that equals $0$, which contradicts the assumption that $X$ has full rank. Since $c$ is arbitrary, $q$ is positive for every nonzero $c$, which establishes that $2X'X$ is positive definite. Therefore, if $X$ has full rank, then the least squares solution $b$ is unique and minimizes the sum of squared residuals.
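The normal equations and the second-order condition can be checked numerically. The following is a minimal sketch, assuming Python with NumPy and simulated data; the names `X`, `y`, `b`, and `ssr` are illustrative choices and do not come from the text. It solves $X'Xb = X'y$ directly, compares the result with a library least squares routine, and confirms that $X'X$ is positive definite for a full rank $X$.

```python
import numpy as np

rng = np.random.default_rng(0)          # simulated data, for illustration only
n, K = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])  # constant + 2 regressors
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# Solve the normal equations X'X b = X'y without forming the inverse explicitly.
XtX = X.T @ X
b = np.linalg.solve(XtX, X.T @ y)

# Cross-check against NumPy's least squares solver.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# X'X is positive definite when X has full column rank: all eigenvalues > 0.
eigenvalues = np.linalg.eigvalsh(XtX)

def ssr(coef):
    """Sum of squared residuals S(coef) = (y - X coef)'(y - X coef)."""
    resid = y - X @ coef
    return resid @ resid

print(b)
# Agreement with lstsq, positive definiteness, and a larger S() away from b.
print(np.allclose(b, b_lstsq), eigenvalues.min() > 0, ssr(b) < ssr(b + 0.01))
```

Solving the linear system rather than inverting $X'X$ is a numerical convenience only; algebraically it is the same solution as (3-6).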
    3.2.2 APPLICATION: AN INVESTMENT EQUATION

    To illustrate the computations in a multiple regression, we consider an example based on the macroeconomic data in Data Table F3.1. To estimate an investment equation, we first convert the investment and GNP series in Table F3.1 to real terms by dividing them by the CPI, and then scale the two series so that they are measured in trillions of dollars. The other variables in the regression are a time trend (1, 2, ...), an interest rate, and the rate of inflation computed as the percentage change in the CPI. These produce the data matrices listed in Table 3.1. Consider first a regression of real investment on a constant, the time trend, and real GNP, which correspond to $x_1$, $x_2$, and $x_3$. (For reasons to be discussed in Chapter 20, this is probably not a well specified equation for these macroeconomic variables. It will suffice for a simple numerical example, however.) Inserting the specific variables of the example, we have

    $$b_1 n + b_2 \sum_i T_i + b_3 \sum_i G_i = \sum_i Y_i,$$
    $$b_1 \sum_i T_i + b_2 \sum_i T_i^2 + b_3 \sum_i T_i G_i = \sum_i T_i Y_i,$$
    $$b_1 \sum_i G_i + b_2 \sum_i T_i G_i + b_3 \sum_i G_i^2 = \sum_i G_i Y_i.$$

    A solution can be obtained by first dividing the first equation by $n$ and rearranging it to obtain

    $$b_1 = \bar{Y} - b_2 \bar{T} - b_3 \bar{G} = 0.20333 - b_2 \times 8 - b_3 \times 1.2873. \tag{3-7}$$

    TABLE 3.1 Data Matrices

    Real Investment   Constant   Trend   Real GNP   Interest Rate   Inflation Rate
         (Y)            (1)       (T)      (G)          (R)              (P)

        0.161            1         1      1.058         5.16             4.40
        0.172            1         2      1.088         5.87             5.15
        0.158            1         3      1.086         5.95             5.37
        0.173            1         4      1.122         4.88             4.99
        0.195            1         5      1.186         4.50             4.16
        0.217            1         6      1.254         6.44             5.75
        0.199            1         7      1.246         7.83             8.82
        0.163            1         8      1.232         6.25             9.31
        0.195            1         9      1.298         5.50             5.21
        0.231            1        10      1.370         5.46             5.83
        0.257            1        11      1.439         7.46             7.40
        0.259            1        12      1.479        10.28             8.64
        0.225            1        13      1.474        11.77             9.31
        0.241            1        14      1.503        13.42             9.44
        0.204            1        15      1.475        11.02             5.99

    The first column is the vector $y$; the remaining five columns form the matrix $X$.

    Note: Subsequent results are based on these values. Slightly different results are obtained if the raw data in Table F3.1 are input to the computer program and transformed internally.

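    Insert this solution in the second and third equations, and rearrange terms again to yield a set of two equations:

    $$b_2 \sum_i (T_i - \bar{T})^2 + b_3 \sum_i (T_i - \bar{T})(G_i - \bar{G}) = \sum_i (T_i - \bar{T})(Y_i - \bar{Y}),$$
    $$b_2 \sum_i (T_i - \bar{T})(G_i - \bar{G}) + b_3 \sum_i (G_i - \bar{G})^2 = \sum_i (G_i - \bar{G})(Y_i - \bar{Y}). \tag{3-8}$$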

    This result shows the nature of the solution for the slopes, which can be computed from the sums of squares and cross products of the deviations of the variables. Letting lowercase letters indicate variables measured as deviations from the sample means, we find that the least squares solutions for $b_2$ and $b_3$ are

    $$b_2 = \frac{\sum_i t_i y_i \sum_i g_i^2 - \sum_i g_i y_i \sum_i t_i g_i}{\sum_i t_i^2 \sum_i g_i^2 - \left(\sum_i g_i t_i\right)^2} = \frac{1.6040(0.359609) - 0.066196(9.82)}{280(0.359609) - (9.82)^2} = -0.0171984,$$

    $$b_3 = \frac{\sum_i g_i y_i \sum_i t_i^2 - \sum_i t_i y_i \sum_i t_i g_i}{\sum_i t_i^2 \sum_i g_i^2 - \left(\sum_i g_i t_i\right)^2} = \frac{0.066196(280) - 1.6040(9.82)}{280(0.359609) - (9.82)^2} = 0.653723.$$

    With these solutions in hand, the intercept can now be computed using (3-7); $b_1 = -0.500639$. Suppose that we just regressed investment on the constant and GNP, omitting the time trend. At least some of the correlation we observe in the data will be explainable because both investment and real GNP have an obvious time trend. Consider how this shows up in the regression computation. Denoting by $b_{yx}$ the slope in the simple, bivariate regression of variable $y$ on a constant and the variable $x$, we find that the slope in this reduced regression would be

    $$b_{yg} = \frac{\sum_i g_i y_i}{\sum_i g_i^2} = 0.184078. \tag{3-9}$$


    Now divide both the numerator and denominator in the expression for $b_3$ by $\sum_i t_i^2 \sum_i g_i^2$. By manipulating it a bit and using the definition of the sample correlation between $G$ and $T$, $r_{gt}^2 = \left(\sum_i g_i t_i\right)^2 / \left(\sum_i g_i^2 \sum_i t_i^2\right)$, and defining $b_{yt}$ and $b_{tg}$ likewise, we obtain

    $$b_{yg \cdot t} = \frac{b_{yg}}{1 - r_{gt}^2} - \frac{b_{yt}\, b_{tg}}{1 - r_{gt}^2} = 0.653723. \tag{3-10}$$

    (The notation "$b_{yg \cdot t}$" used on the left-hand side is interpreted to mean the slope in the regression of $y$ on $g$ "in the presence of $t$.") The slope in the multiple regression differs from that in the simple regression by including a correction that accounts for the influence of the additional variable $t$ on both $Y$ and $G$. For a striking example of this effect, in the simple regression of real investment on a time trend, $b_{yt} = 1.604/280 = 0.0057286$, a positive number that reflects the upward trend apparent in the data. But, in the multiple regression, after we account for the influence of GNP on real investment, the slope on the time trend is $-0.0171984$, indicating instead a downward trend. The general result for a three-variable regression in which $x_1$ is a constant term is

    $$b_{y2 \cdot 3} = \frac{b_{y2} - b_{y3}\, b_{32}}{1 - r_{23}^2}. \tag{3-11}$$
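The arithmetic of this example can be reproduced from the Y, T, and G columns of Table 3.1. The following is a sketch in Python with NumPy (the choice of language and all variable names are assumptions made for illustration); it computes the slopes from sums of squares and cross products of deviations, the intercept from (3-7), and the right-hand side of (3-10).

```python
import numpy as np

# Real investment (Y), time trend (T) and real GNP (G) from Table 3.1.
Y = np.array([0.161, 0.172, 0.158, 0.173, 0.195, 0.217, 0.199, 0.163,
              0.195, 0.231, 0.257, 0.259, 0.225, 0.241, 0.204])
T = np.arange(1, 16, dtype=float)
G = np.array([1.058, 1.088, 1.086, 1.122, 1.186, 1.254, 1.246, 1.232,
              1.298, 1.370, 1.439, 1.479, 1.474, 1.503, 1.475])

# Deviations from sample means (the lowercase variables in the text).
y, t, g = Y - Y.mean(), T - T.mean(), G - G.mean()

# Slopes of the multiple regression from sums of squares and cross products.
den = (t @ t) * (g @ g) - (g @ t) ** 2
b2 = ((t @ y) * (g @ g) - (g @ y) * (t @ g)) / den   # coefficient on T
b3 = ((g @ y) * (t @ t) - (t @ y) * (t @ g)) / den   # coefficient on G
b1 = Y.mean() - b2 * T.mean() - b3 * G.mean()        # intercept, as in (3-7)

# Simple (bivariate) slopes and the squared correlation between G and T.
b_yg = (g @ y) / (g @ g)
b_yt = (t @ y) / (t @ t)
b_tg = (t @ g) / (g @ g)
r2_gt = (g @ t) ** 2 / ((g @ g) * (t @ t))

# Right-hand side of (3-10): the simple slope corrected for the influence of T.
b_yg_t = (b_yg - b_yt * b_tg) / (1.0 - r2_gt)

print(b1, b2, b3)          # multiple regression coefficients
print(b_yg, b_yt, b_yg_t)  # simple slopes and the corrected slope
```

The corrected slope `b_yg_t` coincides with `b3`, which is the content of (3-10), while the sign of the trend coefficient flips between the simple and the multiple regression.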

    It is clear from this expression that the magnitudes of $b_{y2 \cdot 3}$ and $b_{y2}$ can be quite different. They need not even have the same sign. As a final observation, note what becomes of $b_{yg \cdot t}$ in (3-10) if $r_{gt}^2$ equals zero. The first term becomes $b_{yg}$, whereas the second becomes zero. (If $G$ and $T$ are not correlated, then the slope in the regression of $T$ on $G$, $b_{tg}$, is zero.) Therefore, we conclude the following.

    THEOREM 3.1 Orthogonal Regression. If the variables in a multiple regression are not correlated (i.e., are orthogonal), then the multiple regression slopes are the same as the slopes in the individual simple regressions.

    In practice, you will never actually compute a multiple regression by hand or with a calculator. For a regression with more than three variables, the tools of matrix algebra are indispensable (as is a computer). Consider, for example, an enlarged model of investment that includes, in addition to the constant, time trend, and GNP, an interest rate and the rate of inflation. Least squares requires the simultaneous solution of five normal equations. Letting $X$ and $y$ denote the full data matrices shown previously, the normal equations in (3-5) are

    $$\begin{bmatrix} 15.000 & 120.00 & 19.310 & 111.79 & 99.770 \\ 120.000 & 1240.0 & 164.30 & 1035.9 & 875.60 \\ 19.310 & 164.30 & 25.218 & 148.98 & 131.22 \\ 111.79 & 1035.9 & 148.98 & 953.86 & 799.02 \\ 99.770 & 875.60 & 131.22 & 799.02 & 716.67 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \end{bmatrix} = \begin{bmatrix} 3.0500 \\ 26.004 \\ 3.9926 \\ 23.521 \\ 20.732 \end{bmatrix}.$$


    The solution is

    $$b = (X'X)^{-1}X'y = (-0.50907,\; -0.01658,\; 0.67038,\; -0.002326,\; -0.00009401)'.$$
    3.2.3 ALGEBRAIC ASPECTS OF THE LEAST SQUARES SOLUTION

    The normal equations are

    $$X'Xb - X'y = -X'(y - Xb) = -X'e = 0. \tag{3-12}$$

    Hence, for every column $x_k$ of $X$, $x_k'e = 0$. If the first column of $X$ is a column of 1s, then there are three implications.

    1. The least squares residuals sum to zero. This implication follows from $x_1'e = i'e = \sum_i e_i = 0$.
    2. The regression hyperplane passes through the point of means of the data. The first normal equation implies that $\bar{y} = \bar{x}'b$.
    3. The mean of the fitted values from the regression equals the mean of the actual values. This implication follows from point 1 because the fitted values are just $\hat{y} = Xb$.

    It is important to note that none of these results need hold if the regression does not contain a constant term.
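These implications are easy to confirm on the data of Table 3.1. The sketch below assumes Python with NumPy; the variable names are illustrative and not part of the text. It fits the five-variable investment regression, checks $X'e = 0$ and the three implications, and then refits without the constant to show the residual sum that results in that case.

```python
import numpy as np

# Columns of Table 3.1: real investment, trend, real GNP, interest rate, inflation.
Y = np.array([0.161, 0.172, 0.158, 0.173, 0.195, 0.217, 0.199, 0.163,
              0.195, 0.231, 0.257, 0.259, 0.225, 0.241, 0.204])
T = np.arange(1, 16, dtype=float)
G = np.array([1.058, 1.088, 1.086, 1.122, 1.186, 1.254, 1.246, 1.232,
              1.298, 1.370, 1.439, 1.479, 1.474, 1.503, 1.475])
R = np.array([5.16, 5.87, 5.95, 4.88, 4.50, 6.44, 7.83, 6.25,
              5.50, 5.46, 7.46, 10.28, 11.77, 13.42, 11.02])
P = np.array([4.40, 5.15, 5.37, 4.99, 4.16, 5.75, 8.82, 9.31,
              5.21, 5.83, 7.40, 8.64, 9.31, 9.44, 5.99])
X = np.column_stack([np.ones(15), T, G, R, P])

b, *_ = np.linalg.lstsq(X, Y, rcond=None)
fitted = X @ b          # fitted values Xb
e = Y - fitted          # least squares residuals

print(X.T @ e)                        # X'e: zero up to rounding error
print(e.sum())                        # implication 1: residuals sum to zero
print(Y.mean(), X.mean(axis=0) @ b)   # implication 2: ybar equals xbar'b
print(fitted.mean())                  # implication 3: mean of fitted values equals ybar

# Without the constant term, the residuals need not sum to zero.
X_nc = X[:, 1:]
b_nc, *_ = np.linalg.lstsq(X_nc, Y, rcond=None)
print((Y - X_nc @ b_nc).sum())
```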
    3.2.4 PROJECTION

    The vector of least squares residuals is

    $$e = y - Xb. \tag{3-13}$$

    Inserting the result in (3-6) for $b$ gives

    $$e = y - X(X'X)^{-1}X'y = (I - X(X'X)^{-1}X')y = My. \tag{3-14}$$

    The $n \times n$ matrix $M$ defined in (3-14) is fundamental in regression analysis. You can easily show that $M$ is both symmetric ($M = M'$) and idempotent ($M = M^2$). In view of (3-13), we can interpret $M$ as a matrix that produces the vector of least squares residuals in the regression of $y$ on $X$ when it premultiplies any vector $y$. (It will be convenient later on to refer to this matrix as a "residual maker.") It follows that

    $$MX = 0. \tag{3-15}$$

    One way to interpret this result is that if $X$ is regressed on $X$, a perfect fit will result and the residuals will be zero. Finally, (3-13) implies that $y = Xb + e$, which is the sample analog to (2-3). (See Figure 3.1 as well.) The least squares results partition $y$ into two parts, the fitted values $\hat{y} = Xb$ and the residuals $e$. [See Section A.3.7, especially (A-54).] Since $MX = 0$, these two parts are orthogonal. Now, given (3-13),

    $$\hat{y} = y - e = (I - M)y = X(X'X)^{-1}X'y = Py. \tag{3-16}$$

    The matrix $P$, which is also symmetric and idempotent, is a projection matrix. It is the matrix formed from $X$ such that when a vector $y$ is premultiplied by $P$, the result is the fitted values in the least squares regression of $y$ on $X$. This is also the projection of the vector $y$ into the column space of $X$. (See Sections A.3.5 and A.3.7.) By multiplying it out, you will find that, like $M$, $P$ is symmetric and idempotent. Given the earlier results, it also follows that $M$ and $P$ are orthogonal:

    $$PM = MP = 0.$$

    Finally, as might be expected from (3-15),

    $$PX = X.$$

    As a consequence of (3-15) and (3-16), we can see that least squares partitions the vector $y$ into two orthogonal parts,

    $$y = Py + My = \text{projection} + \text{residual}.$$

    The result is illustrated in Figure 3.2 for the two-variable case. The gray shaded plane is the column space of $X$. The projection and residual are the orthogonal dotted rays. We can also see the Pythagorean theorem at work in the sums of squares,

    $$y'y = y'P'Py + y'M'My = \hat{y}'\hat{y} + e'e.$$

    In manipulating equations involving least squares results, the following equivalent expressions for the sum of squared residuals are often useful:

    $$e'e = y'M'My = y'My = y'e = e'y,$$
    $$e'e = y'y - b'X'Xb = y'y - b'X'y = y'y - y'Xb.$$

    FIGURE 3.2 Projection of y into the Column Space of X. [Figure: the vector $y$, its projection $\hat{y}$ in the plane spanned by $x_1$ and $x_2$, and the residual vector $e$ orthogonal to that plane.]
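The properties of the residual maker and the projection matrix can be verified directly for a small data set. The following is a sketch under stated assumptions: Python with NumPy, simulated data, and illustrative names (`P`, `M`, `checks`) that are not from the text.

```python
import numpy as np

rng = np.random.default_rng(1)      # simulated data, for illustration only
n = 8
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)   # projection matrix X(X'X)^{-1}X'
M = np.eye(n) - P                       # residual maker I - P

y_hat = P @ y      # fitted values
e = M @ y          # residuals

checks = {
    "M symmetric": np.allclose(M, M.T),
    "M idempotent": np.allclose(M @ M, M),
    "MX = 0": np.allclose(M @ X, 0.0),
    "PX = X": np.allclose(P @ X, X),
    "PM = 0": np.allclose(P @ M, 0.0),
    "y = Py + My": np.allclose(y, y_hat + e),
    "y'y = yhat'yhat + e'e": np.isclose(y @ y, y_hat @ y_hat + e @ e),
    "e'e = y'My": np.isclose(e @ e, y @ M @ y),
}
for name, ok in checks.items():
    print(name, ok)
```

Forming the $n \times n$ matrices explicitly is done here only to mirror the algebra; for large $n$ one would normally compute the fitted values and residuals from $b$ instead.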

    3.3 PARTITIONED REGRESSION AND PARTIAL REGRESSION

    It is common to specify a multiple regression model when, in fact, interest centers on only one or a subset of the full set of variables. Consider the earnings equation discussed in Example 2.2. Although we are primarily interested in the association of earnings and education, age is, of necessity, included in the model. The question we consider here is what computations are involved in obtaining, in isolation, the coefficients of a subset of the variables in a multiple regression (for example, the coefficient of education in the aforementioned regression). Suppose that the regression involves two sets of variables $X_1$ and $X_2$. Thus,

    $$y = X\beta + \varepsilon = X_1\beta_1 + X_2\beta_2 + \varepsilon.$$

    What is the algebraic solution for $b_2$? The normal equations are

    $$\begin{matrix} (1) \\ (2) \end{matrix} \quad \begin{bmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} X_1'y \\ X_2'y \end{bmatrix}. \tag{3-17}$$

    A solution can be obtained by using the partitioned inverse matrix of (A-74). Alternatively, (1) and (2) in (3-17) can be manipulated directly to solve for $b_2$. We first solve (1) for $b_1$:

    $$b_1 = (X_1'X_1)^{-1}X_1'y - (X_1'X_1)^{-1}X_1'X_2 b_2 = (X_1'X_1)^{-1}X_1'(y - X_2 b_2). \tag{3-18}$$

    This solution states that $b_1$ is the set of coefficients in the regression of $y$ on $X_1$, minus a correction vector. We digress briefly to examine an important result embedded in (3-18). Suppose that $X_1'X_2 = 0$. Then $b_1 = (X_1'X_1)^{-1}X_1'y$, which is simply the coefficient vector in the regression of $y$ on $X_1$. The general result, which we have just proved, is the following theorem.

    THEOREM 3.2 Orthogonal Partitioned Regression. In the multiple linear least squares regression of $y$ on two sets of variables $X_1$ and $X_2$, if the two sets of variables are orthogonal, then the separate coefficient vectors can be obtained by separate regressions of $y$ on $X_1$ alone and $y$ on $X_2$ alone.

    Note that Theorem 3.2 encompasses Theorem 3.1. Now, inserting (3-18) in equation (2) of (3-17) produces

    $$X_2'X_1(X_1'X_1)^{-1}X_1'y - X_2'X_1(X_1'X_1)^{-1}X_1'X_2 b_2 + X_2'X_2 b_2 = X_2'y.$$

    After collecting terms, the solution is

    $$b_2 = \left[X_2'(I - X_1(X_1'X_1)^{-1}X_1')X_2\right]^{-1}\left[X_2'(I - X_1(X_1'X_1)^{-1}X_1')y\right] = (X_2'M_1X_2)^{-1}(X_2'M_1y). \tag{3-19}$$

    The matrix appearing in the parentheses inside each set of square brackets is the "residual maker" defined in (3-14), in this case defined for a regression on the columns of $X_1$.


    Thus, $M_1X_2$ is a matrix of residuals; each column of $M_1X_2$ is a vector of residuals in the regression of the corresponding column of $X_2$ on the variables in $X_1$. By exploiting the fact that $M_1$, like $M$, is idempotent, we can rewrite (3-19) as

    $$b_2 = (X_2^{*\prime}X_2^{*})^{-1}X_2^{*\prime}y^{*}, \tag{3-20}$$

    where $X_2^{*} = M_1X_2$ and $y^{*} = M_1y$.

    This result is fundamental in regression analysis.

    THEOREM 3.3 Frisch-Waugh Theorem. In the linear least squares regression of vector $y$ on two sets of variables, $X_1$ and $X_2$, the subvector $b_2$ is the set of coefficients obtained when the residuals from a regression of $y$ on $X_1$ alone are regressed on the set of residuals obtained when each column of $X_2$ is regressed on $X_1$.

    This process is commonly called partialing out or netting out the effect of $X_1$. For this reason, the coefficients in a multiple regression are often called the partial regression coefficients. The application of this theorem to the computation of a single coefficient, as suggested at the beginning of this section, is detailed in the following: Consider the regression of $y$ on a set of variables $X$ and an additional variable $z$. Denote the coefficients $b$ and $c$.

    COROLLARY 3.3.1 Individual Regression Coefficients. The coefficient on $z$ in a multiple regression of $y$ on $W = [X, z]$ is computed as $c = (z'Mz)^{-1}(z'My) = (z_*'z_*)^{-1}z_*'y_*$, where $z_*$ and $y_*$ are the residual vectors from least squares regressions of $z$ and $y$ on $X$; $z_* = Mz$ and $y_* = My$, where $M$ is defined in (3-14).
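The partialing-out computation can be illustrated numerically. The following sketch assumes Python with NumPy and simulated data; the helper functions `ols` and `residuals` are hypothetical names introduced here for illustration. It compares the coefficients on $X_2$ from the full regression with those from the residual-on-residual regression of (3-20), and recovers $b_1$ from (3-18).

```python
import numpy as np

rng = np.random.default_rng(2)      # simulated data, for illustration only
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = 0.5 * X1[:, [1]] + rng.normal(size=(n, 2))   # two columns, correlated with X1
y = X1 @ np.array([1.0, 2.0]) + X2 @ np.array([-1.0, 0.5]) + rng.normal(size=n)

def ols(A, v):
    """Least squares coefficients from regressing v on the columns of A."""
    return np.linalg.lstsq(A, v, rcond=None)[0]

def residuals(A, v):
    """Residuals from regressing v (a vector or a matrix of columns) on A."""
    return v - A @ ols(A, v)

# Full regression of y on [X1, X2]; keep the coefficients on X2.
b_full = ols(np.column_stack([X1, X2]), y)
b2_full = b_full[X1.shape[1]:]

# Frisch-Waugh: net X1 out of both y and X2, then regress residuals on residuals.
y_star = residuals(X1, y)
X2_star = residuals(X1, X2)
b2_fw = ols(X2_star, y_star)

# Recover b1 from (3-18): regress y - X2 b2 on X1.
b1 = ols(X1, y - X2 @ b2_full)

print(b2_full, b2_fw)
print(b_full[:X1.shape[1]], b1)
```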

    In terms of Example 2.2, we could obtain the coefficient on education in the multiple regression by first regressing earnings and education on age (or age and age squared) and then using the residuals from these regressions in a simple regression. In a classic application of this latter observation, Frisch and Waugh (1933), who are credited with the result, noted that in a time-series setting, the same results were obtained whether a regression was fitted with a time-trend variable or the data were first "detrended" by netting out the effect of time, as noted earlier, and using just the detrended data in a simple regression.[2]

    [2] Recall our earlier investment example.


    As an application of these results, consider the case in which $X_1$ is $i$, a column of 1s in the first column of $X$. The solution for $b_2$ in this case will then be the slopes in a regression with a constant term. The coefficient in a regression of any variable $z$ on $i$ is $(i'i)^{-1}i'z = \bar{z}$, the fitted values are $i\bar{z}$, and the residuals are $z_i - \bar{z}$. When we apply this to our previous results, we find the following.

    COROLLARY 3.3.2 Regression with a Constant Term. The slopes in a multiple regression that contains a constant term are obtained by transforming the data to deviations from their means and then regressing the variable $y$ in deviation form on the explanatory variables, also in deviation form.

    [We used this result in (3-8).] Having obtained the coefficients on $X_2$, how can we recover the coefficients on $X_1$ (the constant term)? One way is to repeat the exercise while reversing the roles of $X_1$ and $X_2$. But there is an easier way. We have already solved for $b_2$. Therefore, we can use (3-18) in a solution for $b_1$. If $X_1$ is just a column of 1s, then the first of these produces the familiar result

    $$b_1 = \bar{y} - \bar{x}_2 b_2 - \cdots - \bar{x}_K b_K \tag{3-21}$$

    [which is used in (3-7)].

    3.4 PARTIAL REGRESSION AND PARTIAL CORRELATION COEFFICIENTS

    The use of multiple regression involves a conceptual experiment that we might not be able to carry out in practice, the ceteris paribus analysis familiar in economics. To pursue Example 2.2, a regression equation relating earnings to age and education enables us to do the conceptual experiment of comparing the earnings of two individuals of the same age with different education levels, even if the sample contains no such pair of individuals. It is this characteristic of the regression that is implied by the term partial regression coefficients. The way we obtain this result, as we have seen, is first to regress income and education on age and then to compute the residuals from this regression. By construction, age will not have any power in explaining variation in these residuals. Therefore, any correlation between income and education after this "purging" is independent of (or after removing the effect of) age.

    The same principle can be applied to the correlation between two variables. To continue our example, to what extent can we assert that this correlation reflects a direct relationship rather than that both income and education tend, on average, to rise as individuals become older? To find out, we would use a partial correlation coefficient, which is computed along the same lines as the partial regression coefficient. In the context of our example, the partial correlation coefficient between income and education, controlling for the effect of age, is obtained as follows:

    1. $y_*$ = the residuals in a regression of income on a constant and age.
    2. $z_*$ = the residuals in a regression of education on a constant and age.
    3. The partial correlation $r_{yz}^{*}$ is the simple correlation between $y_*$ and $z_*$.

    This calculation might seem to require a formidable amount of computation. There is, however, a convenient shortcut. Once the multiple regression is computed, the t ratio in (4-13) and (4-14) for testing the hypothesis that the coefficient equals zero (e.g., the last column of Table 4.2) can be used to compute

    $$r_{yz}^{*2} = \frac{t_z^2}{t_z^2 + \text{degrees of freedom}}. \tag{3-22}$$

    The proof of this less than perfectly intuitive result will be useful to illustrate some results on partitioned regression and to put into context two very useful results from least squares algebra. As in Corollary 3.3.1, let $W$ denote the $n \times (K + 1)$ regressor matrix $[X, z]$ and let $M = I - X(X'X)^{-1}X'$. We assume that there is a constant term in $X$, so that the vectors of residuals $y_* = My$ and $z_* = Mz$ will have zero sample means. The squared partial correlation is

    $$r_{yz}^{*2} = \frac{(z_*'y_*)^2}{(z_*'z_*)(y_*'y_*)}.$$

    Let $c$ and $u$ denote the coefficient on $z$ and the vector of residuals in the multiple regression of $y$ on $W$. The squared t ratio in (3-22) is

    $$t_z^2 = \frac{c^2}{\left[\dfrac{u'u}{n - (K + 1)}\right](W'W)^{-1}_{K+1,K+1}},$$

    where $(W'W)^{-1}_{K+1,K+1}$ is the $(K + 1)$th (last) diagonal element of $(W'W)^{-1}$. The partitioned inverse formula in (A-74) can be applied to the matrix $[X, z]'[X, z]$. This matrix appears in (3-17), with $X_1 = X$ and $X_2 = z$. The result is the inverse matrix that appears in (3-19) and (3-20), which implies the first important result.

    THEOREM 3.4 Diagonal Elements of the Inverse of a Moment Matrix. If $W = [X, z]$, then the last diagonal element of $(W'W)^{-1}$ is $(z'Mz)^{-1} = (z_*'z_*)^{-1}$, where $z_* = Mz$ and $M = I - X(X'X)^{-1}X'$.

    (Note that this result generalizes the development in Section A.2.8, where $X$ is only the constant term.) If we now use Corollary 3.3.1 and Theorem 3.4 for $c$, after some manipulation, we obtain

    $$\frac{t_z^2}{t_z^2 + [n - (K + 1)]} = \frac{(z_*'y_*)^2}{(z_*'y_*)^2 + (u'u)(z_*'z_*)} = \frac{r_{yz}^{*2}}{r_{yz}^{*2} + (u'u)/(y_*'y_*)},$$

    where $u = y - Xd - zc$ is the vector of residuals when $y$ is regressed on $X$ and $z$. Note that unless $X'z = 0$, $d$ will not equal $b = (X'X)^{-1}X'y$. (See Section 8.2.1.) Moreover, unless $c = 0$, $u$ will not equal $e = y - Xb$. Now we have shown in Corollary 3.3.1 that $c = (z_*'z_*)^{-1}(z_*'y_*)$. We also have, from (3-18), that the coefficients on $X$ in the regression of $y$ on $W = [X, z]$ are

    $$d = (X'X)^{-1}X'(y - zc) = b - (X'X)^{-1}X'zc.$$

    So, inserting this expression for $d$ in that for $u$ gives

    $$u = y - Xb + X(X'X)^{-1}X'zc - zc = e - Mzc = e - z_*c.$$

    Now

    $$u'u = e'e + c^2(z_*'z_*) - 2c\,z_*'e.$$

    But $e = My = y_*$ and $z_*'e = z_*'y_* = c(z_*'z_*)$. Inserting this in $u'u$ gives our second useful result.

    THEOREM 3.5 Change in the Sum of Squares When a Variable Is Added to a Regression. If $e'e$ is the sum of squared residuals when $y$ is regressed on $X$ and $u'u$ is the sum of squared residuals when $y$ is regressed on $X$ and $z$, then

    $$u'u = e'e - c^2(z_*'z_*) \le e'e, \tag{3-23}$$

    where $c$ is the coefficient on $z$ in the long regression and $z_* = [I - X(X'X)^{-1}X']z$ is the vector of residuals when $z$ is regressed on $X$.

    Returning to our derivation, we note that $e'e = y_*'y_*$ and $c^2(z_*'z_*) = (z_*'y_*)^2/(z_*'z_*)$. Therefore, $(u'u)/(y_*'y_*) = 1 - r_{yz}^{*2}$, and we have our result.
    Example 3.1 Partial Correlations

    For the data of the application in Section 3.2.2, the simple correlations between investment and the regressors, $r_{yk}$, and the partial correlations, $r_{yk}^{*}$, between investment and the four regressors (given the other variables) are listed in Table 3.2. As is clear from the table, there is no necessary relation between the simple and partial correlation coefficients. One thing worth noting is the signs of the coefficients. The signs of the partial correlation coefficients are the same as the signs of the respective regression coefficients, three of which are negative. All the simple correlation coefficients are positive because of the latent "effect" of time.

    TABLE 3.2 Correlations of Investment with Other Variables

                  Simple Correlation   Partial Correlation
    Time               0.7496              -0.9360
    GNP                0.8632               0.9680
    Interest           0.5871              -0.5167
    Inflation          0.4777              -0.0221
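Both routes to the partial correlation, the residual-based definition and the t ratio shortcut in (3-22), can be compared on the data of Table 3.1. The following is a sketch assuming Python with NumPy; the names and the helper `residuals` are illustrative and not from the text, and the conventional t ratio (coefficient divided by its estimated standard error, with $s^2 = u'u/(n - K - 1)$) is assumed here ahead of its formal treatment in Chapter 4.

```python
import numpy as np

Y = np.array([0.161, 0.172, 0.158, 0.173, 0.195, 0.217, 0.199, 0.163,
              0.195, 0.231, 0.257, 0.259, 0.225, 0.241, 0.204])
T = np.arange(1, 16, dtype=float)
G = np.array([1.058, 1.088, 1.086, 1.122, 1.186, 1.254, 1.246, 1.232,
              1.298, 1.370, 1.439, 1.479, 1.474, 1.503, 1.475])
R = np.array([5.16, 5.87, 5.95, 4.88, 4.50, 6.44, 7.83, 6.25,
              5.50, 5.46, 7.46, 10.28, 11.77, 13.42, 11.02])
P = np.array([4.40, 5.15, 5.37, 4.99, 4.16, 5.75, 8.82, 9.31,
              5.21, 5.83, 7.40, 8.64, 9.31, 9.44, 5.99])
W = np.column_stack([np.ones(15), T, G, R, P])
names = ["Time", "GNP", "Interest", "Inflation"]

def residuals(A, v):
    """Residuals from the least squares regression of v on the columns of A."""
    return v - A @ np.linalg.lstsq(A, v, rcond=None)[0]

n, k = W.shape                                # k counts all columns, constant included
b = np.linalg.lstsq(W, Y, rcond=None)[0]
u = Y - W @ b
s2 = (u @ u) / (n - k)
t_ratios = b / np.sqrt(s2 * np.diag(np.linalg.inv(W.T @ W)))

simple, partial, partial_sq_from_t = [], [], []
for j in range(1, k):
    z = W[:, j]
    others = np.delete(W, j, axis=1)          # constant and the remaining regressors
    y_star, z_star = residuals(others, Y), residuals(others, z)
    simple.append(np.corrcoef(Y, z)[0, 1])
    partial.append((z_star @ y_star) / np.sqrt((z_star @ z_star) * (y_star @ y_star)))
    partial_sq_from_t.append(t_ratios[j] ** 2 / (t_ratios[j] ** 2 + (n - k)))

for row in zip(names, simple, partial, partial_sq_from_t):
    print(row)
```

The last two columns printed should agree once the partial correlation is squared, and each partial correlation carries the sign of the corresponding regression coefficient.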

    3.5 GOODNESS OF FIT AND THE ANALYSIS OF VARIANCE

    The original fitting criterion, the sum of squared residuals, suggests a measure of the fit of the regression line to the data. However, as can easily be verified, the sum of squared residuals can be scaled arbitrarily just by multiplying all the values of $y$ by the desired scale factor. Since the fitted values of the regression are based on the values of $x$, we might ask instead whether variation in $x$ is a good predictor of variation in $y$. Figure 3.3 shows three possible cases for a simple linear regression model. The measure of fit described here embodies both the fitting criterion and the covariation of $y$ and $x$. Variation of the dependent variable is defined in terms of deviations from its mean, $(y_i - \bar{y})$. The total variation in $y$ is the sum of squared deviations:

    $$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2.$$

    FIGURE 3.3 Sample Data. [Figure: three scatter plots of $y$ against $x$; the panel titles are "No Fit," "Moderate Fit," and "No Fit."]

    FIGURE 3.4 Decomposition of $y_i$. [Figure: for a single observation $(x_i, y_i)$ in the two-variable regression, the deviation $y_i - \bar{y}$ is split into $\hat{y}_i - \bar{y} = b(x_i - \bar{x})$ and the residual $e_i = y_i - \hat{y}_i$.]

    In terms of the regression equation, we may write the full set of observations as

    $$y = Xb + e = \hat{y} + e.$$

    For an individual observation, we have

    $$y_i = \hat{y}_i + e_i = x_i'b + e_i.$$

    If the regression contains a constant term, then the residuals will sum to zero and the mean of the predicted values of $y_i$ will equal the mean of the actual values. Subtracting $\bar{y}$ from both sides and using this result and result 2 in Section 3.2.3 gives

    $$y_i - \bar{y} = \hat{y}_i - \bar{y} + e_i = (x_i - \bar{x})'b + e_i.$$

    Figure 3.4 illustrates the computation for the two-variable regression. Intuitively, the regression would appear to fit well if the deviations of $y$ from its mean are more largely accounted for by deviations of $x$ from its mean than by the residuals. Since both terms in this decomposition sum to zero, to quantify this fit, we use the sums of squares instead. For the full set of observations, we have

    $$M^0y = M^0Xb + M^0e,$$

    where $M^0$ is the $n \times n$ idempotent matrix that transforms observations into deviations from sample means. (See Section A.2.8.) The column of $M^0X$ corresponding to the constant term is zero, and, since the residuals already have mean zero, $M^0e = e$. Then, since $e'M^0X = e'X = 0$, the total sum of squares is

    $$y'M^0y = b'X'M^0Xb + e'e.$$

    Write this as total sum of squares = regression sum of squares + error sum of squares, or

    $$SST = SSR + SSE. \tag{3-25}$$

    (Note that this is precisely the partitioning that appears at the end of Section 3.2.4.) We can now obtain a measure of how well the regression line fits the data by using the coefficient of determination:

    $$\frac{SSR}{SST} = \frac{b'X'M^0Xb}{y'M^0y} = 1 - \frac{e'e}{y'M^0y}. \tag{3-26}$$

    The coefficient of determination is denoted $R^2$. As we have shown, it must be between 0 and 1, and it measures the proportion of the total variation in $y$ that is accounted for by variation in the regressors. It equals zero if the regression is a horizontal line, that is, if all the elements of $b$ except the constant term are zero. In this case, the predicted values of $y$ are always $\bar{y}$, so deviations of $x$ from its mean do not translate into different predictions for $y$. As such, $x$ has no explanatory power. The other extreme, $R^2 = 1$, occurs if the values of $x$ and $y$ all lie in the same hyperplane (on a straight line for a two-variable regression) so that the residuals are all zero. If all the values of $y_i$ lie on a vertical line, then $R^2$ has no meaning and cannot be computed.

    Regression analysis is often used for forecasting. In this case, we are interested in how well the regression model predicts movements in the dependent variable. With this in mind, an equivalent way to compute $R^2$ is also useful. First,

    $$b'X'M^0Xb = \hat{y}'M^0\hat{y},$$

    but $\hat{y} = Xb$, $y = \hat{y} + e$, $M^0e = e$, and $X'e = 0$, so $\hat{y}'M^0\hat{y} = \hat{y}'M^0y$. Multiply $R^2 = \hat{y}'M^0\hat{y}/y'M^0y = \hat{y}'M^0y/y'M^0y$ by $1 = \hat{y}'M^0y/\hat{y}'M^0\hat{y}$ to obtain

    $$R^2 = \frac{\left[\sum_i (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})\right]^2}{\left[\sum_i (y_i - \bar{y})^2\right]\left[\sum_i (\hat{y}_i - \bar{\hat{y}})^2\right]}, \tag{3-27}$$

    which is the squared correlation between the observed values of $y$ and the predictions produced by the estimated regression equation.
    Example 3.2 Fit of a Consumption Function

    The data plotted in Figure 2.1 are listed in Appendix Table F2.1. For these data, where $y$ is $C$ and $x$ is $X$, we have $\bar{y} = 273.2727$, $\bar{x} = 323.2727$, $S_{yy} = 12{,}618.182$, $S_{xx} = 12{,}300.182$, $S_{xy} = 8{,}423.182$, so $SST = 12{,}618.182$, $b = 8{,}423.182/12{,}300.182 = 0.6848014$, $SSR = b^2 S_{xx} = 5{,}768.2068$, and $SSE = SST - SSR = 6{,}849.975$. Then $R^2 = b^2 S_{xx}/SST = 0.457135$. As can be seen in Figure 2.1, this is a moderate fit, although it is not particularly good for aggregate time-series data. On the other hand, it is clear that not accounting for the anomalous wartime data has degraded the fit of the model. This value is the $R^2$ for the model indicated by the dotted line in the figure. By simply omitting the years 1942-1945 from the sample and doing these computations with the remaining seven observations (the heavy solid line), we obtain an $R^2$ of 0.93697. Alternatively, by creating a variable WAR which equals 1 in the years 1942-1945 and zero otherwise and including this in the model, which produces the model shown by the two solid lines, the $R^2$ rises to 0.94639.

    We can summarize the calculation of $R^2$ in an analysis of variance table, which might appear as shown in Table 3.3.

    Example 3.3 Analysis of Variance for an Investment Equation

    The analysis of variance table for the investment equation of Section 3.2.2 is given in Table 3.4.

    TABLE 3.3 Analysis of Variance

    Source        Sum of Squares          Degrees of Freedom                      Mean Square
    Regression    b'X'y - n ybar^2        K - 1 (assuming a constant term)
    Residual      e'e                     n - K                                   s^2
    Total         y'y - n ybar^2          n - 1                                   S_yy/(n - 1) = s_y^2

    Coefficient of determination: R^2 = 1 - e'e/(y'y - n ybar^2)

    TABLE 3.4 Analysis of Variance for the Investment Equation

    Source        Sum of Squares          Degrees of Freedom                      Mean Square
    Regression    0.0159025               4                                       0.003976
    Residual      0.0004508               10                                      0.00004508
    Total         0.016353                14                                      0.0011681

    R^2 = 0.0159025/0.016353 = 0.97245
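The layout of Table 3.3 translates directly into a few lines of code. The sketch below assumes Python with NumPy and uses the data of Table 3.1; the variable names are illustrative. It computes the three sums of squares, their degrees of freedom and mean squares, and $R^2$ both from (3-26) and as the squared correlation in (3-27).

```python
import numpy as np

Y = np.array([0.161, 0.172, 0.158, 0.173, 0.195, 0.217, 0.199, 0.163,
              0.195, 0.231, 0.257, 0.259, 0.225, 0.241, 0.204])
T = np.arange(1, 16, dtype=float)
G = np.array([1.058, 1.088, 1.086, 1.122, 1.186, 1.254, 1.246, 1.232,
              1.298, 1.370, 1.439, 1.479, 1.474, 1.503, 1.475])
R = np.array([5.16, 5.87, 5.95, 4.88, 4.50, 6.44, 7.83, 6.25,
              5.50, 5.46, 7.46, 10.28, 11.77, 13.42, 11.02])
P = np.array([4.40, 5.15, 5.37, 4.99, 4.16, 5.75, 8.82, 9.31,
              5.21, 5.83, 7.40, 8.64, 9.31, 9.44, 5.99])
X = np.column_stack([np.ones(15), T, G, R, P])
n, K = X.shape

b = np.linalg.lstsq(X, Y, rcond=None)[0]
y_hat = X @ b
e = Y - y_hat

sst = np.sum((Y - Y.mean()) ** 2)        # total: y'M0y
ssr = np.sum((y_hat - Y.mean()) ** 2)    # regression: b'X'M0Xb
sse = e @ e                              # residual: e'e

r2 = 1.0 - sse / sst                          # (3-26)
r2_corr = np.corrcoef(Y, y_hat)[0, 1] ** 2    # (3-27)

print(f"Regression  {ssr:.7f}  {K - 1:2d}  {ssr / (K - 1):.8f}")
print(f"Residual    {sse:.7f}  {n - K:2d}  {sse / (n - K):.8f}")
print(f"Total       {sst:.7f}  {n - 1:2d}  {sst / (n - 1):.8f}")
print(f"R-squared   {r2:.5f}  (squared correlation: {r2_corr:.5f})")
```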

    3.5.1 THE ADJUSTED R-SQUARED AND A MEASURE OF FIT

    There are some problems with the use of $R^2$ in analyzing goodness of fit. The first concerns the number of degrees of freedom used up in estimating the parameters. $R^2$ will never decrease when another variable is added to a regression equation. Equation (3-23) provides a convenient means for us to establish this result. Once again, we are comparing a regression of $y$ on $X$ with sum of squared residuals $e'e$ to a regression of $y$ on $X$ and an additional variable $z$, which produces sum of squared residuals $u'u$. Recall the vectors of residuals $z_* = Mz$ and $y_* = My = e$, which implies that $e'e = y_*'y_*$. Let $c$ be the coefficient on $z$ in the longer regression. Then $c = (z_*'z_*)^{-1}(z_*'y_*)$, and inserting this in (3-23) produces

    $$u'u = e'e - \frac{(z_*'y_*)^2}{z_*'z_*} = e'e\,(1 - r_{yz}^{*2}), \tag{3-28}$$

    where $r_{yz}^{*}$ is the partial correlation between $y$ and $z$, controlling for $X$. Now divide through both sides of the equality by $y'M^0y$. From (3-26), $u'u/y'M^0y$ is $(1 - R_{Xz}^2)$ for the regression on $X$ and $z$ and $e'e/y'M^0y$ is $(1 - R_X^2)$. Rearranging the result produces the following:

    THEOREM 3.6 Change in $R^2$ When a Variable Is Added to a Regression. Let $R_{Xz}^2$ be the coefficient of determination in the regression of $y$ on $X$ and an additional variable $z$, let $R_X^2$ be the same for the regression of $y$ on $X$ alone, and let $r_{yz}^{*}$ be the partial correlation between $y$ and $z$, controlling for $X$. Then

    $$R_{Xz}^2 = R_X^2 + (1 - R_X^2)\, r_{yz}^{*2}. \tag{3-29}$$
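Equations (3-23), (3-28), and (3-29) can be checked together on a small simulated example. The following is a sketch assuming Python with NumPy; the data are artificial and the helper `fit` is a hypothetical name introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)          # simulated data, for illustration only
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
z = 0.3 * X[:, 1] + rng.normal(size=n)  # additional variable, correlated with X
y = X @ np.array([1.0, 0.5, -0.5]) + 0.4 * z + rng.normal(size=n)

def fit(A, v):
    """Return (coefficients, residuals) from the least squares regression of v on A."""
    coef = np.linalg.lstsq(A, v, rcond=None)[0]
    return coef, v - A @ coef

_, e = fit(X, y)                                   # short regression: y on X
coef_long, u = fit(np.column_stack([X, z]), y)     # long regression: y on X and z
c = coef_long[-1]                                  # coefficient on z
_, z_star = fit(X, z)                              # z* = Mz
y_star = e                                         # y* = My = e

r2_partial = (z_star @ y_star) ** 2 / ((z_star @ z_star) * (y_star @ y_star))
sst = np.sum((y - y.mean()) ** 2)                  # y'M0y
R2_X = 1.0 - (e @ e) / sst
R2_Xz = 1.0 - (u @ u) / sst

print(np.isclose(u @ u, e @ e - c ** 2 * (z_star @ z_star)))   # (3-23)
print(np.isclose(u @ u, (e @ e) * (1.0 - r2_partial)))         # (3-28)
print(np.isclose(R2_Xz, R2_X + (1.0 - R2_X) * r2_partial))     # (3-29)
```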


    Thus the R2 in the longer regression cannot be smaller It is tempting to exploit this result by just adding variables to the model R2 will continue to rise to its limit of 1 3 The adjusted R2 for degrees of freedom which incorporates a penalty for these results is computed as follows R2 1 e e n K 4 y M0 y n 1 3 30

    For computational purposes the connection between R2 and R2 is R2 1 n 1 1 R2 n K

    The adjusted R2 may decline when a variable is added to the set of independent variables Indeed R2 may even be negative To consider an admittedly extreme case suppose that x and y have a sample correlation of zero Then the adjusted R2 will equal 1 n 2 Thus the name adjusted R squared is a bit misleading as can be seen in 3 30 R2 is not actually computed as the square of any quantity Whether R2 rises or falls depends on whether the contribution of the new variable to the t of the regression more than offsets the correction for the loss of an additional degree of freedom The general result the proof of which is left as an exercise is as follows

    THEOREM 3.7 Change in $\bar{R}^2$ When a Variable Is Added to a Regression
    In a multiple regression, $\bar{R}^2$ will fall (rise) when the variable $x$ is deleted from the regression if the t ratio associated with this variable is greater (less) than 1 in absolute value.
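A numerical check of Theorem 3.7 can be sketched as follows. The simulated data, the seed, and the helper `ols_summary` are assumptions made for illustration; the comparison uses the absolute value of the t ratio.

```python
import numpy as np

def ols_summary(y, X):
    """Return coefficients, t ratios, and adjusted R^2 (X is assumed to contain a constant)."""
    n, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    s2 = (e @ e) / (n - K)
    t = b / np.sqrt(s2 * np.diag(XtX_inv))
    tss = np.sum((y - y.mean()) ** 2)
    adj_r2 = 1.0 - ((e @ e) / (n - K)) / (tss / (n - 1))
    return b, t, adj_r2

rng = np.random.default_rng(1)            # arbitrary seed
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 1.0, 0.05, 0.0]) + rng.normal(size=n)

_, t_full, adj_full = ols_summary(y, X)

checks = []
for k in range(1, X.shape[1]):            # delete each non-constant regressor in turn
    _, _, adj_drop = ols_summary(y, np.delete(X, k, axis=1))
    falls = adj_drop < adj_full
    checks.append((abs(t_full[k]) > 1.0, falls))
    print(k, t_full[k], adj_drop - adj_full)
```

In each row of `checks`, the first entry records whether the t ratio exceeds 1 in absolute value and the second whether the adjusted $R^2$ fell on deletion; the theorem says the two should agree.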

    We have shown that $R^2$ will never fall when a variable is added to the regression. We now consider this result more generally. The change in the residual sum of squares when a set of variables $X_2$ is added to the regression is

    $$e_{1,2}'e_{1,2} = e_1'e_1 - b_2'X_2'M_1X_2b_2,$$

    where we use subscript 1 to indicate the regression based on $X_1$ alone and 1,2 to indicate the use of both $X_1$ and $X_2$. The coefficient vector $b_2$ is the coefficients on $X_2$ in the multiple regression of $y$ on $X_1$ and $X_2$. [See (3-19) and (3-20) for definitions of $b_2$ and $M_1$.] Therefore,

    $$R_{1,2}^2 = 1 - \frac{e_1'e_1 - b_2'X_2'M_1X_2b_2}{y'M^0y} = R_1^2 + \frac{b_2'X_2'M_1X_2b_2}{y'M^0y},$$

    3. This result comes at a cost, however. The parameter estimates become progressively less precise as we do so. We will pursue this result in Chapter 4.

    4. This measure is sometimes advocated on the basis of the unbiasedness of the two quantities in the fraction. Since the ratio is not an unbiased estimator of any population quantity, it is difficult to justify the adjustment on this basis.

    which is greater than $R_1^2$ unless $b_2$ equals zero. ($M_1X_2$ could not be zero unless $X_2$ was a linear function of $X_1$, in which case the regression on $X_1$ and $X_2$ could not be computed.) This equation can be manipulated a bit further to obtain

    $$R_{1,2}^2 = R_1^2 + \frac{y'M_1y}{y'M^0y} \cdot \frac{b_2'X_2'M_1X_2b_2}{y'M_1y}.$$

    But $y'M_1y = e_1'e_1$, so the first term in the product is $1 - R_1^2$. The second is the multiple correlation in the regression of $M_1y$ on $M_1X_2$, or the partial correlation (after the effect of $X_1$ is removed) in the regression of $y$ on $X_2$. Collecting terms, we have

    $$R_{1,2}^2 = R_1^2 + \left(1 - R_1^2\right) r_{y2 \cdot 1}^2.$$

    This is the multivariate counterpart to (3-29). Therefore, it is possible to push $R^2$ as high as desired just by adding regressors. This possibility motivates the use of the adjusted R-squared in (3-30), instead of $R^2$, as a method of choosing among alternative models. Since $\bar{R}^2$ incorporates a penalty for reducing the degrees of freedom while still revealing an improvement in fit, one possibility is to choose the specification that maximizes $\bar{R}^2$. It has been suggested that the adjusted R-squared does not penalize the loss of degrees of freedom heavily enough.[5] Some alternatives that have been proposed for comparing models (which we index by $j$) are

    $$\tilde{R}_j^2 = 1 - \frac{n + K_j}{n - K_j}\left(1 - R_j^2\right),$$

    the maximization of which is equivalent to minimizing Amemiya's (1985) prediction criterion,

    $$PC_j = \frac{e_j'e_j}{n - K_j}\left(1 + \frac{K_j}{n}\right) = s_j^2\left(1 + \frac{K_j}{n}\right),$$

    and the Akaike and Bayesian information criteria, which are given in (8-18) and (8-19).
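The fit measures discussed above can be computed side by side for a set of candidate models. This is a sketch under stated assumptions: the simulated data, the nested candidate models, and the helper name `fit_criteria` are hypothetical, and each model is assumed to contain a constant.

```python
import numpy as np

def fit_criteria(y, X):
    """R^2, adjusted R^2, Amemiya's alternative R^2, and PC for least squares of y on X."""
    n, K = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    ee = e @ e
    tss = np.sum((y - y.mean()) ** 2)                 # y'M0y
    r2 = 1.0 - ee / tss
    adj_r2 = 1.0 - (n - 1) / (n - K) * (1.0 - r2)     # (3-30), computational form
    tilde_r2 = 1.0 - (n + K) / (n - K) * (1.0 - r2)
    pc = ee / (n - K) * (1.0 + K / n)                 # prediction criterion
    return {"R2": r2, "adj_R2": adj_r2, "tilde_R2": tilde_r2, "PC": pc, "tss": tss}

rng = np.random.default_rng(2)                        # arbitrary seed
n = 60
x = rng.normal(size=(n, 4))
y = 1.0 + x[:, 0] + 0.5 * x[:, 1] + rng.normal(size=n)

# Candidate model j: a constant plus the first j columns of x.
results = {j: fit_criteria(y, np.column_stack([np.ones(n), x[:, :j]]))
           for j in range(1, 5)}
for j, r in results.items():
    print(j, r["R2"], r["adj_R2"], r["tilde_R2"], r["PC"])
```

Since $PC_j = (1 - \tilde{R}_j^2)\, y'M^0y / n$ and $y'M^0y$ is the same for every candidate, ranking models by the largest $\tilde{R}_j^2$ and by the smallest $PC_j$ gives the same choice.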
    3.5.2 R-SQUARED AND THE CONSTANT TERM IN THE MODEL

    A second difficulty with $R^2$ concerns the constant term in the model. The proof that $0 \le R^2 \le 1$ requires $X$ to contain a column of 1s. If not, then (1) $M^0e \ne e$ and (2) $e'M^0X \ne 0$, and the term $2e'M^0Xb$ in $y'M^0y = (M^0Xb + M^0e)'(M^0Xb + M^0e)$ in the preceding expansion will not drop out. Consequently, when we compute

    $$R^2 = 1 - \frac{e'e}{y'M^0y},$$

    the result is unpredictable. It will never be higher and can be far lower than the same figure computed for the regression with a constant term included. It can even be negative. Computer packages differ in their computation of $R^2$. An alternative computation,

    $$R^2 = \frac{b'X'M^0y}{y'M^0y},$$

    is equally problematic. Again, this calculation will differ from the one obtained with the constant term included; this time, $R^2$ may be larger than 1. Some computer packages
    5. See, for example, Amemiya (1985, pp. 50-51).


    bypass these difficulties by reporting a third $R^2$, the squared sample correlation between the actual values of $y$ and the fitted values from the regression. This approach could be deceptive. If the regression contains a constant term, then, as we have seen, all three computations give the same answer. Even if not, this last one will still produce a value between zero and one. But it is not a proportion of variation explained. On the other hand, for the purpose of comparing models, this squared correlation might well be a useful descriptive device. It is important for users of computer packages to be aware of how the reported $R^2$ is computed. Indeed, some packages will give a warning in the results when a regression is fit without a constant or by some technique other than linear least squares.
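The three computations can be compared directly. In this sketch the second measure is computed as $b'X'M^0y / y'M^0y$ (the fitted values multiplied into $y$ in deviations from its mean), which is the form that coincides with the other two when a constant is present; the simulated data and the helper name `r2_variants` are illustrative assumptions.

```python
import numpy as np

def r2_variants(y, X):
    """Three candidate R^2 computations for the least squares regression of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ b
    e = y - fitted
    y_dev = y - y.mean()                          # M0 y
    tss = y_dev @ y_dev                           # y'M0y
    r2_resid = 1.0 - (e @ e) / tss                # 1 - e'e / y'M0y
    r2_fit = (fitted @ y_dev) / tss               # b'X'M0y / y'M0y
    r2_corr = np.corrcoef(y, fitted)[0, 1] ** 2   # squared correlation of y and fitted values
    return r2_resid, r2_fit, r2_corr

rng = np.random.default_rng(3)                    # arbitrary seed
n = 100
x = rng.normal(loc=2.0, size=n)
y = 5.0 + 1.0 * x + rng.normal(size=n)

with_const = r2_variants(y, np.column_stack([np.ones(n), x]))
no_const = r2_variants(y, x.reshape(-1, 1))
print("with constant:   ", with_const)
print("without constant:", no_const)
```

With the constant included the three numbers agree. Without it they generally differ, the first cannot exceed its with-constant value, and only the squared correlation is guaranteed to lie between zero and one.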
    3.5.3 COMPARING MODELS

    The value of $R^2$ we obtained for the consumption function in Example 3.2 seems high in an absolute sense. Is it? Unfortunately, there is no absolute basis for comparison. In fact, in using aggregate time-series data, coefficients of determination this high are routine. In terms of the values one normally encounters in cross sections, an $R^2$ of 0.5 is relatively high. Coefficients of determination in cross sections of individual data as high as 0.2 are sometimes noteworthy. The point of this discussion is that whether a regression line provides a good fit to a body of data depends on the setting.

    Little can be said about the relative quality of fits of regression lines in different contexts or in different data sets, even if supposedly generated by the same data generating mechanism. One must be careful, however, even in a single context, to be sure to use the same basis for comparison for competing models. Usually, this concern is about how the dependent variable is computed. For example, a perennial question concerns whether a linear or loglinear model fits the data better. Unfortunately, the question cannot be answered with a direct comparison. An $R^2$ for the linear regression model is different from an $R^2$ for the loglinear model. Variation in $y$ is different from variation in $\ln y$. The latter $R^2$ will typically be larger, but this does not imply that the loglinear model is a better fit in some absolute sense.

    It is worth emphasizing that $R^2$ is a measure of linear association between $x$ and $y$. For example, the third panel of Figure 3.3 shows data that might arise from the model

    $$y_i = \alpha + \beta (x_i - \gamma)^2 + \varepsilon_i.$$

    (The constant $\gamma$ allows $x$ to be distributed about some value other than zero.) The relationship between $y$ and $x$ in this model is nonlinear, and a linear regression would find no fit.

    A final word of caution is in order. The interpretation of $R^2$ as a proportion of variation explained is dependent on the use of least squares to compute the fitted values. It is always correct to write

    $$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + e_i$$

    regardless of how $\hat{y}_i$ is computed. Thus, one might use $\hat{y}_i = \exp(\widehat{\ln y_i})$ from a loglinear model in computing the sum of squares on the two sides; however, the cross-product term vanishes only if least squares is used to compute the fitted values and if the model contains a constant term. Thus, in the suggested example, it would still be unclear whether the linear or loglinear model fits better; the cross-product term has been ignored


    in computing $R^2$ for the loglinear model. Only in the case of least squares applied to a linear equation with a constant term can $R^2$ be interpreted as the proportion of variation in $y$ explained by variation in $x$. An analogous computation can be done without computing deviations from means if the regression does not contain a constant term. Other purely algebraic artifacts will crop up in regressions without a constant, however. For example, the value of $R^2$ will change when the same constant is added to each observation on $y$, but it is obvious that nothing fundamental has changed in the regression relationship. One should be wary (even skeptical) in the calculation and interpretation of fit measures for regressions without constant terms.
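The artifact just described is easy to reproduce. The sketch below assumes that the "analogous computation" without deviations from means is the uncentered measure $1 - e'e/y'y$; that reading, the simulated data, and the function names are assumptions made for illustration.

```python
import numpy as np

def uncentered_r2_no_constant(y, x):
    """1 - e'e / y'y for the regression of y on a single regressor x with no constant."""
    b = (x @ y) / (x @ x)
    e = y - b * x
    return 1.0 - (e @ e) / (y @ y)

def centered_r2_with_constant(y, x):
    """1 - e'e / y'M0y for the regression of y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return 1.0 - (e @ e) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(4)             # arbitrary seed
x = rng.normal(loc=3.0, size=80)
y = 2.0 * x + rng.normal(size=80)
shift = 10.0                               # the same constant added to every observation on y

no_const_before = uncentered_r2_no_constant(y, x)
no_const_after = uncentered_r2_no_constant(y + shift, x)
const_before = centered_r2_with_constant(y, x)
const_after = centered_r2_with_constant(y + shift, x)
print(no_const_before, no_const_after)     # differ
print(const_before, const_after)           # agree
```

With a constant term in the regression the shift is absorbed by the intercept and the fit measure is unchanged; without one, both the residuals and the total sum of squares move.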

    3.6 SUMMARY AND CONCLUSIONS

    This chapter has described the purely algebraic exercise of fitting a line (hyperplane) to a set of points using the method of least squares. We considered the primary problem first, using a data set of $n$ observations on $K$ variables. We then examined several aspects of the solution, including the nature of the projection and residual maker matrices and several useful algebraic results relating to the computation of the residuals and their sum of squares. We also examined the difference between gross or simple regression and correlation and multiple regression by defining "partial regression coefficients" and "partial correlation coefficients." The Frisch-Waugh Theorem (3.3) is a fundamentally useful tool in regression analysis which enables us to obtain in closed form the expression for a subvector of a vector of regression coefficients. We examined several aspects of the partitioned regression, including how the fit of the regression model changes when variables are added to it or removed from it. Finally, we took a closer look at the conventional measure of how well the fitted regression line predicts or "fits" the data.

    Key Terms and Concepts

    Adjusted R-squared
    Analysis of variance
    Bivariate regression
    Coefficient of determination
    Disturbance
    Fitting criterion
    Frisch-Waugh theorem
    Goodness of fit
    Least squares
    Least squares normal equations
    Moment matrix
    Multiple correlation
    Multiple regression
    Netting out
    Normal equations
    Orthogonal regression
    Partial correlation coefficient
    Partial regression coefficient
    Partialing out
    Partitioned regression
    Prediction criterion
    Population quantity
    Population regression
    Projection
    Projection matrix
    Residual
    Residual maker
    Total variation

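    Exercises

    1. The Two Variable Regression. For the regression model $y = \alpha + \beta x + \varepsilon$,
       (a) Show that the least squares normal equations imply $\sum_i e_i = 0$ and $\sum_i x_i e_i = 0$.
       (b) Show that the solution for the constant term is $a = \bar{y} - b\bar{x}$.
       (c) Show that the solution for $b$ is $b = \left[\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\right] / \left[\sum_{i=1}^n (x_i - \bar{x})^2\right]$.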


       (d) Prove that these two values uniquely minimize the sum of squares by showing that the diagonal elements of the second derivatives matrix of the sum of squares with respect to the parameters are both positive and that the determinant is $4n\left[\left(\sum_{i=1}^n x_i^2\right) - n\bar{x}^2\right] = 4n\left[\sum_{i=1}^n (x_i - \bar{x})^2\right]$, which is positive unless all values of $x$ are the same.

    2. Change in the sum of squares. Suppose that $b$ is the least squares coefficient vector in the regression of $y$ on $X$ and that $c$ is any other $K \times 1$ vector. Prove that the difference in the two sums of squared residuals is

       $$(y - Xc)'(y - Xc) - (y - Xb)'(y - Xb) = (c - b)'X'X(c - b).$$

       Prove that this difference is positive.

    3. Linear Transformations of the data. Consider the least squares regression of $y$ on $K$ variables (with a constant) $X$. Consider an alternative set of regressors $Z = XP$, where $P$ is a nonsingular matrix. Thus, each column of $Z$ is a mixture of some of the columns of $X$. Prove that the residual vectors in the regressions of $y$ on $X$ and $y$ on $Z$ are identical. What relevance does this have to the question of changing the fit of a regression by changing the units of measurement of the independent variables?

    4. Partial Frisch and Waugh. In the least squares regression of $y$ on a constant and $X$, to compute the regression coefficients on $X$, we can first transform $y$ to deviations from the mean $\bar{y}$ and, likewise, transform each column of $X$ to deviations from the respective column mean; second, regress the transformed $y$ on the transformed $X$ without a constant. Do we get the same result if we only transform $y$? What if we only transform $X$?

    5. Residual makers. What is the result of the matrix product $M_1M$, where $M_1$ is defined in (3-19) and $M$ is defined in (3-14)?

    6. Adding an observation. A data set consists of $n$ observations on $X_n$ and $y_n$. The least squares estimator based on these $n$ observations is $b_n = (X_n'X_n)^{-1}X_n'y_n$. Another observation, $x_s$ and $y_s$, becomes available. Prove that the least squares estimator computed using this additional observation is

       $$b_{n,s} = b_n + \frac{1}{1 + x_s'(X_n'X_n)^{-1}x_s}\,(X_n'X_n)^{-1}x_s\,(y_s - x_s'b_n).$$

       Note that the last term is $e_s$, the residual from the prediction of $y_s$ using the coefficients based on $X_n$ and $b_n$. Conclude that the new data change the results of least squares only if the new observation on $y$ cannot be perfectly predicted using the information already in hand.

    7. Deleting an observation. A common strategy for handling a case in which an observation is missing data for one or more variables is to fill those missing variables with 0s and add a variable to the model that takes the value 1 for that one observation and 0 for all other observations. Show that this "strategy" is equivalent to discarding the observation as regards the computation of $b$, but it does have an effect on $R^2$. Consider the special case in which $X$ contains only a constant and one variable. Show that replacing missing values of $x$ with the mean of the complete observations has the same effect as adding the new variable.

    8. Demand system estimation. Let $Y$ denote total expenditure on consumer durables, nondurables, and services, and $E_d$, $E_n$, and $E_s$ are the expenditures on the three


       categories. As defined, $Y = E_d + E_n + E_s$. Now, consider the expenditure system

       $$E_d = \alpha_d + \beta_d Y + \gamma_{dd}P_d + \gamma_{dn}P_n + \gamma_{ds}P_s + \varepsilon_d,$$
       $$E_n = \alpha_n + \beta_n Y + \gamma_{nd}P_d + \gamma_{nn}P_n + \gamma_{ns}P_s + \varepsilon_n,$$
       $$E_s = \alpha_s + \beta_s Y + \gamma_{sd}P_d + \gamma_{sn}P_n + \gamma_{ss}P_s + \varepsilon_s.$$

       Prove that if all equations are estimated by ordinary least squares, then the sum of the expenditure coefficients will be 1 and the four other column sums in the preceding model will be zero.

    9. Change in adjusted $R^2$. Prove that the adjusted $R^2$ in (3-30) rises (falls) when variable $x_k$ is deleted from the regression if the square of the t ratio on $x_k$ in the multiple regression is less (greater) than 1.

    10. Regression without a constant. Suppose that you estimate a multiple regression first with, then without, a constant. Whether the $R^2$ is higher in the second case than the first will depend in part on how it is computed. Using the (relatively) standard method $R^2 = 1 - e'e/y'M^0y$, which regression will have a higher $R^2$?

    11. Three variables, $N$, $D$, and $Y$, all have zero means and unit variances. A fourth variable is $C = N + D$. In the regression of $C$ on $Y$, the slope is 0.8. In the regression of $C$ on $N$, the slope is 0.5. In the regression of $D$ on $Y$, the slope is 0.4. What is the sum of squared residuals in the regression of $C$ on $D$? There are 21 observations and all moments are computed using $1/(n-1)$ as the divisor.

    12. Using the matrices of sums of squares and cross products immediately preceding Section 3.2.3, compute the coefficients in the multiple regression of real investment on a constant, real GNP, and the interest rate. Compute $R^2$.

    13. In the December 1969 American Economic Review (pp. 886-896), Nathaniel Leff reports the following least squares regression results for a cross section study of the effect of age composition on savings in 74 countries in 1964:

       $$\ln S/Y = 7.3439 + 0.1596 \ln Y/N + 0.0254 \ln G - 1.3520 \ln D_1 - 0.3990 \ln D_2,$$
       $$\ln S/N = 8.7851 + 1.1486 \ln Y/N + 0.0265 \ln G - 1.3438 \ln D_1 - 0.3966 \ln D_2,$$

       where $S/Y$ = domestic savings ratio, $S/N$ = per capita savings, $Y/N$ = per capita income, $D_1$ = percentage of the population under 15, $D_2$ = percentage of the population over 64, and $G$ = growth rate of per capita income. Are these results correct? Explain.


    4

    FINITE-SAMPLE PROPERTIES OF THE LEAST SQUARES ESTIMATOR

    4.1 INTRODUCTION

    Chapter 3 treated fitting the linear regression to the data as a purely algebraic exercise. We will now examine the least squares estimator from a statistical viewpoint. This chapter will consider exact, finite-sample results such as unbiased estimation and the precise distributions of certain test statistics. Some of these results require fairly strong assumptions, such as nonstochastic regressors or normally distributed disturbances. In the next chapter, we will turn to the properties of the least squares estimator in more general cases. In these settings, we rely on approximations that do not hold as exact results but which do improve as the sample size increases.

    There are other candidates for estimating $\beta$. In a two-variable case, for example, we might use the intercept, $a$, and slope, $b$, of the line between the points with the largest and smallest values of $x$. Alternatively, we might find the $a$ and $b$ that minimize the sum of absolute values of the residuals. The question of which estimator to choose is usually based on the statistical properties of the candidates, such as unbiasedness, efficiency, and precision. These, in turn, frequently depend on the particular distribution that we assume produced the data. However, a number of desirable properties can be obtained for the least squares estimator even without specifying a particular distribution for the disturbances in the regression.

    In this chapter, we will examine in detail least squares as an estimator of the parameters of the classical model (defined in Table 4.1). We begin in Section 4.2 by returning to the question raised but not answered in Footnote 1, Chapter 3, that is, why least squares? We will then analyze the estimator in detail. We take Assumption A1, linearity of the model, as given, though in Section 4.2 we will consider briefly the possibility of a different predictor for $y$. Assumption A2, the identification condition that the data matrix have full rank, is considered in Section 4.9, where data complications that arise in practice are discussed. The near failure of this assumption is a recurrent problem in "real world" data. Section 4.3 is concerned with unbiased estimation. Assumption A3, that the disturbances and the independent variables are uncorrelated, is a pivotal result in this discussion. Assumption A4, homoscedasticity and nonautocorrelation of the disturbances, in contrast to A3, only has relevance to whether least squares is an optimal use of the data. As noted, there are alternative estimators available, but with Assumption A4, the least squares estimator is usually going to be preferable. Sections 4.4 and 4.5 present several statistical results for the least squares estimator that depend crucially on this assumption. The assumption that the data in $X$ are nonstochastic, known constants, has some implications for how certain derivations

    TABLE 4.1 Assumptions of the Classical Linear Regression Model

    A1. Linearity: $y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{iK}\beta_K + \varepsilon_i$.
    A2. Full rank: The $n \times K$ sample data matrix, $X$, has full column rank.
    A3. Exogeneity of the independent variables: $E[\varepsilon_i \mid x_{j1}, x_{j2}, \ldots, x_{jK}] = 0$, $i, j = 1, \ldots, n$. There is no correlation between the disturbances and the independent variables.
    A4. Homoscedasticity and nonautocorrelation: Each disturbance, $\varepsilon_i$, has the same finite variance, $\sigma^2$, and is uncorrelated with every other disturbance, $\varepsilon_j$.
    A5. Exogenously generated data: $(x_{i1}, x_{i2}, \ldots, x_{iK})$, $i = 1, \ldots, n$.
    A6. Normal distribution: The disturbances are normally distributed.

    proceed, but in practical terms, is a minor consideration. Indeed, nearly all that we do with the regression model departs from this assumption fairly quickly. It serves only as a useful departure point. The issue is considered in Section 4.5. Finally, the normality of the disturbances assumed in A6 is crucial in obtaining the sampling distributions of several useful statistics that are used in the analysis of the linear model. We note that in the course of our analysis of the linear model as we proceed through Chapter 9, all six of these assumptions will be discarded.

    4.2 MOTIVATING LEAST SQUARES

    Ease of computation is one reason that least squares is so popular. However, there are several other justifications for this technique. First, least squares is a natural approach to estimation, which makes explicit use of the structure of the model as laid out in the assumptions. Second, even if the true model is not a linear regression, the regression line fit by least squares is an optimal linear predictor for the dependent variable. Thus, it enjoys a sort of robustness that other estimators do not. Finally, under the very specific assumptions of the classical model, by one reasonable criterion, least squares will be the most efficient use of the data. We will consider each of these in turn.

    4.2.1 THE POPULATION ORTHOGONALITY CONDITIONS

    Let $x$ denote the vector of independent variables in the population regression model and, for the moment, based on Assumption A5, the data may be stochastic or nonstochastic. Assumption A3 states that the disturbances in the population are stochastically orthogonal to the independent variables in the model; that is, $E[\varepsilon \mid x] = 0$. It follows that $\mathrm{Cov}[x, \varepsilon] = 0$. Since (by the law of iterated expectations, Theorem B.1) $E_x\{E[\varepsilon \mid x]\} = E[\varepsilon] = 0$, we may write this as

    $$E_x E_\varepsilon[x\varepsilon] = E_x E_y[x(y - x'\beta)] = 0$$

    or

    $$E_x E_y[xy] = E_x[xx']\beta.$$    (4-1)

    (The right-hand side is not a function of $y$, so the expectation is taken only over $x$.) Now, recall the least squares normal equations, $X'y = X'Xb$. Divide this by $n$ and write it as


    a summation to obtain

    $$\left(\frac{1}{n}\sum_{i=1}^{n} x_i y_i\right) = \left(\frac{1}{n}\sum_{i=1}^{n} x_i x_i'\right) b.$$    (4-2)

    Equation (4-1) is a population relationship. Equation (4-2) is a sample analog. Assuming the conditions underlying the laws of large numbers presented in Appendix D are met, the sums on the left-hand and right-hand sides of (4-2) are estimators of their counterparts in (4-1). Thus, by using least squares, we are mimicking in the sample the relationship in the population. We'll return to this approach to estimation in Chapters 10 and 18 under the subject of GMM estimation.
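The sample analog in (4-2) can be computed literally, as averages of $x_i y_i$ and $x_i x_i'$, and then solved for $b$. The sketch below does this on simulated data and compares the result with a standard least squares routine; the data-generating values and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)            # arbitrary seed
n, K = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta = np.array([1.0, -2.0, 0.5])         # hypothetical population coefficients
y = X @ beta + rng.normal(size=n)

# Sample moments: (1/n) sum_i x_i y_i  and  (1/n) sum_i x_i x_i'
m_xy = sum(X[i] * y[i] for i in range(n)) / n
m_xx = sum(np.outer(X[i], X[i]) for i in range(n)) / n

# Solve the sample moment equation m_xy = m_xx b for b.
b_moments = np.linalg.solve(m_xx, m_xy)

# The same coefficients from a standard least squares solver.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b_moments, b_lstsq)
```

The explicit loop over observations is for exposition only; `X.T @ y / n` and `X.T @ X / n` give the same moments.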
    4.2.2 MINIMUM MEAN SQUARED ERROR PREDICTOR

    As an alternative approach, consider the problem of finding an optimal linear predictor for $y$. Once again, ignore Assumption A6 and, in addition, drop Assumption A1 that the conditional mean function, $E[y \mid x]$, is linear. For the criterion, we will use the mean squared error rule, so we seek the minimum mean squared error linear predictor of $y$, which we'll denote $x'\gamma$. The expected squared error of this predictor is

    $$\mathrm{MSE} = E_y E_x [y - x'\gamma]^2.$$

    This can be written as

    $$\mathrm{MSE} = E_{y,x}\{y - E[y \mid x]\}^2 + E_{y,x}\{E[y \mid x] - x'\gamma\}^2.$$

    We seek the $\gamma$ that minimizes this expectation. The first term is not a function of $\gamma$, so only the second term needs to be minimized. Note that this term is not a function of $y$, so the outer expectation is actually superfluous. But we will need it shortly, so we will carry it for the present. The necessary condition is

    $$\frac{\partial E_y E_x\{[E(y \mid x) - x'\gamma]^2\}}{\partial \gamma} = E_y E_x\left\{\frac{\partial [E(y \mid x) - x'\gamma]^2}{\partial \gamma}\right\} = -2E_y E_x\, x[E(y \mid x) - x'\gamma] = 0.$$

    Note that we have interchanged the operations of expectation and differentiation in the middle step, since the range of integration is not a function of $\gamma$. Finally, we have the equivalent condition

    $$E_y E_x[x E(y \mid x)] = E_y E_x[xx']\gamma.$$

    The left-hand side of this result is

    $$E_x E_y[x E(y \mid x)] = \mathrm{Cov}[x, E(y \mid x)] + E[x]E_x[E(y \mid x)] = \mathrm{Cov}[x, y] + E[x]E[y] = E_x E_y[xy].$$

    (We have used Theorem B.2.) Therefore, the necessary condition for finding the minimum MSE predictor is

    $$E_x E_y[xy] = E_x E_y[xx']\gamma.$$    (4-3)

    This is the same as (4-1), which takes us to the least squares condition once again. Assuming that these expectations exist, they would be estimated by the sums in (4-2), which means that regardless of the form of the conditional mean, least squares is an estimator of the coefficients of the minimum expected mean squared error linear predictor. We have yet to establish the conditions necessary for the if part of the


    theorem, but this is an opportune time to make it explicit:

    THEOREM 4.1 Minimum Mean Squared Error Predictor
    If the data generating mechanism generating $(x_i, y_i)$, $i = 1, \ldots, n$, is such that the law of large numbers applies to the estimators in (4-2) of the matrices in (4-1), then the minimum expected squared error linear predictor of $y_i$ is estimated by the least squares regression line.
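Theorem 4.1 can be illustrated with a conditional mean that is deliberately not linear. In this sketch the population is an assumption chosen so that the optimal linear predictor is known in closed form: $x$ is uniform on $(0, 1)$ and $E[y \mid x] = x^2$, so the population slope is $\mathrm{Cov}[x, x^2]/\mathrm{Var}[x] = (1/4 - 1/6)/(1/12) = 1$ and the intercept is $1/3 - 1/2 = -1/6$.

```python
import numpy as np

rng = np.random.default_rng(6)                   # arbitrary seed
n = 200_000
x = rng.uniform(0.0, 1.0, size=n)
y = x ** 2 + rng.normal(scale=0.1, size=n)       # E[y|x] = x^2 is not linear in x

# Least squares regression of y on a constant and x.
X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Coefficients of the minimum MSE linear predictor for this assumed population.
gamma = np.array([-1.0 / 6.0, 1.0])
print("least squares:", b)
print("population linear predictor:", gamma)
```

Least squares does not recover the conditional mean here, which is quadratic; with a large sample it approximates the coefficients of the best linear predictor.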

    4.2.3 MINIMUM VARIANCE LINEAR UNBIASED ESTIMATION

    Finally, consider the problem of finding a linear unbiased estimator. If we seek the one which has smallest variance, we will be led once again to least squares. This proposition will be proved in Section 4.4.

    The preceding does not assert that no other competing estimator would ever be preferable to least squares. We have restricted attention to linear estimators. The result immediately above precludes what might be an acceptably biased estimator. And, of course, the assumptions of the model might themselves not be valid. Although A5 and A6 are ultimately of minor consequence, the failure of any of the first four assumptions would make least squares much less attractive than we have suggested here.

    4.3 UNBIASED ESTIMATION

    The least squares estimator is unbiased in every sample. To show this, write

    $$b = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon.$$    (4-4)

    Now, take expectations, iterating over $X$:

    $$E[b \mid X] = \beta + E[(X'X)^{-1}X'\varepsilon \mid X].$$

    By Assumption A3, the second term is 0, so

    $$E[b \mid X] = \beta.$$

    Therefore,

    $$E[b] = E_X\{E[b \mid X]\} = E_X[\beta] = \beta.$$

    The interpretation of this result is that for any particular set of observations, $X$, the least squares estimator has expectation $\beta$. Therefore, when we average this over the possible values of $X$, we find the unconditional mean is $\beta$ as well.

    Example 4.1 The Sampling Distribution of a Least Squares Estimator

    The following sampling experiment, which can be replicated in any computer program that provides a random number generator and a means of drawing a random sample of observations from a master data set, shows the nature of a sampling distribution and the implication of unbiasedness. We drew two samples of 10,000 random draws on $w_i$ and $x_i$ from the standard


    FIGURE 4.1 Histogram for Sampled Least Squares Regression Slopes. [Figure not reproduced: vertical axis Frequency, from 0 to 100; horizontal axis b, from 0.338 to 0.688.]

    normal distribution (mean zero, variance 1). We then generated a set of $\varepsilon_i$s equal to $0.5w_i$ and $y_i = 0.5 + 0.5x_i + \varepsilon_i$. We take this to be our population. We then drew 500 random samples of 100 observations from this population, and with each one, computed the least squares slope (using at replication $r$, $b_r = \left[\sum_{j=1}^{100}(x_{jr} - \bar{x}_r)y_{jr}\right] / \left[\sum_{j=1}^{100}(x_{jr} - \bar{x}_r)^2\right]$). The histogram in Figure 4.1 shows the result of the experiment. Note that the distribution of slopes has a mean roughly equal to the "true value" of 0.5, and it has a substantial variance, reflecting the fact that the regression slope, like any other statistic computed from the sample, is a random variable. The concept of unbiasedness relates to the central tendency of this distribution of values obtained in repeated sampling from the population.
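The experiment in Example 4.1 can be replicated along the following lines. This is a sketch, not the original computation: the seed is arbitrary, and drawing each sample of 100 observations without replacement is an assumption, since the example does not say how the samples were drawn.

```python
import numpy as np

rng = np.random.default_rng(7)          # arbitrary seed; results vary with it
N, R, n = 10_000, 500, 100              # population size, replications, sample size

# The "population": 10,000 draws on w and x, with eps = 0.5 w and y = 0.5 + 0.5 x + eps.
w = rng.standard_normal(N)
x = rng.standard_normal(N)
eps = 0.5 * w
y = 0.5 + 0.5 * x + eps

slopes = np.empty(R)
for r in range(R):
    idx = rng.choice(N, size=n, replace=False)   # assumed sampling scheme
    xr, yr = x[idx], y[idx]
    dx = xr - xr.mean()
    slopes[r] = (dx @ yr) / (dx @ dx)            # b_r as defined in the example

print("mean of slopes:", slopes.mean())
print("std. dev. of slopes:", slopes.std(ddof=1))

# Counts for a histogram of the sampled slopes.
counts, edges = np.histogram(slopes, bins=8)
print(counts)
```

The mean of the 500 slopes should be close to 0.5, while the individual slopes scatter around it, which is the sense in which unbiasedness describes the center of the sampling distribution and not any single estimate.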

    4.4 THE VARIANCE OF THE LEAST SQUARES ESTIMATOR AND THE GAUSS-MARKOV THEOREM

    If the regressors can be treated as nonstochastic, as they would be in an experimental situation in which the analyst chooses the values in $X$, then the sampling variance of the least squares estimator can be derived by treating $X$ as a matrix of constants. Alternatively, we can allow $X$ to be stochastic, do the analysis conditionally on the observed $X$, then consider averaging over $X$ as we did in the preceding section. Using (4-4) again, we have

    $$b = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon.$$    (4-5)

    Since we can write $b = \beta + A\varepsilon$, where $A$ is $(X'X)^{-1}X'$, $b$ is a linear function of the disturbances, which by the definition we will use makes it a linear estimator. As we have


    seen, the expected value of the second term in (4-5) is 0. Therefore, regardless of the distribution of $\varepsilon$, under our other assumptions, $b$ is a linear, unbiased estimator of $\beta$. The covariance matrix of the least squares slope estimator is

    $$\begin{aligned}
    \mathrm{Var}[b \mid X] &= E[(b - \beta)(b - \beta)' \mid X] \\
    &= E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1} \mid X] \\
    &= (X'X)^{-1}X'E[\varepsilon\varepsilon' \mid X]X(X'X)^{-1} \\
    &= (X'X)^{-1}X'(\sigma^2 I)X(X'X)^{-1} \\
    &= \sigma^2(X'X)^{-1}.
    \end{aligned}$$

    Example 4.2 Sampling Variance in the Two-Variable Regression Model

    Suppose that $X$ contains only a constant term (column of 1s) and a single regressor $x$. The lower right element of $\sigma^2(X'X)^{-1}$ is

    $$\mathrm{Var}[b \mid x] = \mathrm{Var}[b - \beta \mid x] = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$

    Note, in particular, the denominator of the variance of $b$. The greater the variation in $x$, the smaller this variance. For example, consider the problem of estimating the slopes of the two regressions in Figure 4.2. A more precise result will be obtained for the data in the right-hand panel of the figure.
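The claim in Example 4.2 can be verified by forming $\sigma^2(X'X)^{-1}$ directly and comparing its lower right element with the closed form. The value of $\sigma^2$, the simulated regressor, and the helper name `slope_variance` below are illustrative assumptions.

```python
import numpy as np

def slope_variance(x, sigma2):
    """Lower right element of sigma^2 (X'X)^{-1} when X = [1, x]."""
    X = np.column_stack([np.ones(len(x)), x])
    V = sigma2 * np.linalg.inv(X.T @ X)
    return V[1, 1]

rng = np.random.default_rng(8)        # arbitrary seed
sigma2 = 2.0                          # assumed disturbance variance
x_narrow = rng.normal(size=40)
x_wide = 3.0 * x_narrow               # same pattern, three times the spread

v_narrow = slope_variance(x_narrow, sigma2)
v_wide = slope_variance(x_wide, sigma2)
closed_form = sigma2 / np.sum((x_narrow - x_narrow.mean()) ** 2)

print(v_narrow, closed_form)          # matrix element and closed form
print(v_wide, v_narrow / 9.0)         # tripling the spread of x divides the variance by 9
```

Tripling the spread of $x$ multiplies $\sum_i (x_i - \bar{x})^2$ by nine, so the slope variance falls by the same factor.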

    We will now obtain a general result for the class of linear unbiased estimators of $\beta$. Let $b_0 = Cy$ be another linear unbiased estimator of $\beta$, where $C$ is a $K \times n$ matrix. If $b_0$ is unbiased, then

    $$E[Cy \mid X] = E[(CX\beta + C\varepsilon) \mid X] = \beta,$$

    which implies that $CX = I$. There are many candidates. For example, consider using just the first $K$ (or, any $K$) linearly independent rows of $X$. Then $C = [X_0^{-1} : 0]$, where $X_0^{-1}$
    FIGURE 4.2 Effect of Increased Variation in x Given the Same Conditional and Overall Variation in y. [Figure not reproduced: two panels, each plotting y against x.]


    is the inverse of the matrix formed from the $K$ rows of $X$. The covariance matrix of $b_0$ can be found by replacing $(X'X)^{-1}X'$ with $C$ in (4-5); the result is $\mathrm{Var}[b_0 \mid X] = \sigma^2 CC'$. Now let $D = C - (X'X)^{-1}X'$, so $Dy = b_0 - b$. Then,

    $$\mathrm{Var}[b_0 \mid X] = \sigma^2\left[(D + (X'X)^{-1}X')(D + (X'X)^{-1}X')'\right].$$

    We know that $CX = I = DX + (X'X)^{-1}(X'X)$, so $DX$ must equal 0. Therefore,

    $$\mathrm{Var}[b_0 \mid X] = \sigma^2(X'X)^{-1} + \sigma^2 DD' = \mathrm{Var}[b \mid X] + \sigma^2 DD'.$$

    Since a quadratic form in $DD'$ is $q'DD'q = z'z \ge 0$, the conditional covariance matrix of $b_0$ equals that of $b$ plus a nonnegative definite matrix. Therefore, every quadratic form in $\mathrm{Var}[b_0 \mid X]$ is at least as large as the corresponding quadratic form in $\mathrm{Var}[b \mid X]$, which implies a very important property of the least squares coefficient vector.

    THEOREM 4.2 Gauss-Markov Theorem
    In the classical linear regression model with regressor matrix $X$, the least squares estimator $b$ is the minimum variance linear unbiased estimator of $\beta$. For any vector of constants $w$, the minimum variance linear unbiased estimator of $w'\beta$ in the classical regression model is $w'b$, where $b$ is the least squares estimator.

    The proof of the second statement follows from the previous derivation, since the variance of $w'b$ is a quadratic form in $\mathrm{Var}[b \mid X]$, and likewise for any $b_0$, and proves that each individual slope estimator $b_k$ is the best linear unbiased estimator of $\beta_k$. (Let $w$ be all zeros except for a one in the $k$th position.) The theorem is much broader than this, however, since the result also applies to every other linear combination of the elements of $\beta$.
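The comparison behind the Gauss-Markov theorem can be made concrete with the competing estimator that uses only the first $K$ observations, $b_0 = Cy$ with $C = [X_0^{-1} : 0]$, where $X_0$ holds the first $K$ rows of $X$. The sketch below uses a simulated regressor matrix and an assumed value of $\sigma^2$; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)                     # arbitrary seed
n, K, sigma2 = 30, 3, 1.5                          # sigma2 is an assumed disturbance variance
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])

XtX_inv = np.linalg.inv(X.T @ X)
A = XtX_inv @ X.T                                  # least squares: b = A y

# Competing linear unbiased estimator b0 = C y based on the first K rows of X.
C = np.hstack([np.linalg.inv(X[:K]), np.zeros((K, n - K))])
D = C - A

var_b = sigma2 * XtX_inv                           # Var[b | X]
var_b0 = sigma2 * (C @ C.T)                        # Var[b0 | X]
gap = var_b0 - var_b                               # should equal sigma2 * D D'

# Eigenvalues of the (symmetrized) difference; all should be nonnegative.
eigs = np.linalg.eigvalsh((gap + gap.T) / 2.0)
print(eigs)

# Variance of an arbitrary linear combination w'b versus w'b0.
w = np.array([0.0, 1.0, -1.0])
print(w @ var_b @ w, w @ var_b0 @ w)
```

Here $CX = I$ and $DX = 0$ hold by construction, and the difference between the two covariance matrices equals $\sigma^2 DD'$, so no linear combination of the elements of $b_0$ has a smaller variance than the same combination of the elements of $b$.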

    4.5 THE IMPLICATIONS OF STOCHASTIC REGRESSORS

    The preceding analysis is done conditionally on the observed data. A convenient method of obtaining the unconditional statistical properties of $b$ is to obtain the desired results conditioned on $X$ first, then find the unconditional result by "averaging" (e.g., by integrating over) the conditional distributions. The crux of the argument is that if we can establish unbiasedness conditionally on an arbitrary $X$, then we can average over $X$'s to obtain an unconditional result. We have already used this approach to show the unconditional unbiasedness of $b$ in Section 4.3, so we now turn to the conditional variance. The conditional variance of $b$ is

    $$\mathrm{Var}[b \mid X] = \sigma^2(X'X)^{-1}.$$


    For the exact variance, we use the decomposition of variance of (B-70):

    $$\mathrm{Var}[b] = E_X[\mathrm{Var}[b \mid X]] + \mathrm{Var}_X[E[b \mid X]].$$

    The second term is zero since $E[b \mid X] = \beta$ for all $X$, so

    $$\mathrm{Var}[b] = E_X[\sigma^2(X'X)^{-1}] = \sigma^2 E_X[(X'X)^{-1}].$$

    Our earlier conclusion is altered slightly. We must replace $(X'X)^{-1}$ with its expected value to get the appropriate covariance matrix, which brings a subtle change in the interpretation of these results. The unconditional variance of $b$ can only be described in terms of the average behavior of $X$, so to proceed further, it would be necessary to make some assumptions about the variances and covariances of the regressors. We will return to this subject in Chapter 5.

    We showed in Section 4.4 that

    $$\mathrm{Var}[b \mid X] \le \mathrm{Var}[b_0 \mid X]$$

    for any $b_0 \ne b$ and for the specific $X$ in our sample. But if this inequality holds for every particular $X$, then it must hold for

    $$\mathrm{Var}[b] = E_X[\mathrm{Var}[b \mid X]].$$

    That is, if it holds for every particular $X$, then it must hold over the average value(s) of $X$. The conclusion, therefore, is that the important results we have obtained thus far for the least squares estimator, unbiasedness and the Gauss-Markov theorem, hold whether or not we regard $X$ as stochastic.

    THEOREM 4.3 Gauss-Markov Theorem (Concluded)
    In the classical linear regression model, the least squares estimator $\mathbf{b}$ is the minimum variance linear unbiased estimator of $\boldsymbol{\beta}$ whether $\mathbf{X}$ is stochastic or nonstochastic, so long as the other assumptions of the model continue to hold.

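    4.6 ESTIMATING THE VARIANCE OF THE LEAST SQUARES ESTIMATOR

    If we wish to test hypotheses about $\boldsymbol{\beta}$ or to form confidence intervals, then we will require a sample estimate of the covariance matrix $\operatorname{Var}[\mathbf{b} \mid \mathbf{X}] = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$. The population parameter $\sigma^2$ remains to be estimated. Since $\sigma^2$ is the expected value of $\varepsilon_i^2$ and $e_i$ is an estimate of $\varepsilon_i$, by analogy,

    $$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} e_i^2$$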


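    would seem to be a natural estimator. But the least squares residuals are imperfect estimates of their population counterparts: $e_i = y_i - \mathbf{x}_i'\mathbf{b} = \varepsilon_i - \mathbf{x}_i'(\mathbf{b} - \boldsymbol{\beta})$. The estimator is distorted (as might be expected) because $\boldsymbol{\beta}$ is not observed directly. The expected square on the right-hand side involves a second term that might not have expected value zero.

    The least squares residuals are

    $$\mathbf{e} = \mathbf{M}\mathbf{y} = \mathbf{M}[\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}] = \mathbf{M}\boldsymbol{\varepsilon},$$

    as $\mathbf{M}\mathbf{X} = \mathbf{0}$. [See (3-15).] An estimator of $\sigma^2$ will be based on the sum of squared residuals:

    $$\mathbf{e}'\mathbf{e} = \boldsymbol{\varepsilon}'\mathbf{M}\boldsymbol{\varepsilon}. \tag{4-6}$$

    The expected value of this quadratic form is

    $$E[\mathbf{e}'\mathbf{e} \mid \mathbf{X}] = E[\boldsymbol{\varepsilon}'\mathbf{M}\boldsymbol{\varepsilon} \mid \mathbf{X}].$$

    The scalar $\boldsymbol{\varepsilon}'\mathbf{M}\boldsymbol{\varepsilon}$ is a $1 \times 1$ matrix, so it is equal to its trace. By using the result on cyclic permutations (A-94),

    $$E[\operatorname{tr}(\boldsymbol{\varepsilon}'\mathbf{M}\boldsymbol{\varepsilon}) \mid \mathbf{X}] = E[\operatorname{tr}(\mathbf{M}\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}') \mid \mathbf{X}].$$

    Since $\mathbf{M}$ is a function of $\mathbf{X}$, the result is

    $$\operatorname{tr}\bigl(\mathbf{M}E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}' \mid \mathbf{X}]\bigr) = \operatorname{tr}(\mathbf{M}\sigma^2\mathbf{I}) = \sigma^2\operatorname{tr}(\mathbf{M}).$$

    The trace of $\mathbf{M}$ is

    $$\operatorname{tr}[\mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'] = \operatorname{tr}(\mathbf{I}_n) - \operatorname{tr}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}] = \operatorname{tr}(\mathbf{I}_n) - \operatorname{tr}(\mathbf{I}_K) = n - K.$$

    Therefore, $E[\mathbf{e}'\mathbf{e} \mid \mathbf{X}] = (n - K)\sigma^2$, so the natural estimator is biased toward zero, although the bias becomes smaller as the sample size increases. An unbiased estimator of $\sigma^2$ is

    $$s^2 = \frac{\mathbf{e}'\mathbf{e}}{n - K}. \tag{4-7}$$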

    The estimator is unbiased unconditionally as well, since $E[s^2] = E_{\mathbf{X}}\bigl\{E[s^2 \mid \mathbf{X}]\bigr\} = E_{\mathbf{X}}[\sigma^2] = \sigma^2$. The standard error of the regression is $s$, the square root of $s^2$. With $s^2$, we can then compute

    $$\text{Est.Var}[\mathbf{b} \mid \mathbf{X}] = s^2(\mathbf{X}'\mathbf{X})^{-1}.$$

    Henceforth, we shall use the notation Est.Var[·] to indicate a sample estimate of the sampling variance of an estimator. The square root of the $k$th diagonal element of this matrix, $\bigl\{[s^2(\mathbf{X}'\mathbf{X})^{-1}]_{kk}\bigr\}^{1/2}$, is the standard error of the estimator $b_k$, which is often denoted simply the standard error of $b_k$.
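    The computations in this section can be sketched numerically. The following is a minimal illustration, assuming NumPy and a small simulated data set; the function name, the seed, and the parameter values are hypothetical and are not taken from the text. It computes $s^2$ as in (4-7), the estimated covariance matrix $s^2(\mathbf{X}'\mathbf{X})^{-1}$, the standard errors, and the trace of $\mathbf{M}$.

```python
import numpy as np

def ols_with_standard_errors(X, y):
    """Least squares coefficients, s^2, Est.Var[b|X], and standard errors."""
    n, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                   # least squares coefficient vector
    e = y - X @ b                           # least squares residuals, e = My
    s2 = (e @ e) / (n - K)                  # unbiased estimator of sigma^2, (4-7)
    est_var_b = s2 * XtX_inv                # Est.Var[b|X] = s^2 (X'X)^{-1}
    std_err = np.sqrt(np.diag(est_var_b))   # standard errors of the b_k
    return b, s2, est_var_b, std_err

# Hypothetical simulated data, used only for illustration.
rng = np.random.default_rng(0)
n, K = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(size=n)

b, s2, est_var_b, std_err = ols_with_standard_errors(X, y)

# The residual maker M = I - X(X'X)^{-1}X' has trace n - K and satisfies MX = 0.
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T
trace_M = np.trace(M)
```

    Forming $\mathbf{M}$ explicitly is done here only to check the trace result; in practice the residuals would be computed directly as $\mathbf{y} - \mathbf{X}\mathbf{b}$.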


    4.7 THE NORMALITY ASSUMPTION AND BASIC STATISTICAL INFERENCE

    To this point, our specification and analysis of the regression model is semiparametric (see Section 16.3). We have not used Assumption A6 (see Table 4.1), normality of $\boldsymbol{\varepsilon}$, in any of our results. The assumption is useful for constructing statistics for testing hypotheses. In (4-5), $\mathbf{b}$ is a linear function of the disturbance vector $\boldsymbol{\varepsilon}$. If we assume that $\boldsymbol{\varepsilon}$ has a multivariate normal distribution, then we may use the results of Section B.10.2 and the mean vector and covariance matrix derived earlier to state that

    $$\mathbf{b} \mid \mathbf{X} \sim N[\boldsymbol{\beta}, \sigma^2(\mathbf{X}'\mathbf{X})^{-1}]. \tag{4-8}$$

    This specifies a multivariate normal distribution, so each element of $\mathbf{b} \mid \mathbf{X}$ is normally distributed:

    $$b_k \mid \mathbf{X} \sim N\bigl[\beta_k, \sigma^2(\mathbf{X}'\mathbf{X})^{-1}_{kk}\bigr]. \tag{4-9}$$

    The distribution of $\mathbf{b}$ is conditioned on $\mathbf{X}$. The normal distribution of $\mathbf{b}$ in a finite sample is a consequence of our specific assumption of normally distributed disturbances. Without this assumption, and without some alternative specific assumption about the distribution of $\boldsymbol{\varepsilon}$, we will not be able to make any definite statement about the exact distribution of $\mathbf{b}$, conditional or otherwise. In an interesting result that we will explore at length in Chapter 5, we will be able to obtain an approximate normal distribution for $\mathbf{b}$, with or without assuming normally distributed disturbances and whether the regressors are stochastic or not.

    4.7.1 TESTING A HYPOTHESIS ABOUT A COEFFICIENT

    Let $S^{kk}$ be the $k$th diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$. Then, assuming normality,

    $$z_k = \frac{b_k - \beta_k}{\sqrt{\sigma^2 S^{kk}}} \tag{4-10}$$

    has a standard normal distribution. If $\sigma^2$ were known, then statistical inference about $\beta_k$ could be based on $z_k$. By using $s^2$ instead of $\sigma^2$, we can derive a statistic to use in place of $z_k$ in (4-10). The quantity

    $$\frac{(n - K)s^2}{\sigma^2} = \frac{\mathbf{e}'\mathbf{e}}{\sigma^2} = \left(\frac{\boldsymbol{\varepsilon}}{\sigma}\right)'\mathbf{M}\left(\frac{\boldsymbol{\varepsilon}}{\sigma}\right) \tag{4-11}$$

    is an idempotent quadratic form in a standard normal vector $(\boldsymbol{\varepsilon}/\sigma)$. Therefore, it has a chi-squared distribution with rank$(\mathbf{M})$ = trace$(\mathbf{M})$ = $n - K$ degrees of freedom.¹ The chi-squared variable in (4-11) is independent of the standard normal variable in (4-10). To prove this, it suffices to show that

    $$\frac{\mathbf{b} - \boldsymbol{\beta}}{\sigma} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\left(\frac{\boldsymbol{\varepsilon}}{\sigma}\right) \tag{4-12}$$

    is independent of $(n - K)s^2/\sigma^2$. In Section B.11.7 (Theorem B.12), we found that a sufficient condition for the independence of a linear form $\mathbf{L}\mathbf{x}$ and an idempotent quadratic
    ¹ This fact is proved in Section B.10.3.


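    form $\mathbf{x}'\mathbf{A}\mathbf{x}$ in a standard normal vector $\mathbf{x}$ is that $\mathbf{L}\mathbf{A} = \mathbf{0}$. Letting $\boldsymbol{\varepsilon}/\sigma$ be the $\mathbf{x}$, we find that the requirement here would be that $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{M} = \mathbf{0}$. It does, as seen in (3-15). The general result is central in the derivation of many test statistics in regression analysis.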

    THEOREM 4.4 Independence of b and s²
    If $\boldsymbol{\varepsilon}$ is normally distributed, then the least squares coefficient estimator $\mathbf{b}$ is statistically independent of the residual vector $\mathbf{e}$ and therefore, all functions of $\mathbf{e}$, including $s^2$.

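    Therefore, the ratio

    $$t_k = \frac{(b_k - \beta_k)\big/\sqrt{\sigma^2 S^{kk}}}{\sqrt{[(n - K)s^2/\sigma^2]\big/(n - K)}} = \frac{b_k - \beta_k}{\sqrt{s^2 S^{kk}}} \tag{4-13}$$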

    has a $t$ distribution with $(n - K)$ degrees of freedom.² We can use $t_k$ to test hypotheses or form confidence intervals about the individual elements of $\boldsymbol{\beta}$. A common test is whether a parameter $\beta_k$ is significantly different from zero. The appropriate test statistic

    $$t = \frac{b_k}{s_{b_k}} \tag{4-14}$$

    is presented as standard output with the other results by most computer programs. The test is done in the usual way. This statistic is usually labeled the $t$ ratio for the estimator $b_k$. If $|b_k|/s_{b_k} > t_{\alpha/2}$, where $t_{\alpha/2}$ is the $100(1 - \alpha/2)$ percent critical value from the $t$ distribution with $(n - K)$ degrees of freedom, then the hypothesis is rejected and the coefficient is said to be statistically significant. The value of 1.96, which would apply for the 5 percent significance level in a large sample, is often used as a benchmark value when a table of critical values is not immediately available. The $t$ ratio for the test of the hypothesis that a coefficient equals zero is a standard part of the regression output of most computer programs.
    Example 4.3 Earnings Equation

    Appendix Table F4.1 contains 753 observations used in Mroz's (1987) study of labor supply behavior of married women. We will use these data at several points below. Of the 753 individuals in the sample, 428 were participants in the formal labor market. For these individuals, we will fit a semilog earnings equation of the form suggested in Example 2.2:

    $$\ln \text{earnings} = \beta_1 + \beta_2\,\text{age} + \beta_3\,\text{age}^2 + \beta_4\,\text{education} + \beta_5\,\text{kids} + \varepsilon,$$

    where earnings is hourly wage times hours worked, education is measured in years of schooling, and kids is a binary variable which equals one if there are children under 18 in the household. (See the data description in Appendix F for details.) Regression results are shown in Table 4.2. There are 428 observations and 5 parameters, so the t statistics have 423 degrees
    ² See (B-36) in Section B.4.2. It is the ratio of a standard normal variable to the square root of a chi-squared variable divided by its degrees of freedom.


    TABLE 4.2 Regression Results for an Earnings Equation

    Sum of squared residuals:            599.4582
    Standard error of the regression:    1.19044
    R^2 based on 428 observations:       0.040995

    Variable      Coefficient    Standard Error    t Ratio
    Constant       3.24009       1.7674             1.833
    Age            0.20056       0.08386            2.392
    Age^2         -0.0023147     0.00098688        -2.345
    Education      0.067472      0.025248           2.672
    Kids          -0.35119       0.14753           -2.380

    Estimated Covariance Matrix for b (e-n = times 10^-n)

                  Constant      Age           Age^2         Education     Kids
    Constant      3.12381
    Age           0.14409       0.0070325
    Age^2         0.0016617     8.23237e-5    9.73928e-7
    Education     0.0092609     5.08549e-5    4.96761e-7    0.00063729
    Kids          0.026749      0.0026412     3.84102e-5    5.46193e-5    0.021766

    of freedom. For 95 percent significance levels, the standard normal value of 1.96 is appropriate when the degrees of freedom are this large. By this measure, all variables are statistically significant and signs are consistent with expectations. It will be interesting to investigate whether the effect of Kids is on the wage or hours, or both. We interpret the schooling variable to imply that an additional year of schooling is associated with a 6.7 percent increase in earnings. The quadratic age profile suggests that for a given education level and family size, earnings rise to the peak at $-b_2/(2b_3)$, which is about 43 years of age, at which they begin to decline. Some points to note: (1) Our selection of only those individuals who had positive hours worked is not an innocent sample selection mechanism. Since individuals chose whether or not to be in the labor force, it is likely (almost certain) that earnings potential was a significant factor, along with some other aspects we will consider in Chapter 22. (2) The earnings equation is a mixture of a labor supply equation (hours worked by the individual) and a labor demand outcome (the wage is, presumably, an accepted offer). As such, it is unclear what the precise nature of this equation is. Presumably, it is a hash of the equations of an elaborate structural equation system.
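    As a check on the arithmetic in Example 4.3, the t ratios and the age at which earnings peak can be recomputed from the reported coefficients and standard errors. This is a small sketch in plain Python. The negative signs on the Age^2 and Kids coefficients are an assumption: the Age^2 sign is implied by the statement that earnings rise to a peak and then decline, and the Kids sign is assumed here. The dictionary and variable names are illustrative only.

```python
# Coefficients and standard errors as reported in Table 4.2.
# The negative signs on Age^2 and Kids are assumptions (see the note above).
coef = {
    "Constant": 3.24009,
    "Age": 0.20056,
    "Age2": -0.0023147,
    "Education": 0.067472,
    "Kids": -0.35119,
}
std_err = {
    "Constant": 1.7674,
    "Age": 0.08386,
    "Age2": 0.00098688,
    "Education": 0.025248,
    "Kids": 0.14753,
}

# t ratio for the hypothesis that each coefficient equals zero, as in (4-14).
t_ratio = {name: coef[name] / std_err[name] for name in coef}

# Compare each |t| with the large-sample benchmark value of 1.96.
exceeds_benchmark = {name: abs(t) > 1.96 for name, t in t_ratio.items()}

# Age at which the quadratic age profile peaks: -b2 / (2 * b3).
peak_age = -coef["Age"] / (2.0 * coef["Age2"])

for name in coef:
    print(f"{name:10s} t = {t_ratio[name]:7.3f}  |t| > 1.96: {exceeds_benchmark[name]}")
print(f"peak age = {peak_age:.2f}")
```

    The four slope coefficients exceed the 1.96 benchmark in absolute value; the constant term, with a t ratio of about 1.83, does not.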

    4.7.2 CONFIDENCE INTERVALS FOR PARAMETERS

    A confidence interval for $\beta_k$ would be based on (4-13). We could say that

    $$\operatorname{Prob}\bigl(b_k - t_{\alpha/2}\,s_{b_k} \le \beta_k \le b_k + t_{\alpha/2}\,s_{b_k}\bigr) = 1 - \alpha,$$

    where $1 - \alpha$ is the desired level of confidence and $t_{\alpha/2}$ is the appropriate critical value from the $t$ distribution with $(n - K)$ degrees of freedom.

    Example 4.4 Confidence Interval for the Income Elasticity of Demand for Gasoline

    Using the gasoline market data discussed in Example 2.3, we estimated the following demand equation using the 36 observations. Estimated standard errors, computed as shown above,


    are given in parentheses below the least squares estimates:

    $$\ln(G/\text{pop}) = \underset{(0.6749)}{-7.737} \; \underset{(0.03248)}{-\,0.05910}\,\ln P_G \; + \underset{(0.075628)}{1.3733}\,\ln \text{income} \; \underset{(0.12699)}{-\,0.12680}\,\ln P_{nc} \; \underset{(0.081337)}{-\,0.11871}\,\ln P_{uc} + e.$$

    To form a confidence interval for the income elasticity, we need the critical value from the $t$ distribution with $n - K = 36 - 5 = 31$ degrees of freedom. The 95 percent critical value is 2.040. Therefore, a 95 percent confidence interval for $\beta_I$ is $1.3733 \pm 2.040(0.075628)$, or $[1.2191, 1.5276]$.

    We are interested in whether the demand for gasoline is income inelastic. The hypothesis to be tested is that $\beta_I$ is less than 1. For a one-sided test, we adjust the critical region and use the $t_\alpha$ critical point from the distribution. Values of the sample estimate that are greatly inconsistent with the hypothesis cast doubt upon it. Consider testing the hypothesis

    $$H_0: \beta_I < 1 \quad \text{versus} \quad H_1: \beta_I \ge 1.$$

    The appropriate test statistic is

    $$t = \frac{1.3733 - 1}{0.075628} = 4.936.$$

    The critical value from the $t$ distribution with 31 degrees of freedom is 2.04, which is far less than 4.936. We conclude that the data are not consistent with the hypothesis that the income elasticity is less than 1, so we reject the hypothesis.
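    The interval and test statistic in Example 4.4 involve only a few lines of arithmetic, reproduced here in plain Python as a check. The critical value 2.040 is taken from the text rather than computed; the variable names are illustrative.

```python
# Values reported in Example 4.4.
b_income = 1.3733        # estimated income elasticity
se_income = 0.075628     # its estimated standard error
t_crit = 2.040           # 95 percent critical value for t[31], as given in the text

# Confidence interval: b_k -/+ t_{alpha/2} * s_{b_k}.
half_width = t_crit * se_income
lower = b_income - half_width
upper = b_income + half_width

# t statistic for a hypothesized income elasticity of 1.
t_stat = (b_income - 1.0) / se_income

print(f"95% interval: [{lower:.4f}, {upper:.4f}]")
print(f"t statistic:  {t_stat:.3f}")
```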
    4.7.3 CONFIDENCE INTERVAL FOR A LINEAR COMBINATION OF COEFFICIENTS: THE OAXACA DECOMPOSITION

    With normally distributed disturbances, the least squares coefficient estimator, $\mathbf{b}$, is normally distributed with mean $\boldsymbol{\beta}$ and covariance matrix $\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$. In Example 4.4, we showed how to use this result to form a confidence interval for one of the elements of $\boldsymbol{\beta}$. By extending those results, we can show how to form a confidence interval for a linear function of the parameters. Oaxaca's (1973) decomposition provides a frequently used application.

    Let $\mathbf{w}$ denote a $K \times 1$ vector of known constants. Then, the linear combination $c = \mathbf{w}'\mathbf{b}$ is normally distributed with mean $\gamma = \mathbf{w}'\boldsymbol{\beta}$ and variance $\sigma_c^2 = \mathbf{w}'[\sigma^2(\mathbf{X}'\mathbf{X})^{-1}]\mathbf{w}$, which we estimate with $s_c^2 = \mathbf{w}'[s^2(\mathbf{X}'\mathbf{X})^{-1}]\mathbf{w}$. With these in hand, we can use the earlier results to form a confidence interval for $\gamma$:

    $$\operatorname{Prob}[c - t_{\alpha/2}\,s_c \le \gamma \le c + t_{\alpha/2}\,s_c] = 1 - \alpha.$$

    This general result can be used, for example, for the sum of the coefficients or for a difference.

    Consider, then, Oaxaca's application. In a study of labor supply, separate wage regressions are fit for samples of $n_m$ men and $n_f$ women. The underlying regression models are

    $$\ln \text{wage}_{m,i} = \mathbf{x}_{m,i}'\boldsymbol{\beta}_m + \varepsilon_{m,i}, \quad i = 1, \ldots, n_m$$

    and

    $$\ln \text{wage}_{f,j} = \mathbf{x}_{f,j}'\boldsymbol{\beta}_f + \varepsilon_{f,j}, \quad j = 1, \ldots, n_f.$$


    The regressor vectors include sociodemographic variables, such as age, and human capital variables, such as education and experience. We are interested in comparing these two regressions, particularly to see if they suggest wage discrimination. Oaxaca suggested a comparison of the regression functions. For any two vectors of characteristics,

    $$\begin{aligned} E[\ln \text{wage}_{m,i}] - E[\ln \text{wage}_{f,j}] &= \mathbf{x}_{m,i}'\boldsymbol{\beta}_m - \mathbf{x}_{f,j}'\boldsymbol{\beta}_f \\ &= \mathbf{x}_{m,i}'\boldsymbol{\beta}_m - \mathbf{x}_{m,i}'\boldsymbol{\beta}_f + \mathbf{x}_{m,i}'\boldsymbol{\beta}_f - \mathbf{x}_{f,j}'\boldsymbol{\beta}_f \\ &= \mathbf{x}_{m,i}'(\boldsymbol{\beta}_m - \boldsymbol{\beta}_f) + (\mathbf{x}_{m,i} - \mathbf{x}_{f,j})'\boldsymbol{\beta}_f. \end{aligned}$$

    The second term in this decomposition is identified with differences in human capital that would explain wage differences naturally, assuming that labor markets respond to these differences in ways that we would expect. The first term shows the differential in log wages that is attributable to differences unexplainable by human capital; holding these factors constant at $\mathbf{x}_m$ makes the first term attributable to other factors. Oaxaca suggested that this decomposition be computed at the means of the two regressor vectors, $\bar{\mathbf{x}}_m$ and $\bar{\mathbf{x}}_f$, and the least squares coefficient vectors, $\mathbf{b}_m$ and $\mathbf{b}_f$. If the regressions contain constant terms, then this process will be equivalent to analyzing $\overline{\ln y}_m - \overline{\ln y}_f$.

    We are interested in forming a confidence interval for the first term, which will require two applications of our result. We will treat the two vectors of sample means as known vectors. Assuming that we have two independent sets of observations, our two estimators, $\mathbf{b}_m$ and $\mathbf{b}_f$, are independent with means $\boldsymbol{\beta}_m$ and $\boldsymbol{\beta}_f$ and covariance matrices $\sigma_m^2(\mathbf{X}_m'\mathbf{X}_m)^{-1}$ and $\sigma_f^2(\mathbf{X}_f'\mathbf{X}_f)^{-1}$. The covariance matrix of the difference is the sum of these two matrices. We are forming a confidence interval for $\bar{\mathbf{x}}_m'\mathbf{d}$, where $\mathbf{d} = \mathbf{b}_m - \mathbf{b}_f$. The estimated covariance matrix is

    $$\text{Est.Var}[\mathbf{d}] = s_m^2(\mathbf{X}_m'\mathbf{X}_m)^{-1} + s_f^2(\mathbf{X}_f'\mathbf{X}_f)^{-1}.$$

    Now, we can apply the result above. We can also form a confidence interval for the second term; just define $\mathbf{w} = \bar{\mathbf{x}}_m - \bar{\mathbf{x}}_f$ and apply the earlier result to $\mathbf{w}'\mathbf{b}_f$.
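    The steps of the decomposition can be sketched as follows. This is an illustration only, assuming NumPy and simulated data; the function names, the simulated regressors, and the coefficient values are hypothetical. The critical value is passed in by the caller (1.96 is used below as a large-sample benchmark), since the text does not specify the degrees of freedom to use for the difference of two regressions.

```python
import numpy as np

def fit(X, y):
    """Least squares coefficients and Est.Var[b|X] = s^2 (X'X)^{-1}."""
    n, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    s2 = (e @ e) / (n - K)
    return b, s2 * XtX_inv

def oaxaca(Xm, ym, Xf, yf, t_crit):
    """Oaxaca decomposition at the sample means, with an interval for the first term."""
    bm, Vm = fit(Xm, ym)
    bf, Vf = fit(Xf, yf)
    xbar_m, xbar_f = Xm.mean(axis=0), Xf.mean(axis=0)
    d = bm - bf                       # d = b_m - b_f
    V_d = Vm + Vf                     # independent samples: covariances add
    first = xbar_m @ d                # xbar_m'(b_m - b_f)
    s_c = np.sqrt(xbar_m @ V_d @ xbar_m)
    second = (xbar_m - xbar_f) @ bf   # (xbar_m - xbar_f)'b_f
    return {
        "first_term": first,
        "first_term_interval": (first - t_crit * s_c, first + t_crit * s_c),
        "second_term": second,
    }

# Hypothetical simulated samples: constant, years of education, years of experience.
rng = np.random.default_rng(1)

def simulate(n, beta):
    X = np.column_stack([np.ones(n), rng.normal(12, 2, n), rng.normal(20, 5, n)])
    y = X @ beta + rng.normal(scale=0.3, size=n)
    return X, y

Xm, ym = simulate(300, np.array([1.0, 0.08, 0.02]))
Xf, yf = simulate(250, np.array([0.8, 0.08, 0.015]))
result = oaxaca(Xm, ym, Xf, yf, t_crit=1.96)
```

    Because both regressions contain a constant term, the two terms sum to the difference in the sample means of the dependent variables, which provides a quick check on the computation.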
    4.7.4 TESTING THE SIGNIFICANCE OF THE REGRESSION

    A question that is usually of interest is whether the regression equation as a whole is significant. This test is a joint test of the hypotheses that all the coefficients except the constant term are zero. If all the slopes are zero, then the multiple correlation coefficient is zero as well, so we can base a test of this hypothesis on the value of $R^2$. The central result needed to carry out the test is the distribution of the statistic

    $$F[K - 1, n - K] = \frac{R^2/(K - 1)}{(1 - R^2)/(n - K)}. \tag{4-15}$$

    If the hypothesis that $\boldsymbol{\beta}_2 = \mathbf{0}$ (the part of $\boldsymbol{\beta}$ not including the constant) is true and the disturbances are normally distributed, then this statistic has an $F$ distribution with $K - 1$ and $n - K$ degrees of freedom.³ Large values of $F$ give evidence against the validity of the hypothesis. Note that a large $F$ is induced by a large value of $R^2$. The logic of the test is that the $F$ statistic is a measure of the loss of fit (namely, all of $R^2$) that results when we impose the restriction that all the slopes are zero. If $F$ is large, then the hypothesis is rejected.
    ³ The proof of the distributional result appears in Section 6.3.1. The $F$ statistic given above is the special case in which $\mathbf{R} = [\mathbf{0} : \mathbf{I}_{K-1}]$.

    Example 4.5 F Test for the Earnings Equation

    The $F$ ratio for testing the hypothesis that the four slopes in the earnings equation are all zero is

    $$F[4, 423] = \frac{0.040995/4}{(1 - 0.040995)/(428 - 5)} = 4.521,$$

    which is far larger than the 95 percent critical value of 2.37. We conclude that the data are inconsistent with the hypothesis that all the slopes in the earnings equation are zero.

    We might have expected the preceding result, given the substantial t ratios presented earlier. But this case need not always be true. Examples can be constructed in which the individual coefficients are statistically significant, while jointly they are not. This case can be regarded as pathological, but the opposite one, in which none of the coefficients is significantly different from zero while $R^2$ is highly significant, is relatively common. The problem is that the interaction among the variables may serve to obscure their individual contribution to the fit of the regression, whereas their joint effect may still be significant. We will return to this point in Section 4.9.1 in our discussion of multicollinearity.
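    The F ratio in Example 4.5 follows directly from (4-15). A short check in plain Python, using the $R^2$, sample size, and number of parameters reported for the earnings equation (the variable names are illustrative):

```python
# Values reported for the earnings equation.
r_squared = 0.040995
n_obs = 428
n_params = 5      # K, including the constant term

# F[K-1, n-K] = [R^2 / (K-1)] / [(1 - R^2) / (n-K)], as in (4-15).
numerator = r_squared / (n_params - 1)
denominator = (1.0 - r_squared) / (n_obs - n_params)
f_stat = numerator / denominator

critical_value = 2.37   # 95 percent critical value quoted in the text
reject = f_stat > critical_value

print(f"F[{n_params - 1}, {n_obs - n_params}] = {f_stat:.3f}, reject: {reject}")
```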
    4.7.5 MARGINAL DISTRIBUTIONS OF THE TEST STATISTICS

    We now consider the relation between the sample test statistics and the data in $\mathbf{X}$. First, consider the conventional $t$ statistic in (4-14) for testing $H_0: \beta_k = \beta_k^0$,

    $$t \mid \mathbf{X} = \frac{b_k - \beta_k^0}{\bigl[s^2(\mathbf{X}'\mathbf{X})^{-1}_{kk}\bigr]^{1/2}}.$$



    Conditional on $\mathbf{X}$, if $\beta_k = \beta_k^0$ (i.e., under $H_0$), then $t \mid \mathbf{X}$ has a $t$ distribution with $(n - K)$ degrees of freedom. What interests us, however, is the marginal, that is, the unconditional, distribution of $t$. As we saw, $\mathbf{b}$ is only normally distributed conditionally on $\mathbf{X}$; the marginal distribution may not be normal because it depends on $\mathbf{X}$ (through the conditional variance). Similarly, because of the presence of $\mathbf{X}$, the denominator of the $t$ statistic is not the square root of a chi-squared variable divided by its degrees of freedom, again, except conditional on this $\mathbf{X}$. But, because the distributions of $\bigl\{(b_k - \beta_k)/[\sigma^2(\mathbf{X}'\mathbf{X})^{-1}_{kk}]^{1/2}\bigr\} \mid \mathbf{X}$ and $[(n - K)s^2/\sigma^2] \mid \mathbf{X}$ are still independent $N[0, 1]$ and $\chi^2[n - K]$, respectively, which do not involve $\mathbf{X}$, we have the surprising result that, regardless of the distribution of $\mathbf{X}$, or even of whether $\mathbf{X}$ is stochastic or nonstochastic, the marginal distribution of $t$ is still $t$, even though the marginal distribution of $b_k$ may be nonnormal. This intriguing result follows because $f(t \mid \mathbf{X})$ is not a function of $\mathbf{X}$. The same reasoning can be used to deduce that the usual $F$ ratio used for testing linear restrictions is valid whether $\mathbf{X}$ is stochastic or not. This result is very powerful. The implication is that if the disturbances are normally distributed, then we may carry out tests and construct confidence intervals for the parameters without making any changes in our procedures, regardless of whether the regressors are stochastic, nonstochastic, or some mix of the two.
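    A small Monte Carlo experiment can illustrate the claim about the marginal distribution of $t$. The sketch below is not part of the original analysis: it assumes NumPy, draws a new (deliberately nonnormal) regressor in every replication, keeps the disturbances normal, and records how often the usual two-sided 5 percent $t$ test rejects a true null hypothesis. The sample size, seed, and parameter values are arbitrary, and 2.101 is the tabulated two-sided 5 percent critical value for 18 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(12345)
n, K, reps = 20, 2, 5000
beta = np.array([1.0, 0.5])   # true coefficients (hypothetical values)
t_crit = 2.101                # two-sided 5 percent critical value, t[18]

rejections = 0
for _ in range(reps):
    # Stochastic regressor, redrawn in every sample from a skewed distribution.
    x = rng.exponential(scale=1.0, size=n)
    X = np.column_stack([np.ones(n), x])
    y = X @ beta + rng.normal(size=n)     # normally distributed disturbances

    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    s2 = (e @ e) / (n - K)

    # t statistic for the (true) hypothesis that the slope equals beta[1].
    t = (b[1] - beta[1]) / np.sqrt(s2 * XtX_inv[1, 1])
    if abs(t) > t_crit:
        rejections += 1

rejection_rate = rejections / reps
print(f"empirical rejection rate: {rejection_rate:.3f}")
```

    If the marginal distribution of the statistic is $t[18]$, the rejection rate should be close to 0.05 apart from simulation error.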

    4.8 FINITE-SAMPLE PROPERTIES OF LEAST SQUARES

    A summary of the results we have obtained for the least squares estimator appears in Table 4.3. For constructing confidence intervals and testing hypotheses, we derived some additional results that depended explicitly on the normality assumption. Only


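    TABLE 4.3 Finite Sample Properties of Least Squares

    General results:
    FS1. $E[\mathbf{b} \mid \mathbf{X}] = E[\mathbf{b}] = \boldsymbol{\beta}$. Least squares is unbiased.
    FS2. $\operatorname{Var}[\mathbf{b} \mid \mathbf{X}] = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$; $\operatorname{Var}[\mathbf{b}] = \sigma^2 E[(\mathbf{X}'\mathbf{X})^{-1}]$.
    FS3. Gauss-Markov theorem: The MVLUE of $\mathbf{w}'\boldsymbol{\beta}$ is $\mathbf{w}'\mathbf{b}$.
    FS4. $E[s^2 \mid \mathbf{X}] = E[s^2] = \sigma^2$.
    FS5. $\operatorname{Cov}[\mathbf{b}, \mathbf{e} \mid \mathbf{X}] = E[(\mathbf{b} - \boldsymbol{\beta})\mathbf{e}' \mid \mathbf{X}] = E[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}'\mathbf{M} \mid \mathbf{X}] = \mathbf{0}$ as $\mathbf{X}'(\sigma^2\mathbf{I})\mathbf{M} = \mathbf{0}$.

    Results that follow from Assumption A6, normally distributed disturbances:
    FS6. $\mathbf{b}$ and $\mathbf{e}$ are statistically independent. It follows that $\mathbf{b}$ and $s^2$ are uncorrelated and statistically independent.
    FS7. The exact distribution of $\mathbf{b} \mid \mathbf{X}$ is $N[\boldsymbol{\beta}, \sigma^2(\mathbf{X}'\mathbf{X})^{-1}]$.
    FS8. $(n - K)s^2/\sigma^2 \sim \chi^2[n - K]$. $s^2$ has mean $\sigma^2$ and variance $2\sigma^4/(n - K)$.

    Test statistics based on results FS6 through FS8:
    FS9. $t[n - K] = (b_k - \beta_k)/[s^2(\mathbf{X}'\mathbf{X})^{-1}_{kk}]^{1/2} \sim t[n - K]$ independently of $\mathbf{X}$.
    FS10. The test statistic for testing the null hypothesis that all slopes in the model are zero, $F[K - 1, n - K] = [R^2/(K - 1)]/[(1 - R^2)/(n - K)]$, has an $F$ distribution with $K - 1$ and $n - K$ degrees of freedom when the null hypothesis is true.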

    FS7 depends on whether $\mathbf{X}$ is stochastic or not. If so, then the marginal distribution of $\mathbf{b}$ depends on that of $\mathbf{X}$. Note the distinction between the properties of $\mathbf{b}$ established using A1 through A4 and the additional inference results obtained with the further assumption of normality of the disturbances. The primary result in the first set is the Gauss-Markov theorem, which holds regardless of the distribution of the disturbances. The important additional results brought by the normality assumption are FS9 and FS10.

    4.9 DATA PROBLEMS

    In this section, we consider three practical problems that arise in the setting of regression analysis: multicollinearity, missing observations, and outliers.

    4.9.1 MULTICOLLINEARITY

    The Gauss-Markov theorem states that among all linear unbiased estimators, the least squares estimator has the smallest variance. Although this result is useful, it does not assure us that the least squares estimator has a small variance in any absolute sense. Consider, for example, a model that contains two explanatory variables and a constant. For either slope coefficient,

    $$\operatorname{Var}[b_k] = \frac{\sigma^2}{(1 - r_{12}^2)\sum_{i=1}^{n}(x_{ik} - \bar{x}_k)^2} = \frac{\sigma^2}{(1 - r_{12}^2)S_{kk}}, \quad k = 1, 2. \tag{4-16}$$

    If the two variables are perfectly correlated, then the variance is infinite. The case of an exact linear relationship among the regressors is a serious failure of the assumptions of the model, not of the data. The more common case is one in which the variables are highly, but not perfectly, correlated. In this instance, the regression model retains all its assumed properties, although potentially severe statistical problems arise. The


    problems faced by applied researchers when regressors are highly, although not perfectly, correlated include the following symptoms:

    • Small changes in the data produce wide swings in the parameter estimates.
    • Coefficients may have very high standard errors and low significance levels even though they are jointly significant and the $R^2$ for the regression is quite high.
    • Coefficients may have the wrong sign or implausible magnitudes.

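    For convenience, define the data matrix, $\mathbf{X}$, to contain a constant and $K - 1$ other variables measured in deviations from their means. Let $\mathbf{x}_k$ denote the $k$th variable, and let $\mathbf{X}_{(k)}$ denote all the other variables (including the constant term). Then, in the inverse matrix, $(\mathbf{X}'\mathbf{X})^{-1}$, the $k$th diagonal element is

    $$\begin{aligned} \bigl(\mathbf{x}_k'\mathbf{M}_{(k)}\mathbf{x}_k\bigr)^{-1} &= \bigl[\mathbf{x}_k'\mathbf{x}_k - \mathbf{x}_k'\mathbf{X}_{(k)}\bigl(\mathbf{X}_{(k)}'\mathbf{X}_{(k)}\bigr)^{-1}\mathbf{X}_{(k)}'\mathbf{x}_k\bigr]^{-1} \\ &= \left[\mathbf{x}_k'\mathbf{x}_k\left(1 - \frac{\mathbf{x}_k'\mathbf{X}_{(k)}\bigl(\mathbf{X}_{(k)}'\mathbf{X}_{(k)}\bigr)^{-1}\mathbf{X}_{(k)}'\mathbf{x}_k}{\mathbf{x}_k'\mathbf{x}_k}\right)\right]^{-1} \\ &= \frac{1}{(1 - R_k^2)S_{kk}}, \end{aligned} \tag{4-17}$$

    where $R_k^2$ is the $R^2$ in the regression of $\mathbf{x}_k$ on all the other variables. In the multiple regression model, the variance of the $k$th least squares coefficient estimator is $\sigma^2$ times this ratio. It then follows that the more highly correlated a variable is with the other variables in the model (collectively), the greater its variance will be. In the most extreme case, in which $\mathbf{x}_k$ can be written as a linear combination of the other variables so that $R_k^2 = 1$, the variance becomes infinite. The result

    $$\operatorname{Var}[b_k] = \frac{\sigma^2}{(1 - R_k^2)\sum_{i=1}^{n}(x_{ik} - \bar{x}_k)^2} \tag{4-18}$$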

    shows the three ingredients of the precision of the $k$th least squares coefficient estimator:

    • Other things being equal, the greater the correlation of $\mathbf{x}_k$ with the other variables, the higher the variance will be, due to multicollinearity.
    • Other things being equal, the greater the variation in $\mathbf{x}_k$, the lower the variance will be. This result is shown in Figure 4.2.
    • Other things being equal, the better the overall fit of the regression, the lower the variance will be. This result would follow from a lower value of $\sigma^2$. We have yet to develop this implication, but it can be suggested by Figure 4.2 by imagining the identical figure in the right panel but with all the points moved closer to the regression line.

    Since nonexperimental data will never be orthogonal ($R_k^2 = 0$), to some extent multicollinearity will always be present. When is multicollinearity a problem? That is, when are the variances of our estimates so adversely affected by this intercorrelation that we should be concerned? Some computer packages report a variance inflation factor (VIF), $1/(1 - R_k^2)$, for each coefficient in a regression as a diagnostic statistic. As can be seen, the VIF for a variable shows the increase in $\operatorname{Var}[b_k]$ that can be attributable to the fact that this variable is not orthogonal to the other variables in the model. Another measure that is specifically directed at $\mathbf{X}$ is the condition number of $\mathbf{X}'\mathbf{X}$, which is the


    TABLE 4.4 Longley Results: Dependent Variable is Employment

                     1947-1961      Variance Inflation    1947-1962
    Constant         1,459,415                            1,169,087
    Year             -721.756       251.839               -576.464
    GNP deflator     -181.123       75.6716               -19.7681
    GNP              0.0910678      132.467               0.0643940
    Armed Forces     -0.0749370     1.55319               -0.0101453

    square root of the ratio of the largest characteristic root of $\mathbf{X}'\mathbf{X}$ (after scaling each column so that it has unit length) to the smallest. Values in excess of 20 are suggested as indicative of a problem [Belsley, Kuh, and Welsch (1980)]. (The condition number for the Longley data of Example 4.6 is over 15,000!)
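    Both diagnostics are simple to compute. The sketch below assumes NumPy and uses simulated regressors in which two variables are nearly collinear; the function names and the data are hypothetical and are not the Longley data. The VIF follows the definition $1/(1 - R_k^2)$, and the condition number is computed from the characteristic roots of $\mathbf{X}'\mathbf{X}$ after scaling each column to unit length.

```python
import numpy as np

def variance_inflation_factors(X_vars):
    """VIF = 1 / (1 - R_k^2) for each column of X_vars (constant excluded)."""
    n, k = X_vars.shape
    vifs = np.empty(k)
    for j in range(k):
        xk = X_vars[:, j]
        # Regress x_k on a constant and all the other variables.
        others = np.column_stack([np.ones(n), np.delete(X_vars, j, axis=1)])
        coef = np.linalg.lstsq(others, xk, rcond=None)[0]
        resid = xk - others @ coef
        r2_k = 1.0 - (resid @ resid) / np.sum((xk - xk.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2_k)
    return vifs

def condition_number(X):
    """Square root of the ratio of the largest to the smallest root of X'X,
    after scaling each column of X to unit length."""
    X_scaled = X / np.linalg.norm(X, axis=0)
    roots = np.linalg.eigvalsh(X_scaled.T @ X_scaled)
    return np.sqrt(roots.max() / roots.min())

# Hypothetical data: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X_vars = np.column_stack([x1, x2, x3])
X = np.column_stack([np.ones(n), X_vars])

vifs = variance_inflation_factors(X_vars)
cond = condition_number(X)
print("VIFs:", np.round(vifs, 2))
print("condition number:", round(float(cond), 2))
```

    In this construction the first two variables show very large VIFs while the third stays near 1, which is the pattern the diagnostic is meant to reveal.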
    Example 4.6 Multicollinearity in the Longley Data

    The data in Table F4.2 were assembled by J. Longley (1967) for the purpose of assessing the accuracy of least squares computations by computer programs. (These data are still widely used for that purpose.) The Longley data are notorious for severe multicollinearity. Note, for example, the last year of the data set. The last observation does not appear to be unusual. But, the results in Table 4.4 show the dramatic effect of dropping this single observation from a regression of employment on a constant and the other variables. The last coefficient rises by 600 percent, and the third rises by 800 percent.

    Several strategies have been proposed for finding and coping with multicollinearity.⁴ Under the view that a multicollinearity problem arises because of a shortage of information, one suggestion is to obtain more data. One might argue that if analysts had such additional information available at the outset, they ought to have used it before reaching this juncture. More information need not mean more observations, however. The obvious practical remedy (and surely the most frequently used) is to drop variables suspected of causing the problem from the regression, that is, to impose on the regression an assumption, possibly erroneous, that the problem variable does not appear in the model. In doing so, one encounters the problems of specification that we will discuss in Section 8.2. If the variable that is dropped actually belongs in the model (in the sense that its coefficient, $\beta_k$, is not zero), then estimates of the remaining coefficients will be biased, possibly severely so. On the other hand, overfitting, that is, trying to estimate a model that is too large, is a common error, and dropping variables from an excessively specified model might have some virtue.

    Several other practical approaches have also been suggested. The ridge regression estimator is $\mathbf{b}_r = [\mathbf{X}'\mathbf{X} + r\mathbf{D}]^{-1}\mathbf{X}'\mathbf{y}$, where $\mathbf{D}$ is a diagonal matrix. This biased estimator has a covariance matrix unambiguously smaller than that of $\mathbf{b}$. The tradeoff of some bias for smaller variance may be worth making (see Judge et al., 1985), but, nonetheless, economists are generally averse to biased estimators, so this approach has seen little practical use. Another approach sometimes used [see, e.g., Gurmu, Rilstone, and Stern (1999)] is to use a small number, say $L$, of principal components constructed from the $K$ original variables. [See Johnson and Wichern (1999).] The problem here is that if the original model in the form $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ were correct, then it is unclear what one is estimating when one regresses $\mathbf{y}$ on some
    ⁴ See Hill and Adkins (2001) for a description of the standard set of tools for diagnosing collinearity.


    small set of linear combinations of the columns of $\mathbf{X}$. Algebraically, it is simple; at least for the principal components case, in which we regress $\mathbf{y}$ on $\mathbf{Z} = \mathbf{X}\mathbf{C}_L$ to obtain $\mathbf{d}$, it follows that $E[\mathbf{d}] = \boldsymbol{\delta} = \mathbf{C}_L'\boldsymbol{\beta}$. In an economic context, if $\boldsymbol{\beta}$ has an interpretation, then it is unlikely that $\boldsymbol{\delta}$ will. (How do we interpret the price elasticity plus minus twice the income elasticity?)

    Using diagnostic tools to detect multicollinearity could be viewed as an attempt to distinguish a bad model from bad data. But, in fact, the problem only stems from a prior opinion with which the data seem to be in conflict. A finding that suggests multicollinearity is adversely affecting the estimates seems to suggest that but for this effect, all the coefficients would be statistically significant and of the right sign. Of course, this situation need not be the case. If the data suggest that a variable is unimportant in a model, then, the theory notwithstanding, the researcher ultimately has to decide how strong the commitment is to that theory. Suggested remedies for multicollinearity might well amount to attempts to force the theory on the data.
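    The ridge estimator $\mathbf{b}_r = [\mathbf{X}'\mathbf{X} + r\mathbf{D}]^{-1}\mathbf{X}'\mathbf{y}$ mentioned among the remedies above can be sketched in a few lines. This illustration assumes NumPy and simulated data. The text says only that $\mathbf{D}$ is a diagonal matrix; taking $\mathbf{D}$ to be the diagonal of $\mathbf{X}'\mathbf{X}$ when none is supplied is an assumption made here for illustration, as are the function name and the value of $r$. The function also returns the matrix that multiplies $\sigma^2$ in the conditional covariance matrix of $\mathbf{b}_r$, so that it can be compared with $(\mathbf{X}'\mathbf{X})^{-1}$.

```python
import numpy as np

def ridge(X, y, r, D=None):
    """Ridge estimator b_r = [X'X + rD]^{-1} X'y and Var[b_r|X] / sigma^2."""
    S = X.T @ X
    if D is None:
        D = np.diag(np.diag(S))      # assumed choice of the diagonal matrix D
    A_inv = np.linalg.inv(S + r * D)
    b_r = A_inv @ X.T @ y
    var_factor = A_inv @ S @ A_inv   # Var[b_r|X] = sigma^2 * var_factor
    return b_r, var_factor

# Hypothetical data with two moderately collinear regressors.
rng = np.random.default_rng(11)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_ridge, var_ridge = ridge(X, y, r=0.1)

# Var[b|X] - Var[b_r|X], up to the common factor sigma^2.
diff = np.linalg.inv(X.T @ X) - var_ridge
diff = (diff + diff.T) / 2.0         # symmetrize before computing the roots
smallest_root = np.linalg.eigvalsh(diff).min()
```

    A nonnegative smallest characteristic root of the difference is the sense in which the covariance matrix of $\mathbf{b}_r$ is smaller than that of $\mathbf{b}$; setting $r = 0$ returns the least squares estimator.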
    4.9.2 MISSING OBSERVATIONS

    It is fairly common for a data set to have gaps, for a variety of reasons. Perhaps the most common occurrence of this problem is in survey data, in which it often happens that respondents simply fail to answer the questions. In a time series, the data may be missing because they do not exist at the frequency we wish to observe them; for example, the model may specify monthly relationships, but some variables are observed only quarterly.

    There are two possible cases to consider, depending on why the data are missing. One is that the data are simply unavailable, for reasons unknown to the analyst and unrelated to the completeness of the other observations in the sample. If this is the case, then the complete observations in the sample constitute a usable data set, and the only issue is what possibly helpful information could be salvaged from the incomplete observations. Griliches (1986) calls this the ignorable case in that, for purposes of estimation, if we are not concerned with efficiency, then we may simply ignore the problem. A second case, which has attracted a great deal of attention in the econometrics literature, is that in which the gaps in the data set are not benign but are systematically related to the phenomenon being modeled. This case happens most often in surveys when the data are self-selected or self-reported.⁵ For example, if a survey were designed to study expenditure patterns and if high-income individuals tended to withhold information about their income, then the gaps in the data set would represent more than just missing information. In this case, the complete observations would be qualitatively different. We treat this second case in Chapter 22, so we shall defer our discussion until later.

    In general, not much is known about the properties of estimators based on using predicted values to fill missing values of $y$. Those results we do have are largely from simulation studies based on a particular data set or pattern of missing data. The results of these Monte Carlo studies are usually difficult to generalize. The overall conclusion
    ⁵ The vast surveys of Americans' opinions about sex by Ann Landers (1984, passim) and Shere Hite (1987) constitute two celebrated studies that were surely tainted by a heavy dose of self-selection bias. The latter was pilloried in numerous publications for purporting to represent the population at large instead of the opinions of those strongly enough inclined to respond to the survey. The first was presented with much greater modesty.


    seems to be that in a single-equation regression context, filling in missing values of $y$ leads to biases in the estimator which are difficult to quantify.

    For the case of missing data in the regressors, it helps to consider the simple regression and multiple regression cases separately. In the first case, $\mathbf{X}$ has two columns: the column of 1s for the constant and a column with some blanks where the missing data would be if we had them. Several schemes have been suggested for filling the blanks. The zero-order method of replacing each missing $x$ with $\bar{x}$ results in no changes and is equivalent to dropping the incomplete data. (See Exercise 7 in Chapter 3.) However, the $R^2$ will be lower. An alternative, modified zero-order regression, is to fill the second column of $\mathbf{X}$ with zeros and add a variable that takes the value one for missing observations and zero for complete ones.⁶ We leave it as an exercise to show that this is algebraically identical to simply filling the gaps with $\bar{x}$. Last, there is the possibility of computing fitted values for the missing $x$'s by a regression of $x$ on $y$ in the complete data. The sampling properties of the resulting estimator are largely unknown, but what evidence there is suggests that this is not a beneficial way to proceed.⁷
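    The equivalence of the zero-order and modified zero-order methods for the slope in the simple regression case can be checked numerically. The sketch below assumes NumPy and simulated data in which the first ten values of $x$ are treated as missing; the variable names and the data are hypothetical. It compares the slope estimate from (a) the complete observations only, (b) filling the gaps with the mean of the observed $x$, and (c) filling with zeros and adding a missing-data indicator.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)

# Treat the first ten values of x as missing (y is observed for every row).
missing = np.zeros(n, dtype=bool)
missing[:10] = True
ones = np.ones(n)

def slope(X, y_vec):
    """Least squares coefficient on the second column of X."""
    return np.linalg.lstsq(X, y_vec, rcond=None)[0][1]

# (a) Drop the incomplete observations.
b_complete = slope(np.column_stack([ones[~missing], x[~missing]]), y[~missing])

# (b) Zero-order method: replace each missing x with the mean of the observed x.
x_bar = x[~missing].mean()
x_filled = np.where(missing, x_bar, x)
b_zero_order = slope(np.column_stack([ones, x_filled]), y)

# (c) Modified zero-order method: fill with zeros and add an indicator
#     that equals one for the observations with missing x.
x_zeros = np.where(missing, 0.0, x)
indicator = missing.astype(float)
b_modified = slope(np.column_stack([ones, x_zeros, indicator]), y)

print(b_complete, b_zero_order, b_modified)
```

    All three slope estimates coincide, so neither fill-in scheme adds information about the slope beyond what the complete observations already provide.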
    4.9.3 REGRESSION DIAGNOSTICS AND INFLUENTIAL DATA POINTS

    Even in the absence of multicollinearity or other data problems, it is worthwhile to examine one's data closely for two reasons. First, the identification of outliers in the data is useful, particularly in relatively small cross sections in which the identity and perhaps even the ultimate source of the data point may be known. Second, it may be possible to ascertain which, if any, particular observations are especially influential in the results obtained. As such, the identification of these data points may call for further study. It is worth emphasizing, though, that there is a certain danger in singling out particular observations for scrutiny or even elimination from the sample on the basis of statistical results that are based on those data. At the extreme, this step may invalidate the usual inference procedures.

    Of particular importance in this analysis is the projection matrix or hat matrix:
    $$\mathbf{P} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'. \tag{4-19}$$

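    This matrix appeared earlier as the matrix that projects any $n \times 1$ vector into the column space of $\mathbf{X}$. For any vector $\mathbf{y}$, $\mathbf{P}\mathbf{y}$ is the set of fitted values in the least squares regression of $\mathbf{y}$ on $\mathbf{X}$. The least squares residuals are $\mathbf{e} = \mathbf{M}\mathbf{y} = \mathbf{M}\boldsymbol{\varepsilon} = (\mathbf{I} - \mathbf{P})\boldsymbol{\varepsilon}$, so the covariance matrix for the least squares residual vector is
    $$E[\mathbf{e}\mathbf{e}'] = \sigma^2\mathbf{M} = \sigma^2(\mathbf{I} - \mathbf{P}).$$
    To identify which residuals are significantly large, we first standardize them by dividing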
    ⁶ See Maddala (1977a, p. 202).

    ⁷ Afifi and Elashoff (1966, 1967) and Haitovsky (1968). Griliches (1986) considers a number of other possibilities.

    FIGURE 4.3 Standardized Residuals for the Longley Data. [Figure omitted: standardized residuals (vertical axis, from -3.0 to 3.0) plotted against YEAR (horizontal axis, 1946 to 1962).]

    by the appropriate standard deviations. Thus, we would use
    $$\hat{e}_i = \frac{e_i}{[s^2(1 - p_{ii})]^{1/2}} = \frac{e_i}{(s^2 m_{ii})^{1/2}}, \tag{4-20}$$

    where $e_i$ is the ith least squares residual, $s^2 = \mathbf{e}'\mathbf{e}/(n - K)$, $p_{ii}$ is the ith diagonal element of $\mathbf{P}$, and $m_{ii}$ is the ith diagonal element of $\mathbf{M}$. It is easy to show (we leave it as an exercise) that $e_i/m_{ii} = y_i - \mathbf{x}_i'\mathbf{b}(i)$, where $\mathbf{b}(i)$ is the least squares slope vector computed without this observation, so the standardization is a natural way to investigate whether the particular observation differs substantially from what should be expected given the model specification. Dividing by $s^2$, or better, $s(i)^2$ (the variance estimate computed without observation i), scales the observations so that the value 2.0 [suggested by Belsley et al. (1980)] provides an appropriate benchmark. Figure 4.3 illustrates for the Longley data of the previous example. Apparently, 1956 was an unusual year according to this model. What to do with outliers is a question. Discarding an observation in the middle of a time series is probably a bad idea, though we may hope to learn something about the data in this way. For a cross section, one may be able to single out observations that do not conform to the model with this technique.
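As an illustration of (4-19) and (4-20), the sketch below builds the hat matrix for simulated data, standardizes the residuals, and checks the leave-one-out identity $e_i/m_{ii} = y_i - \mathbf{x}_i'\mathbf{b}(i)$ by brute force. The data are simulated for the purpose and are not the Longley data; all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix P = X (X'X)^{-1} X', as in (4-19)
M = np.eye(n) - P
e = M @ y                                # least squares residuals
s2 = e @ e / (n - K)
m = np.diag(M)                           # m_ii = 1 - p_ii
e_std = e / np.sqrt(s2 * m)              # standardized residuals, as in (4-20)

# Prediction errors y_i - x_i' b(i), refitting without observation i each time.
loo = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    loo[i] = y[i] - X[i] @ b_i

flagged = np.flatnonzero(np.abs(e_std) > 2.0)   # the 2.0 benchmark
print("max |e_i/m_ii - loo_i| =", np.max(np.abs(e / m - loo)))
print("observations beyond the benchmark:", flagged)
```

The brute-force loop is only there to verify the identity; in practice the diagonal of $\mathbf{M}$ gives the same quantities without refitting.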

    4.10 SUMMARY AND CONCLUSIONS

    This chapter has examined a set of properties of the least squares estimator that will apply in all samples, including unbiasedness and efficiency among unbiased estimators. The assumption of normality of the disturbances produces the distributions of some test statistics that are useful for a statistical assessment of the validity of the regression model. The finite-sample results obtained in this chapter are listed in Table 4.3.


    We also considered some practical problems that arise when data are less than perfect for the estimation and analysis of the regression model, including multicollinearity and missing observations.

    The formal assumptions of the classical model are pivotal in the results of this chapter. All of them are likely to be violated in more general settings than the one considered here. For example, in most cases examined later in the book, the estimator has a possible bias, but that bias diminishes with increasing sample sizes. Also, we are going to be interested in hypothesis tests of the type considered here, but at the same time, the assumption of normality is narrow, so it will be necessary to extend the model to allow nonnormal disturbances. These and other "large-sample" extensions of the linear model will be considered in Chapter 5.

    Key Terms and Concepts

    • Assumptions
    • Condition number
    • Confidence interval
    • Estimator
    • Gauss-Markov Theorem
    • Hat matrix
    • Ignorable case
    • Linear estimator
    • Linear unbiased estimator
    • Mean squared error
    • Minimum mean squared error
    • Minimum variance linear unbiased estimator
    • Missing observations
    • Multicollinearity
    • Oaxaca's decomposition
    • Optimal linear predictor
    • Orthogonal random variables
    • Principal components
    • Projection matrix
    • Sampling distribution
    • Sampling variance
    • Semiparametric
    • Standard error
    • Standard error of the regression
    • Statistical properties
    • Stochastic regressors
    • t ratio

    Exercises

    1. Suppose that you have two independent unbiased estimators of the same parameter $\theta$, say $\hat{\theta}_1$ and $\hat{\theta}_2$, with different variances $v_1$ and $v_2$. What linear combination $\hat{\theta} = c_1\hat{\theta}_1 + c_2\hat{\theta}_2$ is the minimum variance unbiased estimator of $\theta$?

    2. Consider the simple regression $y_i = \beta x_i + \varepsilon_i$, where $E[\varepsilon \mid x] = 0$ and $E[\varepsilon^2 \mid x] = \sigma^2$.
       a. What is the minimum mean squared error linear estimator of $\beta$? [Hint: Let the estimator be $\hat{\beta} = \mathbf{c}'\mathbf{y}$. Choose $\mathbf{c}$ to minimize $\mathrm{Var}[\hat{\beta}] + [E(\hat{\beta} - \beta)]^2$. The answer is a function of the unknown parameters.]
       b. For the estimator in part a, show that the ratio of the mean squared error of $\hat{\beta}$ to that of the ordinary least squares estimator $b$ is
          $$\frac{\mathrm{MSE}[\hat{\beta}]}{\mathrm{MSE}[b]} = \frac{\tau^2}{1 + \tau^2}, \quad \text{where } \tau^2 = \frac{\beta^2}{\sigma^2/\mathbf{x}'\mathbf{x}}.$$
          Note that $\tau^2$ is the square of the population analog to the "t ratio" for testing the hypothesis that $\beta = 0$, which is given in (4-14). How do you interpret the behavior of this ratio as $\tau \rightarrow \infty$?

    3. Suppose that the classical regression model applies but that the true value of the constant is zero. Compare the variance of the least squares slope estimator computed without a constant term with that of the estimator computed with an unnecessary constant term.


    4. Suppose that the regression model is $y_i = \alpha + \beta x_i + \varepsilon_i$, where the disturbances $\varepsilon_i$ have $f(\varepsilon_i) = (1/\lambda)\exp(-\varepsilon_i/\lambda)$, $\varepsilon_i \geq 0$. This model is rather peculiar in that all the disturbances are assumed to be positive. Note that the disturbances have $E[\varepsilon_i \mid x_i] = \lambda$ and $\mathrm{Var}[\varepsilon_i \mid x_i] = \lambda^2$. Show that the least squares slope is unbiased but that the intercept is biased.

    5. Prove that the least squares intercept estimator in the classical regression model is the minimum variance linear unbiased estimator.

    6. As a profit-maximizing monopolist, you face the demand curve $Q = \alpha + \beta P + \varepsilon$. In the past, you have set the following prices and sold the accompanying quantities:

       Q:   3   3   7   6  10  15  16  13   9  15   9  15  12  18  21
       P:  18  16  17  12  15  15   4  13  11   6   8  10   7   7   7

       Suppose that your marginal cost is 10. Based on the least squares regression, compute a 95 percent confidence interval for the expected value of the profit-maximizing output.

    7. The following sample moments for $\mathbf{x} = [1, x_1, x_2, x_3]$ were computed from 100 observations produced using a random number generator:
       $$\mathbf{X}'\mathbf{X} = \begin{bmatrix} 100 & 123 & 96 & 109 \\ 123 & 252 & 125 & 189 \\ 96 & 125 & 167 & 146 \\ 109 & 189 & 146 & 168 \end{bmatrix}, \quad \mathbf{X}'\mathbf{y} = \begin{bmatrix} 460 \\ 810 \\ 615 \\ 712 \end{bmatrix}, \quad \mathbf{y}'\mathbf{y} = 3924.$$
       The true model underlying these data is $y = x_1 + x_2 + x_3 + \varepsilon$.
       a. Compute the simple correlations among the regressors.
       b. Compute the ordinary least squares coefficients in the regression of y on a constant, $x_1$, $x_2$, and $x_3$.
       c. Compute the ordinary least squares coefficients in the regression of y on a constant, $x_1$, and $x_2$; on a constant, $x_1$, and $x_3$; and on a constant, $x_2$, and $x_3$.
       d. Compute the variance inflation factor associated with each variable.
       e. The regressors are obviously collinear. Which is the problem variable?

    8. Consider the multiple regression of $\mathbf{y}$ on K variables $\mathbf{X}$ and an additional variable $\mathbf{z}$. Prove that under the assumptions A1 through A6 of the classical regression model, the true variance of the least squares estimator of the slopes on $\mathbf{X}$ is larger when $\mathbf{z}$ is included in the regression than when it is not. Does the same hold for the sample estimate of this covariance matrix? Why or why not? Assume that $\mathbf{X}$ and $\mathbf{z}$ are nonstochastic and that the coefficient on $\mathbf{z}$ is nonzero.

    9. For the classical normal regression model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ with no constant term and K regressors, assuming that the true value of $\boldsymbol{\beta}$ is zero, what is the exact expected value of $F[K, n - K] = (R^2/K)/[(1 - R^2)/(n - K)]$?

    10. Prove that $E[\mathbf{b}'\mathbf{b}] = \boldsymbol{\beta}'\boldsymbol{\beta} + \sigma^2\sum_{k=1}^{K}(1/\lambda_k)$, where $\mathbf{b}$ is the ordinary least squares estimator and $\lambda_k$ is a characteristic root of $\mathbf{X}'\mathbf{X}$.

    11. Data on U.S. gasoline consumption for the years 1960 to 1995 are given in Table F2.2.
       a. Compute the multiple regression of per capita consumption of gasoline, G/pop, on all the other explanatory variables, including the time trend, and report all results. Do the signs of the estimates agree with your expectations?


       b. Test the hypothesis that at least in regard to demand for gasoline, consumers do not differentiate between changes in the prices of new and used cars.
       c. Estimate the own price elasticity of demand, the income elasticity, and the cross-price elasticity with respect to changes in the price of public transportation.
       d. Reestimate the regression in logarithms so that the coefficients are direct estimates of the elasticities. (Do not use the log of the time trend.) How do your estimates compare with the results in the previous question? Which specification do you prefer?
       e. Notice that the price indices for the automobile market are normalized to 1967, whereas the aggregate price indices are anchored at 1982. Does this discrepancy affect the results? How? If you were to renormalize the indices so that they were all 1.000 in 1982, then how would your results change?

    Greene 50240

    book

    June 3 2002

    9 59

    5 LARGE-SAMPLE PROPERTIES OF THE LEAST SQUARES AND INSTRUMENTAL VARIABLES ESTIMATORS

    5.1 INTRODUCTION

    The discussion thus far has concerned finite-sample properties of the least squares estimator. We derived its exact mean and variance and the precise distribution of the estimator and several test statistics under the assumptions of normally distributed disturbances and independent observations. These results are independent of the sample size. But the classical regression model with normally distributed disturbances and independent observations is a special case that does not include many of the most common applications, such as panel data and most time series models. This chapter will generalize the classical regression model by relaxing these two important assumptions.¹

    The linear model is one of relatively few settings in which any definite statements can be made about the exact finite-sample properties of any estimator. In most cases, the only known properties of the estimators are those that apply to large samples. We can only approximate finite-sample behavior by using what we know about large-sample properties. This chapter will examine the asymptotic properties of the parameter estimators in the classical regression model. In addition to the least squares estimator, this chapter will also introduce an alternative technique, the method of instrumental variables. In this case, only the large-sample properties are known.

    5.2 ASYMPTOTIC PROPERTIES OF THE LEAST SQUARES ESTIMATOR

    Using only assumptions A1 through A4 of the classical model (as listed in Table 4.1), we have established that the least squares estimators of the unknown parameters, $\boldsymbol{\beta}$ and $\sigma^2$, have the exact, finite-sample properties listed in Table 4.3. For this basic model, it is straightforward to derive the large-sample properties of the least squares estimator. The normality assumption, A6, becomes inessential at this point and will be discarded save for brief discussions of maximum likelihood estimation in Chapters 10 and 17. This section will consider various forms of Assumption A5, the data generating mechanism.

    ¹ Most of this discussion will use our earlier results on asymptotic distributions. It may be helpful to review Appendix D before proceeding.


    5.2.1 CONSISTENCY OF THE LEAST SQUARES ESTIMATOR OF $\boldsymbol{\beta}$

    To begin, we leave the data generating mechanism for $\mathbf{X}$ unspecified: $\mathbf{X}$ may be any mixture of constants and random variables generated independently of the process that generates $\boldsymbol{\varepsilon}$. We do make two crucial assumptions. The first is a modification of Assumption A5 in Table 4.1:

    A5a. $(\mathbf{x}_i, \varepsilon_i)$, $i = 1, \ldots, n$, is a sequence of independent observations.

    The second concerns the behavior of the data in large samples:
    $$\operatorname*{plim}_{n \rightarrow \infty} \frac{\mathbf{X}'\mathbf{X}}{n} = \mathbf{Q}, \quad \text{a positive definite matrix.} \tag{5-1}$$

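    [We will return to (5-1) shortly.] The least squares estimator may be written
    $$\mathbf{b} = \boldsymbol{\beta} + \left(\frac{\mathbf{X}'\mathbf{X}}{n}\right)^{-1}\left(\frac{\mathbf{X}'\boldsymbol{\varepsilon}}{n}\right). \tag{5-2}$$
    If $\mathbf{Q}^{-1}$ exists, then
    $$\operatorname{plim} \mathbf{b} = \boldsymbol{\beta} + \mathbf{Q}^{-1}\operatorname{plim}\left(\frac{\mathbf{X}'\boldsymbol{\varepsilon}}{n}\right)$$
    because the inverse is a continuous function of the original matrix. (We have invoked Theorem D.14.) We require the probability limit of the last term. Let
    $$\frac{1}{n}\mathbf{X}'\boldsymbol{\varepsilon} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\varepsilon_i = \frac{1}{n}\sum_{i=1}^{n}\mathbf{w}_i = \bar{\mathbf{w}}. \tag{5-3}$$
    Then
    $$\operatorname{plim} \mathbf{b} = \boldsymbol{\beta} + \mathbf{Q}^{-1}\operatorname{plim} \bar{\mathbf{w}}.$$
    From the exogeneity Assumption A3, we have $E[\mathbf{w}_i] = E_{\mathbf{x}}[E[\mathbf{w}_i \mid \mathbf{x}_i]] = E_{\mathbf{x}}[\mathbf{x}_i E[\varepsilon_i \mid \mathbf{x}_i]] = \mathbf{0}$, so the exact expectation is $E[\bar{\mathbf{w}}] = \mathbf{0}$. For any element in $\mathbf{x}_i$ that is nonstochastic, the zero expectations follow from the marginal distribution of $\varepsilon_i$. We now consider the variance. By (B-70),
    $$\mathrm{Var}[\bar{\mathbf{w}}] = E[\mathrm{Var}[\bar{\mathbf{w}} \mid \mathbf{X}]] + \mathrm{Var}[E[\bar{\mathbf{w}} \mid \mathbf{X}]].$$
    The second term is zero because $E[\varepsilon_i \mid \mathbf{x}_i] = 0$. To obtain the first, we use $E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}' \mid \mathbf{X}] = \sigma^2\mathbf{I}$, so
    $$\mathrm{Var}[\bar{\mathbf{w}} \mid \mathbf{X}] = E[\bar{\mathbf{w}}\bar{\mathbf{w}}' \mid \mathbf{X}] = \frac{1}{n}\mathbf{X}'E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}' \mid \mathbf{X}]\mathbf{X}\frac{1}{n} = \left(\frac{\sigma^2}{n}\right)\left(\frac{\mathbf{X}'\mathbf{X}}{n}\right).$$
    Therefore,
    $$\mathrm{Var}[\bar{\mathbf{w}}] = \left(\frac{\sigma^2}{n}\right)E\left(\frac{\mathbf{X}'\mathbf{X}}{n}\right).$$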

    The variance will collapse to zero if the expectation in parentheses is (or converges to) a constant matrix, so that the leading scalar will dominate the product as n increases. Assumption (5-1) should be sufficient. (Theoretically, the expectation could diverge while the probability limit does not, but this case would not be relevant for practical purposes.) It then follows that
    $$\lim_{n \rightarrow \infty} \mathrm{Var}[\bar{\mathbf{w}}] = 0 \cdot \mathbf{Q} = \mathbf{0}.$$


    Since the mean of $\bar{\mathbf{w}}$ is identically zero and its variance converges to zero, $\bar{\mathbf{w}}$ converges in mean square to zero, so $\operatorname{plim} \bar{\mathbf{w}} = \mathbf{0}$. Therefore,
    $$\operatorname{plim} \frac{\mathbf{X}'\boldsymbol{\varepsilon}}{n} = \mathbf{0}, \tag{5-4}$$
    so
    $$\operatorname{plim} \mathbf{b} = \boldsymbol{\beta} + \mathbf{Q}^{-1} \cdot \mathbf{0} = \boldsymbol{\beta}. \tag{5-5}$$

    This result establishes that under Assumptions A1-A4 and the additional assumption (5-1), $\mathbf{b}$ is a consistent estimator of $\boldsymbol{\beta}$ in the classical regression model. Time-series settings that involve time trends, polynomial time series, and trending variables often pose cases in which the preceding assumptions are too restrictive. A somewhat weaker set of assumptions about $\mathbf{X}$ that is broad enough to include most of these is the Grenander conditions listed in Table 5.1.² The conditions ensure that the data matrix is "well behaved" in large samples. The assumptions are very weak and are likely to be satisfied by almost any data set encountered in practice.³
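The consistency result can be illustrated by simulation. The sketch below is an assumption-laden example rather than anything from the text: it draws a standard normal regressor, uses skewed (centered exponential) disturbances to stress that normality is not needed, and reports the estimation error at increasing sample sizes. The function name `ols_error` and the parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
beta = np.array([1.0, 0.5])

def ols_error(n):
    """Largest absolute estimation error, max|b - beta|, in one sample of size n."""
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    eps = rng.exponential(1.0, size=n) - 1.0   # mean 0, variance 1, skewed
    y = X @ beta + eps
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.max(np.abs(b - beta))

errors = {n: ols_error(n) for n in (100, 10_000, 1_000_000)}
for n, err in errors.items():
    print(f"n = {n:>9,d}   max|b - beta| = {err:.5f}")
```

A single draw at each sample size is shown, so the errors need not fall monotonically, but their typical size shrinks at the rate $1/\sqrt{n}$ implied by $\mathrm{Var}[\bar{\mathbf{w}}] = (\sigma^2/n)E(\mathbf{X}'\mathbf{X}/n)$.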
    5.2.2 ASYMPTOTIC NORMALITY OF THE LEAST SQUARES ESTIMATOR

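    To derive the asymptotic distribution of the least squares estimator, we shall use the results of Section D.3. We will make use of some basic central limit theorems, so in addition to Assumption A3 (uncorrelatedness), we will assume that the observations are independent. It follows from (5-2) that
    $$\sqrt{n}(\mathbf{b} - \boldsymbol{\beta}) = \left(\frac{\mathbf{X}'\mathbf{X}}{n}\right)^{-1}\left(\frac{1}{\sqrt{n}}\right)\mathbf{X}'\boldsymbol{\varepsilon}. \tag{5-6}$$
    Since the inverse matrix is a continuous function of the original matrix, $\operatorname{plim}(\mathbf{X}'\mathbf{X}/n)^{-1} = \mathbf{Q}^{-1}$. Therefore, if the limiting distribution of the random vector in (5-6) exists, then that limiting distribution is the same as that of
    $$\left[\operatorname{plim}\left(\frac{\mathbf{X}'\mathbf{X}}{n}\right)^{-1}\right]\left(\frac{1}{\sqrt{n}}\right)\mathbf{X}'\boldsymbol{\varepsilon} = \mathbf{Q}^{-1}\left(\frac{1}{\sqrt{n}}\right)\mathbf{X}'\boldsymbol{\varepsilon}. \tag{5-7}$$
    Thus, we must establish the limiting distribution of
    $$\left(\frac{1}{\sqrt{n}}\right)\mathbf{X}'\boldsymbol{\varepsilon} = \sqrt{n}\left(\bar{\mathbf{w}} - E[\bar{\mathbf{w}}]\right), \tag{5-8}$$
    where $E[\bar{\mathbf{w}}] = \mathbf{0}$. [See (5-3).] We can use the multivariate Lindeberg-Feller version of the central limit theorem (D.19.A) to obtain the limiting distribution of $\sqrt{n}\bar{\mathbf{w}}$.⁴ Using that formulation, $\bar{\mathbf{w}}$ is the average of n independent random vectors $\mathbf{w}_i = \mathbf{x}_i\varepsilon_i$, with means $\mathbf{0}$ and variances
    $$\mathrm{Var}[\mathbf{x}_i\varepsilon_i] = \sigma^2 E[\mathbf{x}_i\mathbf{x}_i'] = \sigma^2\mathbf{Q}_i. \tag{5-9}$$

    ² Judge et al. (1985, p. 162).

    ³ White (2001) continues this line of analysis.

    ⁴ Note that the Lindeberg-Levy variant does not apply because $\mathrm{Var}[\mathbf{w}_i]$ is not necessarily constant.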


    TABLE 5.1 Grenander Conditions for Well Behaved Data

    G1. For each column of $\mathbf{X}$, $\mathbf{x}_k$, if $d_{nk}^2 = \mathbf{x}_k'\mathbf{x}_k$, then $\lim_{n \rightarrow \infty} d_{nk}^2 = +\infty$. Hence, $\mathbf{x}_k$ does not degenerate to a sequence of zeros. Sums of squares will continue to grow as the sample size increases. No variable will degenerate to a sequence of zeros.

    G2. $\lim_{n \rightarrow \infty} x_{ik}^2/d_{nk}^2 = 0$ for all $i = 1, \ldots, n$. This condition implies that no single observation will ever dominate $\mathbf{x}_k'\mathbf{x}_k$, and as $n \rightarrow \infty$, individual observations will become less important.

    G3. Let $\mathbf{R}_n$ be the sample correlation matrix of the columns of $\mathbf{X}$, excluding the constant term if there is one. Then $\lim_{n \rightarrow \infty} \mathbf{R}_n = \mathbf{C}$, a positive definite matrix. This condition implies that the full rank condition will always be met. We have already assumed that $\mathbf{X}$ has full rank in a finite sample, so this assumption ensures that the condition will never be violated.
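To see why the Grenander conditions are weaker than (5-1), consider a simple time trend, $x_t = t$. This example is an illustration chosen here, not one worked in the text, and the function name `trend_diagnostics` is hypothetical. The sketch computes the sum of squares in G1, the largest single-observation share in G2, and the average $\mathbf{x}'\mathbf{x}/n$ that (5-1) requires to converge.

```python
import numpy as np

def trend_diagnostics(n):
    """Return (d_n^2, max_i x_i^2 / d_n^2, x'x / n) for the trend x_t = t, t = 1..n."""
    t = np.arange(1, n + 1, dtype=float)
    d2 = t @ t                       # G1: sum of squares d_n^2
    max_share = t[-1] ** 2 / d2      # G2: largest single contribution to d_n^2
    return d2, max_share, d2 / n     # last entry is the quantity in (5-1)

for n in (10, 100, 1_000, 10_000):
    d2, share, avg = trend_diagnostics(n)
    print(f"n = {n:>6d}   d2 = {d2:.3e}   max share = {share:.5f}   x'x/n = {avg:.3e}")
```

The sum of squares grows without bound and the largest share falls toward zero (roughly like 3/n), so G1 and G2 hold, while $\mathbf{x}'\mathbf{x}/n$ diverges, so (5-1) does not.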

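    The variance of $\sqrt{n}\bar{\mathbf{w}}$ is
    $$\sigma^2\bar{\mathbf{Q}}_n = \sigma^2\left(\frac{1}{n}\right)[\mathbf{Q}_1 + \mathbf{Q}_2 + \cdots + \mathbf{Q}_n]. \tag{5-10}$$
    As long as the sum is not dominated by any particular term and the regressors are well behaved, which in this case means that (5-1) holds,
    $$\lim_{n \rightarrow \infty} \sigma^2\bar{\mathbf{Q}}_n = \sigma^2\mathbf{Q}. \tag{5-11}$$
    Therefore, we may apply the Lindeberg-Feller central limit theorem to the vector $\sqrt{n}\bar{\mathbf{w}}$, as we did in Section D.3 for the univariate case $\sqrt{n}\bar{x}$. We now have the elements we need for a formal result. If $[\mathbf{x}_i\varepsilon_i]$, $i = 1, \ldots, n$, are independent vectors distributed with mean $\mathbf{0}$ and variance $\sigma^2\mathbf{Q}_i < \infty$, and if (5-1) holds, then
    $$\left(\frac{1}{\sqrt{n}}\right)\mathbf{X}'\boldsymbol{\varepsilon} \xrightarrow{d} N[\mathbf{0}, \sigma^2\mathbf{Q}]. \tag{5-12}$$
    It then follows that
    $$\mathbf{Q}^{-1}\left(\frac{1}{\sqrt{n}}\right)\mathbf{X}'\boldsymbol{\varepsilon} \xrightarrow{d} N[\mathbf{Q}^{-1}\mathbf{0}, \mathbf{Q}^{-1}(\sigma^2\mathbf{Q})\mathbf{Q}^{-1}]. \tag{5-13}$$
    Combining terms,
    $$\sqrt{n}(\mathbf{b} - \boldsymbol{\beta}) \xrightarrow{d} N[\mathbf{0}, \sigma^2\mathbf{Q}^{-1}]. \tag{5-14}$$
    Using the technique of Section D.3, we obtain the asymptotic distribution of $\mathbf{b}$: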

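    THEOREM 5.1 Asymptotic Distribution of b with Independent Observations
    If $\{\varepsilon_i\}$ are independently distributed with mean zero and finite variance $\sigma^2$ and $x_{ik}$ is such that the Grenander conditions are met, then
    $$\mathbf{b} \overset{a}{\sim} N\left[\boldsymbol{\beta}, \frac{\sigma^2}{n}\mathbf{Q}^{-1}\right]. \tag{5-15}$$

    In practice, it is necessary to estimate $(1/n)\mathbf{Q}^{-1}$ with $(\mathbf{X}'\mathbf{X})^{-1}$ and $\sigma^2$ with $\mathbf{e}'\mathbf{e}/(n - K)$.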
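A small Monte Carlo experiment can illustrate Theorem 5.1. The design below is an assumption made for the example: the regressor is standard normal, so $\mathbf{Q} = \mathbf{I}$, and the disturbances are centered exponential draws with $\sigma^2 = 1$, which are far from normal. The sketch compares the simulated variance of $\sqrt{n}(b - \beta)$ for the slope with $\sigma^2[\mathbf{Q}^{-1}]_{22}$, and checks how often the usual ratio based on $s^2(\mathbf{X}'\mathbf{X})^{-1}$ exceeds 1.96 in absolute value.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 400, 2000
beta = np.array([1.0, 0.5])
sigma2 = 1.0                       # variance of an exponential(1) draw minus its mean

z = np.empty(reps)                 # sqrt(n) * (slope estimate - true slope)
t = np.empty(reps)                 # slope error divided by its estimated standard error
for r in range(reps):
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    y = X @ beta + rng.exponential(1.0, size=n) - 1.0
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    s2 = e @ e / (n - 2)           # estimate of sigma^2 with K = 2
    z[r] = np.sqrt(n) * (b[1] - beta[1])
    t[r] = (b[1] - beta[1]) / np.sqrt(s2 * XtX_inv[1, 1])

# With Q = E[x_i x_i'] = I, the slope element of sigma^2 Q^{-1} equals sigma2.
print("Monte Carlo variance of sqrt(n)(b - beta):", z.var())
print("asymptotic variance sigma^2 [Q^-1]_22:    ", sigma2)
print("share of |t| > 1.96:", np.mean(np.abs(t) > 1.96))
```

If the normal approximation is adequate at this sample size, the simulated variance should be close to 1 and the exceedance share close to 0.05.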


    If $\boldsymbol{\varepsilon}$ is normally distributed, then Result FS7 in Table 4.3 (Section 4.8) holds in every sample, so it holds asymptotically as well. The important implication of this derivation is that if the regressors are well behaved and observations are independent, then the asymptotic normality of the least squares estimator does not depend on normality of the disturbances; it is a consequence of the central limit theorem. We will consider other, more general cases in the sections to follow.

    5.2.3 CONSISTENCY OF $s^2$ AND THE ESTIMATOR OF Asy. Var[b]

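    To complete the derivation of the asymptotic properties of $\mathbf{b}$, we will require an estimator of Asy. Var$[\mathbf{b}] = (\sigma^2/n)\mathbf{Q}^{-1}$.⁵ With (5-1), it is sufficient to restrict attention to $s^2$, so the purpose here is to assess the consistency of $s^2$ as an estimator of $\sigma^2$. Expanding
    $$s^2 = \frac{1}{n - K}\boldsymbol{\varepsilon}'\mathbf{M}\boldsymbol{\varepsilon}$$
    produces
    $$s^2 = \frac{1}{n - K}\left[\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon} - \boldsymbol{\varepsilon}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}\right] = \frac{n}{n - K}\left[\frac{\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon}}{n} - \left(\frac{\boldsymbol{\varepsilon}'\mathbf{X}}{n}\right)\left(\frac{\mathbf{X}'\mathbf{X}}{n}\right)^{-1}\left(\frac{\mathbf{X}'\boldsymbol{\varepsilon}}{n}\right)\right].$$
    The leading constant clearly converges to 1. We can apply (5-1), (5-4) (twice), and the product rule for probability limits (Theorem D.14) to assert that the second term in the brackets converges to 0. That leaves
    $$\overline{\varepsilon^2} = \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2.$$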
    This is a narrow case in which the random variables $\varepsilon_i^2$ are independent with the same finite mean $\sigma^2$, so not much is required to get the mean to converge almost surely to $\sigma^2 = E[\varepsilon_i^2]$. By the Markov Theorem (D.8), what is needed is for $E[|\varepsilon_i^2|^{1+\delta}]$ to be finite, so the minimal assumption thus far is that $\varepsilon_i$ have finite moments up to slightly greater than 2. Indeed, if we further assume that every $\varepsilon_i$ has the same distribution, then by the Khinchine Theorem (D.5) or the Corollary to D.8, finite moments (of $\varepsilon_i$) up to 2 are sufficient. Mean square convergence would require $E[\varepsilon_i^4] = \phi_\varepsilon < \infty$. Then the terms in the sum are independent, with mean $\sigma^2$ and variance $\phi_\varepsilon - \sigma^4$. So, under fairly weak conditions, the first term in brackets converges in probability to $\sigma^2$, which gives our result,
    $$\operatorname{plim} s^2 = \sigma^2,$$
    and, by the product rule,
    $$\operatorname{plim} s^2(\mathbf{X}'\mathbf{X}/n)^{-1} = \sigma^2\mathbf{Q}^{-1}.$$
    The appropriate estimator of the asymptotic covariance matrix of $\mathbf{b}$ is
    $$\text{Est.Asy. Var}[\mathbf{b}] = s^2(\mathbf{X}'\mathbf{X})^{-1}.$$

    ⁵ See McCallum (1973) for some useful commentary on deriving the asymptotic covariance matrix of the least squares estimator.


    5.2.4 ASYMPTOTIC DISTRIBUTION OF A FUNCTION OF b: THE DELTA METHOD

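    We can extend Theorem D.22 to functions of the least squares estimator. Let $\mathbf{f}(\mathbf{b})$ be a set of J continuous, linear or nonlinear, and continuously differentiable functions of the least squares estimator, and let
    $$\mathbf{C}(\mathbf{b}) = \frac{\partial \mathbf{f}(\mathbf{b})}{\partial \mathbf{b}'},$$
    where $\mathbf{C}$ is the $J \times K$ matrix whose jth row is the vector of derivatives of the jth function with respect to $\mathbf{b}'$. By the Slutsky Theorem (D.12),
    $$\operatorname{plim} \mathbf{f}(\mathbf{b}) = \mathbf{f}(\boldsymbol{\beta})$$
    and
    $$\operatorname{plim} \mathbf{C}(\mathbf{b}) = \frac{\partial \mathbf{f}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}'} = \boldsymbol{\Gamma}.$$
    Using our usual linear Taylor series approach, we expand this set of functions in the approximation
    $$\mathbf{f}(\mathbf{b}) = \mathbf{f}(\boldsymbol{\beta}) + \boldsymbol{\Gamma} \times (\mathbf{b} - \boldsymbol{\beta}) + \text{higher-order terms}.$$
    The higher-order terms become negligible in large samples if $\operatorname{plim} \mathbf{b} = \boldsymbol{\beta}$. Then, the asymptotic distribution of the function on the left-hand side is the same as that on the right. Thus, the mean of the asymptotic distribution is $\operatorname{plim} \mathbf{f}(\mathbf{b}) = \mathbf{f}(\boldsymbol{\beta})$, and the asymptotic covariance matrix is $\boldsymbol{\Gamma}[\text{Asy. Var}(\mathbf{b} - \boldsymbol{\beta})]\boldsymbol{\Gamma}'$, which gives us the following theorem:

    THEOREM 5.2 Asymptotic Distribution of a Function of b
    If $\mathbf{f}(\mathbf{b})$ is a set of continuous and continuously differentiable functions of $\mathbf{b}$ such that $\boldsymbol{\Gamma} = \partial \mathbf{f}(\boldsymbol{\beta})/\partial \boldsymbol{\beta}'$ and if Theorem 5.1 holds, then
    $$\mathbf{f}(\mathbf{b}) \overset{a}{\sim} N\left[\mathbf{f}(\boldsymbol{\beta}), \boldsymbol{\Gamma}\left(\frac{\sigma^2}{n}\mathbf{Q}^{-1}\right)\boldsymbol{\Gamma}'\right]. \tag{5-16}$$
    In practice, the estimator of the asymptotic covariance matrix would be
    $$\text{Est.Asy. Var}[\mathbf{f}(\mathbf{b})] = \mathbf{C}[s^2(\mathbf{X}'\mathbf{X})^{-1}]\mathbf{C}'.$$

    If any of the functions are nonlinear, then the property of unbiasedness that holds for $\mathbf{b}$ may not carry over to $\mathbf{f}(\mathbf{b})$. Nonetheless, it follows from (5-4) that $\mathbf{f}(\mathbf{b})$ is a consistent estimator of $\mathbf{f}(\boldsymbol{\beta})$, and the asymptotic covariance matrix is readily available.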
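The estimator $\mathbf{C}[s^2(\mathbf{X}'\mathbf{X})^{-1}]\mathbf{C}'$ can be sketched for a single nonlinear function. The choice of function, the ratio of two slopes $b_1/b_2$, is a hypothetical example and is not taken from the text; the simulated data and the names `f` and `jacobian` are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, 4.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - X.shape[1])
V = s2 * XtX_inv                       # Est.Asy.Var[b] = s^2 (X'X)^{-1}

def f(b):
    """Hypothetical function of interest: the ratio of the two slopes."""
    return np.array([b[1] / b[2]])

def jacobian(b):
    """Analytic C(b) = df/db', a 1 x 3 matrix for the ratio b[1]/b[2]."""
    return np.array([[0.0, 1.0 / b[2], -b[1] / b[2] ** 2]])

C = jacobian(b)
var_f = C @ V @ C.T                    # Est.Asy.Var[f(b)] = C [s^2 (X'X)^{-1}] C'
se_f = np.sqrt(var_f[0, 0])
print("f(b) =", f(b)[0], "  asymptotic std. error =", se_f)
```

With these simulated coefficients the true ratio is 0.5. An analytic Jacobian is used here; a numerical derivative of `f` would serve equally well when the derivatives are tedious to write out.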
    5.2.5 ASYMPTOTIC EFFICIENCY

    We have not established any large-sample counterpart to the Gauss-Markov theorem. That is, it remains to establish whether the large-sample properties of the least squares estimator are optimal by any measure. The Gauss-Markov Theorem establishes finite-sample conditions under which least squares is optimal. The requirements that the estimator be linear and unbiased limit the theorem's generality, however. One of the main purposes of the analysis in this chapter is to broaden the class of estimators in the classical model to those which might be biased, but which are consistent. Ultimately, we shall also be interested in nonlinear estimators. These cases extend beyond the reach of the Gauss-Markov Theorem. To make any progress in this direction, we will require an alternative estimation criterion.

    DEFINITION 5.1 Asymptotic Efficiency
    An estimator is asymptotically efficient if it is consistent, asymptotically normally distributed, and has an asymptotic covariance matrix that is not larger than the asymptotic covariance matrix of any other consistent, asymptotically normally distributed estimator.

    In Chapter 17, we will show that if the disturbances are normally distributed, then the least squares estimator is also the maximum likelihood estimator. Maximum likelihood estimators are asymptotically efficient among consistent and asymptotically normally distributed estimators. This gives us a partial result, albeit a somewhat narrow one since to claim it, we must assume normally distributed disturbances. If some other distribution is specified for $\boldsymbol{\varepsilon}$ and it emerges that $\mathbf{b}$ is not the maximum likelihood estimator, then least squares may not be efficient.

    Example 5.1 The Gamma Regression Model

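    Greene (1980a) considers estimation in a regression model with an asymmetrically distributed disturbance,
    $$y = (\alpha - \sigma\sqrt{P}) + \mathbf{x}'\boldsymbol{\beta} - (\varepsilon - \sigma\sqrt{P}) = \alpha^* + \mathbf{x}'\boldsymbol{\beta} + \varepsilon^*,$$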

    where $\varepsilon$ has the gamma distribution in Section B.4.5 [see (B-39)] and $\sigma = \sqrt{P}/\lambda$ is the standard deviation of the disturbance. In this model, the covariance matrix of the least squares estimator of the slope coefficients (not including the constant term) is
    $$\text{Asy. Var}[\mathbf{b} \mid \mathbf{X}] = \sigma^2(\mathbf{X}'\mathbf{M}^0\mathbf{X})^{-1},$$
    whereas for the maximum likelihood estimator (which is not the least squares estimator),
    $$\text{Asy. Var}[\hat{\boldsymbol{\beta}}_{ML}] \approx [1 - (2/P)]\sigma^2(\mathbf{X}'\mathbf{M}^0\mathbf{X})^{-1}.⁶$$
    But for the asymmetry parameter, this result would be the same as for the least squares estimator. We conclude that the estimator that accounts for the asymmetric disturbance distribution is more efficient asymptotically.

    ⁶ The matrix $\mathbf{M}^0$ produces data in the form of deviations from sample means. (See Section A.2.8.) In Greene's model, P must be greater than 2.


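    5.3 MORE GENERAL CASES

    The asymptotic properties of the estimators in the classical regression model were established in Section 5.2 under the following assumptions:

    A1. Linearity: $y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{iK}\beta_K + \varepsilon_i$.
    A2. Full rank: The $n \times K$ sample data matrix $\mathbf{X}$ has full column rank.
    A3. Exogeneity of the independent variables: $E[\varepsilon_i \mid x_{j1}, x_{j2}, \ldots, x_{jK}] = 0$, $i, j = 1, \ldots, n$.
    A4. Homoscedasticity and nonautocorrelation.
    A5. Data generating mechanism: independent observations.

    The following are the crucial results needed. For consistency of $\mathbf{b}$, we need (5-1) and (5-4),
    $$\operatorname{plim}(1/n)\mathbf{X}'\mathbf{X} = \operatorname{plim} \bar{\mathbf{Q}}_n = \mathbf{Q}, \quad \text{a positive definite matrix,}$$
    $$\operatorname{plim}(1/n)\mathbf{X}'\boldsymbol{\varepsilon} = \operatorname{plim} \bar{\mathbf{w}}_n = E[\bar{\mathbf{w}}_n] = \mathbf{0}.$$
    (For consistency of $s^2$, we added a fairly weak assumption about the moments of the disturbances.) To establish asymptotic normality, we will require consistency and (5-12), which is
    $$\sqrt{n}\,\bar{\mathbf{w}}_n \xrightarrow{d} N[\mathbf{0}, \sigma^2\mathbf{Q}].$$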

    With these in place, the desired characteristics are then established by the methods of Section 5.2. To analyze other cases, we can merely focus on these three results. It is not necessary to reestablish the consistency or asymptotic normality themselves, since they follow as a consequence.

    5.3.1 HETEROGENEITY IN THE DISTRIBUTIONS OF $\mathbf{x}_i$

    Exceptions to the assumptions made above are likely to arise in two settings. In a panel data set, the sample will consist of multiple observations on each of many observational units. For example, a study might consist of a set of observations made at different points in time on a large number of families. In this case, the xs will surely be correlated across observations, at least within observational units. They might even be the same for all the observations on a single family. They are also likely to be a mixture of random variables, such as family income, and nonstochastic regressors, such as a fixed "family effect" represented by a dummy variable. The second case would be a time-series model in which lagged values of the dependent variable appear on the right-hand side of the model.

    The panel data set could be treated as follows. Assume for the moment that the data consist of a fixed number of observations, say T, on a set of N families, so that the total number of rows in $\mathbf{X}$ is $n = NT$. The matrix
    $$\bar{\mathbf{Q}}_n = \frac{1}{n}\sum_{i=1}^{n}\mathbf{Q}_i,$$
    in which n is all the observations in the sample, could be viewed as
    $$\bar{\mathbf{Q}}_n = \frac{1}{N}\sum_{i}\frac{1}{T}\sum_{\substack{\text{observations} \\ \text{for family } i}}\mathbf{Q}_{ij} = \frac{1}{N}\sum_{i=1}^{N}\bar{\mathbf{Q}}_i,$$
    where $\bar{\mathbf{Q}}_i$ is the average of the $\mathbf{Q}_{ij}$ for family i. We might then view the set of observations on the ith unit as if they were a single observation and apply our convergence arguments to the number of families increasing without bound. The point is that the conditions that are needed to establish convergence will apply with respect to the number of observational units. The number of observations taken for each observation unit might be fixed and could be quite small.
    5.3.2 DEPENDENT OBSERVATIONS

    The second difficult case arises when there are lagged dependent variables among the variables on the right-hand side or, more generally, in time series settings in which the observations are no longer independent or even uncorrelated. Suppose that the model may be written
    $$y_t = \mathbf{z}_t'\boldsymbol{\theta} + \gamma_1 y_{t-1} + \cdots + \gamma_p y_{t-p} + \varepsilon_t. \tag{5-17}$$
    (Since this model is a time-series setting, we use t instead of i to index the observations.) We continue to assume that the disturbances are uncorrelated across observations. Since $y_{t-1}$ is dependent on $y_{t-2}$ and so on, it is clear that although the disturbances are uncorrelated across observations, the regressor vectors, including the lagged ys, surely are not. Also, although $\mathrm{Cov}[\mathbf{x}_t, \varepsilon_s] = \mathbf{0}$ if $s \geq t$, where $\mathbf{x}_t = [\mathbf{z}_t, y_{t-1}, \ldots, y_{t-p}]$, $\mathrm{Cov}[\mathbf{x}_t, \varepsilon_s] \neq \mathbf{0}$ if $s < t$. Every observation $y_t$ is determined by the entire history of the disturbances. Therefore, we have lost the crucial assumption $E[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \mathbf{0}$; $E[\varepsilon_t \mid \text{future } \mathbf{x}\text{s}]$ is not equal to 0. The conditions needed for the finite-sample results we had earlier no longer hold. Without Assumption A3, $E[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \mathbf{0}$, our earlier proof of unbiasedness dissolves, and without unbiasedness, the Gauss-Markov theorem no longer applies. We are left with only asymptotic results for this case.

    This case is considerably more general than the ones we have considered thus far. The theorems we invoked previously do not apply when the observations in the sums are correlated. To establish counterparts to the limiting normal distribution of $(1/\sqrt{n})\mathbf{X}'\boldsymbol{\varepsilon}$ and convergence of $(1/n)\mathbf{X}'\mathbf{X}$ to a finite positive definite matrix, it is necessary to make additional assumptions about the regressors. For the disturbances, we replace Assumption A3 with the following:

    AD3. $E[\varepsilon_t \mid \mathbf{x}_{t-s}] = 0$, for all $s \geq 0$.

    This assumption states that the disturbance in the period t is an innovation; it is new information that enters the process. Thus, it is not correlated with any of the history. It is not uncorrelated with future data, however, since $\varepsilon_t$ will be a part of $\mathbf{x}_{t+r}$. Assumptions A1, A2, and A4 are retained (at least for the present). We will also replace Assumption A5 and result (5-1) with two assumptions about the right-hand variables.


    First,
    $$\operatorname{plim} \frac{1}{T - s}\sum_{t=s+1}^{T}\mathbf{x}_t\mathbf{x}_{t-s}' = \mathbf{Q}(s), \quad \text{a finite matrix, } s \geq 0, \tag{5-18}$$
    and $\mathbf{Q}(0)$ is nonsingular if $T \geq K$. [Note that $\mathbf{Q} = \mathbf{Q}(0)$.] This matrix is the sums of cross products of the elements of $\mathbf{x}_t$ with lagged values of $\mathbf{x}_t$. Second, we assume that the roots of the polynomial
    $$1 - \gamma_1 z - \gamma_2 z^2 - \cdots - \gamma_p z^p = 0 \tag{5-19}$$
    are all outside the unit circle. (See Section 20.2 for further details.) Heuristically, these assumptions imply that the dependence between values of the xs at different points in time varies only with how far apart in time they are, not specifically with the points in time at which observations are made, and that the correlation between observations made at different points in time fades sufficiently rapidly that sample moments such as $\mathbf{Q}(s)$ above will converge in probability to a population counterpart.⁷ Formally, we obtain these results with

    AD5. The series on $\mathbf{x}_t$ is stationary and ergodic.

    This assumption also implies that $\mathbf{Q}(s)$ becomes a matrix of zeros as s (the separation in time) becomes large. These conditions are sufficient to produce $(1/n)\mathbf{X}'\boldsymbol{\varepsilon} \rightarrow \mathbf{0}$ and the consistency of $\mathbf{b}$. Further results are needed to establish the asymptotic normality of the estimator, however.⁸

    In sum, the important properties of consistency and asymptotic normality of the least squares estimator are preserved under the different assumptions of stochastic regressors, provided that additional assumptions are made. In most cases, these assumptions are quite benign, so we conclude that the two asymptotic properties of least squares considered here, consistency and asymptotic normality, are quite robust to different specifications of the regressors.
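The root condition in (5-19) is easy to check numerically, and the consistency claim can be illustrated with a first-order autoregression. The sketch below is illustrative only: the function names `roots_outside_unit_circle` and `ar1_ols`, the coefficient value 0.5, and the sample sizes are assumptions, and the simulated model omits $\mathbf{z}_t$ and the constant for brevity.

```python
import numpy as np

def roots_outside_unit_circle(gammas):
    """True if all roots of 1 - g1 z - ... - gp z^p = 0 have modulus greater than 1."""
    # np.roots expects coefficients ordered from the highest power down to the constant.
    coeffs = np.r_[-np.asarray(gammas, dtype=float)[::-1], 1.0]
    return bool(np.all(np.abs(np.roots(coeffs)) > 1.0))

def ar1_ols(gamma, T, rng):
    """Simulate y_t = gamma * y_{t-1} + eps_t and regress y_t on y_{t-1} (no constant)."""
    eps = rng.normal(size=T + 1)
    y = np.empty(T + 1)
    y[0] = eps[0]
    for t in range(1, T + 1):
        y[t] = gamma * y[t - 1] + eps[t]
    return (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])

rng = np.random.default_rng(5)
print(roots_outside_unit_circle([0.5]))        # a stationary first-order case
print(roots_outside_unit_circle([0.5, 0.6]))   # one root falls inside the unit circle
for T in (25, 400, 50_000):
    print(T, ar1_ols(0.5, T, rng))
```

The least squares estimate is generally biased in short series, since $E[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \mathbf{0}$ fails, but it settles near the true coefficient as T grows when the root condition holds.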

    5.4 INSTRUMENTAL VARIABLE AND TWO-STAGE LEAST SQUARES ESTIMATION

    The assumption that $\mathbf{x}_i$ and $\varepsilon_i$ are uncorrelated has been crucial in the development thus far. But there are any number of applications in economics in which this assumption is untenable. Examples include models that contain variables that are measured with error and most dynamic models involving expectations. Without this assumption, none of the proofs of consistency given above will hold up, so least squares loses its attractiveness as an estimator. There is an alternative method of estimation called the method of instrumental variables (IV). The least squares estimator is a special case, but the IV method is far more general. The method of instrumental variables is developed around the following general extension of the estimation strategy in the classical regression model: Suppose that in the classical model $y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i$, the K variables $\mathbf{x}_i$ may be correlated with $\varepsilon_i$. Suppose as well that there exists a set of L variables $\mathbf{z}_i$, where L is at least as large as K, such that $\mathbf{z}_i$ is correlated with $\mathbf{x}_i$ but not with $\varepsilon_i$. We cannot estimate $\boldsymbol{\beta}$ consistently by using the familiar least squares estimator. But we can construct a consistent estimator of $\boldsymbol{\beta}$ by using the assumed relationships among $\mathbf{z}_i$, $\mathbf{x}_i$, and $\varepsilon_i$.

    ⁷ We will examine some cases in later chapters in which this does not occur. To consider a simple example, suppose that $\mathbf{x}$ contains a constant. Then the assumption requires sample means to converge to population parameters. Suppose that all observations are correlated. Then the variance of $\bar{x}$ is $\mathrm{Var}[(1/T)\sum_t x_t] = (1/T^2)\sum_t\sum_s \mathrm{Cov}[x_t, x_s]$. Since none of the $T^2$ terms is assumed to be zero, there is no assurance that the double sum converges to zero as $T \rightarrow \infty$. But if the correlations diminish sufficiently with distance in time, then the sum may converge to zero.

    ⁸ These appear in Mann and Wald (1943), Billingsley (1979), and Dhrymes (1998).
    Example 5.2 Models in Which Least Squares is Inconsistent

    The following models will appear at various points in this book. In general, least squares will not be a suitable estimator.

    Dynamic Panel Data Model. In Example 13.6 and Section 18.5, we will examine a model for municipal expenditure of the form $S_{it} = f(S_{i,t-1}, \ldots) + \varepsilon_{it}$. The disturbances are assumed to be freely correlated across periods, so both $S_{i,t-1}$ and $\varepsilon_{it}$ are correlated with $\varepsilon_{i,t-1}$. It follows that they are correlated with each other, which means that this model, even with a linear specification, does not satisfy the assumptions of the classical model. The regressors and disturbances are correlated.

    Dynamic Regression. In Chapters 19 and 20, we will examine a variety of time series models which are of the form $y_t = f(y_{t-1}, \ldots) + \varepsilon_t$ in which $\varepsilon_t$ is (auto)correlated with its past values. This case is essentially the same as the one we just considered. Since the disturbances are autocorrelated, it follows that the dynamic regression implies correlation between the disturbance and a right-hand-side variable. Once again, least squares will be inconsistent.

    Consumption Function. We (and many other authors) have used a macroeconomic version of the consumption function at various points to illustrate least squares estimation of the classical regression model. But, by construction, the model violates the assumptions of the classical regression model. The national income data are assembled around some basic accounting identities, including "Y = C + investment + government spending + net exports." Therefore, although the precise relationship between consumption C and income Y, $C = f(Y, \varepsilon)$, is ambiguous and is a suitable candidate for modeling, it is clear that consumption (and therefore $\varepsilon$) is one of the main determinants of Y. The model $C_t = \alpha + \beta Y_t + \varepsilon_t$ does not fit our assumptions for the classical model if $\operatorname{Cov}[Y_t, \varepsilon_t] \neq 0$. But it is reasonable to assume (at least for now) that $\varepsilon_t$ is uncorrelated with past values of C and Y. Therefore, in this model, we might consider $Y_{t-1}$ and $C_{t-1}$ as suitable instrumental variables.

    Measurement Error. In Section 5.6, we will examine an application in which an earnings equation $y_{i,t} = f(\text{Education}_{i,t}, \ldots) + \varepsilon_{i,t}$ is specified for sibling pairs (twins) $t = 1, 2$ for n individuals. Since education is a variable that is measured with error, it will emerge (in a way that will be established below) that this is, once again, a case in which the disturbance and an independent variable are correlated.

    None of these models can be consistently estimated by least squares; the method of instrumental variables is the standard approach.

    We will now construct an estimator for $\beta$ in this extended model. We will maintain assumption A5 (independent observations with finite moments), though this is only for convenience. These results can all be extended to cases with dependent observations. This will preserve the important result that $\operatorname{plim}(X'X/n) = Q_{xx}$. (We use the subscript to differentiate this result from the results given below.) The basic assumptions of the regression model have changed, however. First, A3 (no correlation between x and $\varepsilon$) is, under our new assumptions,

    AI3. $E[\varepsilon_i \mid x_i] = \eta_i$.


    We interpret Assumption AI3 to mean that the regressors now provide information about the expectations of the disturbances. The important implication of AI3 is that the disturbances and the regressors are now correlated. Assumption AI3 implies that $E[x_i \varepsilon_i] = \gamma$ for some nonzero $\gamma$. If the data are "well behaved," then we can apply Theorem D.5 (Khinchine's theorem) to assert that $\operatorname{plim}(1/n)X'\varepsilon = \gamma$. Notice that the original model results if $\eta_i = 0$. Finally, we must characterize the instrumental variables. We assume the following:

    AI7. $[x_i, z_i, \varepsilon_i]$, $i = 1, \ldots, n$, are an i.i.d. sequence of random variables.
    AI8a. $E[x_{ik}^2] = Q_{xx,kk} < \infty$, a finite constant, $k = 1, \ldots, K$.
    AI8b. $E[z_{il}^2] = Q_{zz,ll} < \infty$, a finite constant, $l = 1, \ldots, L$.
    AI8c. $E[z_{il} x_{ik}] = Q_{zx,lk} < \infty$, a finite constant, $l = 1, \ldots, L$, $k = 1, \ldots, K$.
    AI9. $E[\varepsilon_i \mid z_i] = 0$.

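    In later work in time series models, it will be important to relax assumption AI7. Finite means of $z_l$ follows from AI8b. Using the same analysis as in the preceding section, we have

    $\operatorname{plim}(1/n)Z'Z = Q_{zz}$, a finite, positive definite (assumed) matrix,
    $\operatorname{plim}(1/n)Z'X = Q_{zx}$, a finite, $L \times K$ matrix with rank K (assumed),
    $\operatorname{plim}(1/n)Z'\varepsilon = 0$.

    In our statement of the classical regression model, we have assumed thus far the special case of $\eta_i = 0$; $\gamma = 0$ follows. There is no need to dispense with Assumption AI7 (it may well continue to be true), but in this special case it becomes irrelevant.

    For this more general model, we lose most of the useful results we had for least squares. The estimator b is no longer unbiased:

    $$E[b \mid X] = \beta + (X'X)^{-1}X'\eta \neq \beta,$$

    so the Gauss-Markov theorem no longer holds. It is also inconsistent:

    $$\operatorname{plim} b = \beta + \operatorname{plim}\left(\frac{X'X}{n}\right)^{-1} \operatorname{plim}\left(\frac{X'\varepsilon}{n}\right) = \beta + Q_{xx}^{-1}\gamma \neq \beta.$$

    (The asymptotic distribution is considered in the exercises.) We now turn to the instrumental variable estimator. Since $E[z_i \varepsilon_i] = 0$ and all terms have finite variances, we can state that

    $$\operatorname{plim}\left(\frac{Z'y}{n}\right) = \left[\operatorname{plim}\left(\frac{Z'X}{n}\right)\right]\beta + \operatorname{plim}\left(\frac{Z'\varepsilon}{n}\right) = \left[\operatorname{plim}\left(\frac{Z'X}{n}\right)\right]\beta.$$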


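    Suppose that Z has the same number of variables as X. For example, suppose in our consumption function that $x_t = [1, Y_t]$ when $z_t = [1, Y_{t-1}]$. We have assumed that the rank of $Z'X$ is K, so now $Z'X$ is a square matrix. It follows that

    $$\left[\operatorname{plim}\left(\frac{Z'X}{n}\right)\right]^{-1} \operatorname{plim}\left(\frac{Z'y}{n}\right) = \beta,$$

    which leads us to the instrumental variable estimator,

    $$b_{IV} = (Z'X)^{-1}Z'y.$$

    We have already proved that $b_{IV}$ is consistent. We now turn to the asymptotic distribution. We will use the same method as in the previous section. First,

    $$\sqrt{n}(b_{IV} - \beta) = \left(\frac{Z'X}{n}\right)^{-1} \frac{1}{\sqrt{n}} Z'\varepsilon,$$

    which has the same limiting distribution as $Q_{zx}^{-1}[(1/\sqrt{n})Z'\varepsilon]$. Our analysis of $(1/\sqrt{n})Z'\varepsilon$ is the same as that of $(1/\sqrt{n})X'\varepsilon$ in the previous section, so it follows that

    $$\frac{1}{\sqrt{n}} Z'\varepsilon \xrightarrow{d} N[0, \sigma^2 Q_{zz}]$$

    and

    $$\left(\frac{Z'X}{n}\right)^{-1} \frac{1}{\sqrt{n}} Z'\varepsilon \xrightarrow{d} N[0, \sigma^2 Q_{zx}^{-1} Q_{zz} Q_{xz}^{-1}].$$

    This step completes the derivation for the next theorem.

    THEOREM 5.3 Asymptotic Distribution of the Instrumental Variables Estimator
    If Assumptions A1, A2, AI3, A4, AS5, AS5a, AI7, AI8a-c and AI9 all hold for $[y_i, x_i, z_i, \varepsilon_i]$, where z is a valid set of $L = K$ instrumental variables, then the asymptotic distribution of the instrumental variables estimator $b_{IV} = (Z'X)^{-1}Z'y$ is

    $$b_{IV} \overset{a}{\sim} N\left[\beta, \frac{\sigma^2}{n} Q_{zx}^{-1} Q_{zz} Q_{xz}^{-1}\right], \tag{5-20}$$

    where $Q_{zx} = \operatorname{plim}(Z'X/n)$ and $Q_{zz} = \operatorname{plim}(Z'Z/n)$.

    To estimate the asymptotic covariance matrix, we will require an estimator of $\sigma^2$. The natural estimator is

    $$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i' b_{IV})^2.$$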
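The contrast between the inconsistent least squares estimator and $b_{IV} = (Z'X)^{-1}Z'y$ can be illustrated numerically. The following is a minimal sketch, not taken from the text: the data-generating process, the parameter values, and the function names `iv_estimator` and `ols_estimator` are assumptions chosen purely for illustration.

```python
import numpy as np

def ols_estimator(y, X):
    """Least squares: b = (X'X)^{-1} X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def iv_estimator(y, X, Z):
    """Simple IV estimator for the case L = K: b_IV = (Z'X)^{-1} Z'y."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)

# Hypothetical design (an assumption, not from the text):
# x is correlated with the disturbance eps; z is correlated with x but not eps.
rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)
eps = rng.normal(size=n)
x = 0.8 * z + 0.5 * eps + rng.normal(size=n)  # Cov[x, eps] = 0.5, so OLS is inconsistent
y = 1.0 + 2.0 * x + eps                       # true beta = (1, 2)

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])          # the constant serves as its own instrument

b_ols = ols_estimator(y, X)
b_iv = iv_estimator(y, X, Z)

# Natural estimator of sigma^2 based on the IV residuals
resid = y - X @ b_iv
sigma2_hat = resid @ resid / n
```

Under this design the least squares slope should settle near $2 + 0.5/1.89$ rather than 2, while the IV slope should be close to 2 in a sample of this size.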


    A correction for degrees of freedom, as in the development in the previous section, is superfluous, as all results here are asymptotic, and $\hat{\sigma}^2$ would not be unbiased in any event. Nonetheless, it is standard practice in most software to make the degrees of freedom correction. Write the vector of residuals as

    $$y - Xb_{IV} = y - X(Z'X)^{-1}Z'y.$$

    Substitute $y = X\beta + \varepsilon$ and collect terms to obtain $\hat{\varepsilon} = [I - X(Z'X)^{-1}Z']\varepsilon$. Now,

    $$\hat{\sigma}^2 = \frac{\hat{\varepsilon}'\hat{\varepsilon}}{n} = \frac{\varepsilon'\varepsilon}{n} + \left(\frac{\varepsilon'Z}{n}\right)\left(\frac{X'Z}{n}\right)^{-1}\left(\frac{X'X}{n}\right)\left(\frac{Z'X}{n}\right)^{-1}\left(\frac{Z'\varepsilon}{n}\right) - 2\left(\frac{\varepsilon'X}{n}\right)\left(\frac{Z'X}{n}\right)^{-1}\left(\frac{Z'\varepsilon}{n}\right).$$

    We found earlier that we could (after a bit of manipulation) apply the product result for probability limits to obtain the probability limit of an expression such as this. Without repeating the derivation, we find that $\hat{\sigma}^2$ is a consistent estimator of $\sigma^2$, by virtue of the first term. The second and third product terms converge to zero. To complete the derivation, then, we will estimate Asy. Var$[b_{IV}]$ with

    $$\text{Est.Asy. Var}[b_{IV}] = \frac{1}{n}\left\{\left(\frac{\hat{\varepsilon}'\hat{\varepsilon}}{n}\right)\left(\frac{Z'X}{n}\right)^{-1}\left(\frac{Z'Z}{n}\right)\left(\frac{X'Z}{n}\right)^{-1}\right\} = \hat{\sigma}^2 (Z'X)^{-1}(Z'Z)(X'Z)^{-1}. \tag{5-21}$$

    There is a remaining detail. If Z contains more variables than X, then much of the preceding is unusable, because $Z'X$ will be $L \times K$ with rank $K < L$ and will thus not have an inverse. The crucial result in all the preceding is $\operatorname{plim}(Z'\varepsilon/n) = 0$. That is, every column of Z is asymptotically uncorrelated with $\varepsilon$. That also means that every linear combination of the columns of Z is also uncorrelated with $\varepsilon$, which suggests that one approach would be to choose K linear combinations of the columns of Z. Which to choose? One obvious possibility is simply to choose K variables among the L in Z. But intuition correctly suggests that throwing away the information contained in the remaining $L - K$ columns is inefficient. A better choice is the projection of the columns of X in the column space of Z:

    $$\hat{X} = Z(Z'Z)^{-1}Z'X.$$

    We will return shortly to the virtues of this choice. With this choice of instrumental variables, $\hat{X}$ for Z, we have

    $$b_{IV} = (\hat{X}'X)^{-1}\hat{X}'y = [X'Z(Z'Z)^{-1}Z'X]^{-1} X'Z(Z'Z)^{-1}Z'y. \tag{5-22}$$

    By substituting $\hat{X}$ in the expression for Est.Asy. Var$[b_{IV}]$ and multiplying it out, we see that the expression is unchanged. The proofs of consistency and asymptotic normality for this estimator are exactly the same as before, because our proof was generic for any valid set of instruments, and $\hat{X}$ qualifies.
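The case $L > K$ can be sketched in a few lines. This is an illustrative sketch under assumed names and simulated data (the function `two_stage_least_squares` and the design below are not from the text). It uses $\hat{X} = Z(Z'Z)^{-1}Z'X$ as the instrument set and relies on the identity $\hat{X}'X = \hat{X}'\hat{X}$, under which the estimated covariance matrix with $\hat{X}$ in place of Z reduces to $\hat{\sigma}^2(\hat{X}'\hat{X})^{-1}$.

```python
import numpy as np

def two_stage_least_squares(y, X, Z):
    """Return (b_iv, est_asy_var) using X_hat = Z (Z'Z)^{-1} Z'X as instruments."""
    # Projection of the columns of X onto the column space of Z
    X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    # b_IV = (X_hat'X)^{-1} X_hat'y
    b_iv = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)
    # sigma^2 is estimated from residuals computed with the ORIGINAL X
    resid = y - X @ b_iv
    sigma2_hat = resid @ resid / len(y)
    # With X_hat as the instruments, sigma^2 (X_hat'X)^{-1} (X_hat'X_hat) (X'X_hat)^{-1}
    # simplifies to sigma^2 (X_hat'X_hat)^{-1} because X_hat'X = X_hat'X_hat.
    est_asy_var = sigma2_hat * np.linalg.inv(X_hat.T @ X_hat)
    return b_iv, est_asy_var

# Hypothetical design: one endogenous regressor, two outside instruments (L = 3, K = 2)
rng = np.random.default_rng(0)
n = 50_000
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
eps = rng.normal(size=n)
x = 0.6 * z1 + 0.4 * z2 + 0.5 * eps + rng.normal(size=n)
y = 1.0 + 2.0 * x + eps                      # true beta = (1, 2)

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z1, z2])

b_iv, est_asy_var = two_stage_least_squares(y, X, Z)
std_errors = np.sqrt(np.diag(est_asy_var))
```

No degrees of freedom correction is applied to `sigma2_hat` here; software that divides by $n - K$ instead will report slightly larger standard errors.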


    There are two reasons for using this estimator, one practical, one theoretical. If any column of X also appears in Z, then that column of X is reproduced exactly in $\hat{X}$. This is easy to show. In the expression for $\hat{X}$, if the kth column in X is one of the columns in Z, say the lth, then the kth column in $(Z'Z)^{-1}Z'X$ will be the lth column of an $L \times L$ identity matrix. This result means that the kth column in $\hat{X} = Z(Z'Z)^{-1}Z'X$ will be the lth column in Z, which is the kth column in X. This result is important and useful. Consider what is probably the typical application. Suppose that the regression contains K variables, only one of which, say the kth, is correlated with the disturbances. We have one or more instrumental variables in hand, as well as the other $K - 1$ variables that certainly qualify as instrumental variables in their own right. Then what we would use is $Z = [X_{(k)}, z_1, z_2, \ldots]$, where we indicate omission of the kth variable by (k) in the subscript.

    Another useful interpretation of $\hat{X}$ is that each column is the set of fitted values when the corresponding column of X is regressed on all the columns of Z, which is obvious from the definition. It also makes clear why each $x_k$ that appears in Z is perfectly replicated. Every $x_k$ provides a perfect predictor for itself, without any help from the remaining variables in Z. In the example, then, every column of X except the one that is omitted from $X_{(k)}$ is replicated exactly, whereas the one that is omitted is replaced in $\hat{X}$ by the predicted values in the regression of this variable on all the zs.

    Of all the different linear combinations of Z that we might choose, $\hat{X}$ is the most efficient in the sense that the asymptotic covariance matrix of an IV estimator based on a linear combination ZF is smaller when $F = (Z'Z)^{-1}Z'X$ than with any other F that uses all L columns of Z; a fortiori, this result eliminates linear combinations obtained by dropping any columns of Z. This important result was proved in a seminal paper by Brundy and Jorgenson (1971).

    We close this section with some practical considerations in the use of the instrumental variables estimator. By just multiplying out the matrices in the expression, you can show that

    $$b_{IV} = (\hat{X}'X)^{-1}\hat{X}'y = (X'(I - M_Z)X)^{-1}X'(I - M_Z)y = (\hat{X}'\hat{X})^{-1}\hat{X}'y$$

    since $I - M_Z$ is idempotent. Thus, when (and only when) $\hat{X}$ is the set of instruments, the IV estimator is computed by least squares regression of y on $\hat{X}$. This conclusion suggests (only logically; one need not actually do this in two steps) that $b_{IV}$ can be computed in two steps, first by computing $\hat{X}$, then by the least squares regression. For this reason, this is called the two-stage least squares (2SLS) estimator. We will revisit this form of estimator at great length at several points below, particularly in our discussion of simultaneous equations models, under the rubric of "two-stage least squares." One should be careful of this approach, however, in the computation of the asymptotic covariance matrix; $\hat{\sigma}^2$ should not be based on $\hat{X}$. The estimator

    $$s_{IV}^2 = \frac{(y - \hat{X}b_{IV})'(y - \hat{X}b_{IV})}{n}$$

    is inconsistent for $\sigma^2$, with or without a correction for degrees of freedom.

    An obvious question is where one is likely to find a suitable set of instrumental variables. In many time-series settings, lagged values of the variables in the model


    provide natural candidates. In other cases, the answer is less than obvious. The asymptotic variance matrix of the IV estimator can be rather large if Z is not highly correlated with X; the elements of $(Z'X)^{-1}$ grow large. Unfortunately, there usually is not much choice in the selection of instrumental variables. The choice of Z is often ad hoc.⁹ There is a bit of a dilemma in this result. It would seem to suggest that the best choices of instruments are variables that are highly correlated with X. But the more highly correlated a variable is with the problematic columns of X, the less defensible the claim that these same variables are uncorrelated with the disturbances.
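The two practical points made in this section, that regressing y on $\hat{X}$ reproduces $b_{IV}$ and that the second-stage residuals must not be used to estimate $\sigma^2$, can be checked with a small simulation. The design below is a hypothetical one chosen for illustration (true $\sigma^2 = 1$); none of the variable names or numbers come from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
z1, z2, eps = rng.normal(size=(3, n))        # three independent standard normal series
x = 0.5 * z1 + 0.5 * z2 + 0.8 * eps + rng.normal(size=n)
y = 1.0 + 2.0 * x + eps                      # true beta = (1, 2), true sigma^2 = 1

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z1, z2])

# Stage 1: fitted values from regressing each column of X on all columns of Z
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]

# Stage 2: least squares regression of y on X_hat
b_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]

# Direct IV formula with X_hat as the instruments
b_iv = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

# Residual variance based on the original X (appropriate)
sigma2_from_X = np.sum((y - X @ b_2sls) ** 2) / n
# Residual variance based on X_hat (what a naive second-stage regression reports)
sigma2_from_X_hat = np.sum((y - X_hat @ b_2sls) ** 2) / n
```

In this design the constant column of X is reproduced exactly in `X_hat`, the two coefficient vectors agree to rounding error, and `sigma2_from_X_hat` should come out far above the true value of 1 because it absorbs the first-stage prediction error.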

    ⁹ Results on "optimal instruments" appear in White (2001) and Hansen (1982). In the other direction, there is a contemporary literature on "weak" instruments, such as Staiger and Stock (1997).

    5.5 HAUSMAN'S SPECIFICATION TEST AND AN APPLICATION TO INSTRUMENTAL VARIABLE ESTIMATION

    It might not be obvious that the regressors in the model are correlated with the disturbances or that the regressors are measured with error. If not, there would be some benefit to using the least squares estimator rather than the IV estimator. Consider a comparison of the two covariance matrices under the hypothesis that both are consistent, that is, assuming $\operatorname{plim}(1/n)X'\varepsilon = 0$. The difference between the asymptotic covariance matrices of the two estimators is

    $$\text{Asy. Var}[b_{IV}] - \text{Asy. Var}[b_{LS}] = \frac{\sigma^2}{n}\operatorname{plim}\left(\frac{X'Z(Z'Z)^{-1}Z'X}{n}\right)^{-1} - \frac{\sigma^2}{n}\operatorname{plim}\left(\frac{X'X}{n}\right)^{-1} = \frac{\sigma^2}{n}\operatorname{plim}\, n\left[(X'Z(Z'Z)^{-1}Z'X)^{-1} - (X'X)^{-1}\right].$$

    To compare the two matrices in the brackets, we can compare their inverses. The inverse of the first is $X'Z(Z'Z)^{-1}Z'X = X'(I - M_Z)X = X'X - X'M_Z X$. Since $M_Z$ is a nonnegative definite matrix, it follows that $X'M_Z X$ is also. So, $X'Z(Z'Z)^{-1}Z'X$ equals $X'X$ minus a nonnegative definite matrix. Since $X'Z(Z'Z)^{-1}Z'X$ is smaller, in the matrix sense, than $X'X$, its inverse is larger. Under the hypothesis, the asymptotic covariance matrix of the LS estimator is never larger than that of the IV estimator, and it will actually be smaller unless all the columns of X are perfectly predicted by regressions on Z. Thus, we have established that if $\operatorname{plim}(1/n)X'\varepsilon = 0$, that is, if LS is consistent, then it is a preferred estimator. (Of course, we knew that from all our earlier results on the virtues of least squares.)

    Our interest in the difference between these two estimators goes beyond the question of efficiency. The null hypothesis of interest will usually be specifically whether $\operatorname{plim}(1/n)X'\varepsilon = 0$. Seeking the covariance between X and $\varepsilon$ through $(1/n)X'e$ is fruitless, of course, since the normal equations produce $(1/n)X'e = 0$. In a seminal paper, Hausman (1978) suggested an alternative testing strategy. [Earlier work by Wu (1973) and Durbin (1954) produced what turns out to be the same test.] The logic of Hausman's approach is as follows. Under the null hypothesis, we have two consistent estimators of $\beta$,


    $b_{LS}$ and $b_{IV}$. Under the alternative hypothesis, only one of these, $b_{IV}$, is consistent. The suggestion, then, is to examine $d = b_{IV} - b_{LS}$. Under the null hypothesis, $\operatorname{plim} d = 0$, whereas under the alternative, $\operatorname{plim} d \neq 0$. Using a strategy we have used at various points before, we might test this hypothesis with a Wald statistic,

    $$H = d'\left\{\text{Est.Asy. Var}[d]\right\}^{-1} d.$$

    The asymptotic covariance matrix we need for the test is

    $$\text{Asy. Var}[b_{IV} - b_{LS}] = \text{Asy. Var}[b_{IV}] + \text{Asy. Var}[b_{LS}] - \text{Asy. Cov}[b_{IV}, b_{LS}] - \text{Asy. Cov}[b_{LS}, b_{IV}].$$

    At this point, the test is straightforward, save for the considerable complication that we do not have an expression for the covariance term. Hausman gives a fundamental result that allows us to proceed. Paraphrased slightly, the covariance between an efficient estimator, $b_E$, of a parameter vector, $\beta$, and its difference from an inefficient estimator, $b_I$, of the same parameter vector, $b_E - b_I$, is zero. For our case, $b_E$ is $b_{LS}$ and $b_I$ is $b_{IV}$. By Hausman's result we have

    $$\operatorname{Cov}[b_E, b_E - b_I] = \operatorname{Var}[b_E] - \operatorname{Cov}[b_E, b_I] = 0$$

    or

    $$\operatorname{Cov}[b_E, b_I] = \operatorname{Var}[b_E],$$

    so,

    $$\text{Asy. Var}[b_{IV} - b_{LS}] = \text{Asy. Var}[b_{IV}] - \text{Asy. Var}[b_{LS}].$$

    Inserting this useful result into our Wald statistic and reverting to our empirical estimates of these quantities, we have

    $$H = (b_{IV} - b_{LS})'\left\{\text{Est.Asy. Var}[b_{IV}] - \text{Est.Asy. Var}[b_{LS}]\right\}^{-1}(b_{IV} - b_{LS}).$$

    Under the null hypothesis, we are using two different, but consistent, estimators of $\sigma^2$. If we use $s^2$ as the common estimator, then the statistic will be

    $$H = \frac{d'\left[(\hat{X}'\hat{X})^{-1} - (X'X)^{-1}\right]^{-1} d}{s^2}. \tag{5-23}$$

    It is tempting to invoke our results for the full rank quadratic form in a normal vector and conclude the degrees of freedom for this chi-squared statistic is K. But that method will usually be incorrect, and worse yet, unless X and Z have no variables in common, the rank of the matrix in this statistic is less than K, and the ordinary inverse will not even exist. In most cases, at least some of the variables in X will also appear in Z. (In almost any application, X and Z will both contain the constant term.) That is, some of the variables in X are known to be uncorrelated with the disturbances. For example, the usual case will involve a single variable that is thought to be problematic or that is measured with error. In this case, our hypothesis, $\operatorname{plim}(1/n)X'\varepsilon = 0$, does not


    really involve all K variables, since a subset of the elements in this vector, say $K_0$, are known to be zero. As such, the quadratic form in the Wald test is being used to test only $K^* = K - K_0$ hypotheses. It is easy (and useful) to show that, in fact, H is a rank $K^*$ quadratic form. Since $Z(Z'Z)^{-1}Z'$ is an idempotent matrix, $\hat{X}'\hat{X} = \hat{X}'X$. Using this result and expanding d, we find

    $$\begin{aligned}
    d &= (\hat{X}'\hat{X})^{-1}\hat{X}'y - (X'X)^{-1}X'y \\
      &= (\hat{X}'\hat{X})^{-1}\left[\hat{X}'y - (\hat{X}'\hat{X})(X'X)^{-1}X'y\right] \\
      &= (\hat{X}'\hat{X})^{-1}\hat{X}'\left(y - X(X'X)^{-1}X'y\right) \\
      &= (\hat{X}'\hat{X})^{-1}\hat{X}'e,
    \end{aligned}$$

    where e is the vector of least squares residuals. Recall that $K_0$ of the columns in $\hat{X}$ are the original variables in X. Suppose that these variables are the first $K_0$. Thus, the first $K_0$ rows of $\hat{X}'e$ are the same as the first $K_0$ rows of $X'e$, which are, of course, 0. (This statement does not mean that the first $K_0$ elements of d are zero.) So, we can write d as

    $$d = (\hat{X}'\hat{X})^{-1}\begin{bmatrix} 0 \\ \hat{X}^{*\prime}e \end{bmatrix} = (\hat{X}'\hat{X})^{-1}\begin{bmatrix} 0 \\ q^* \end{bmatrix}.$$

    Finally, denote the entire matrix in H by W. (Since that ordinary inverse may not exist, this matrix will have to be a generalized inverse; see Section A.7.12.) Then, denoting the whole matrix product by P, we obtain

    $$H = \begin{bmatrix} 0' & q^{*\prime} \end{bmatrix}(\hat{X}'\hat{X})^{-1} W (\hat{X}'\hat{X})^{-1}\begin{bmatrix} 0 \\ q^* \end{bmatrix} = \begin{bmatrix} 0' & q^{*\prime} \end{bmatrix} P \begin{bmatrix} 0 \\ q^* \end{bmatrix} = q^{*\prime} P^{**} q^*,$$

    where $P^{**}$ is the lower right $K^* \times K^*$ submatrix of P. We now have the end result. Algebraically, H is actually a quadratic form in a $K^*$ vector, so $K^*$ is the degrees of freedom for the test.

    Since the preceding Wald test requires a generalized inverse [see Hausman and Taylor (1981)], it is going to be a bit cumbersome. In fact, one need not actually approach the test in this form, and it can be carried out with any regression program. The alternative approach devised by Wu (1973) is simpler. An F statistic with $K^*$ and $n - K - K^*$ degrees of freedom can be used to test the joint significance of the elements of $\gamma$ in the augmented regression

    $$y = X\beta + \hat{X}^*\gamma + \varepsilon^*, \tag{5-24}$$

    where $\hat{X}^*$ are the fitted values in regressions of the variables in $X^*$ on Z. This result is equivalent to the Hausman test for this model. [Algebraic derivations of this result can be found in the articles and in Davidson and MacKinnon (1993).]

    Although most of the results above are specific to this test of correlation between some of the columns of X and the disturbances, $\varepsilon$, the Hausman test is general. To reiterate, when we have a situation in which we have a pair of estimators, $\hat{\theta}_E$ and $\hat{\theta}_I$, such that under $H_0$: $\hat{\theta}_E$ and $\hat{\theta}_I$ are both consistent and $\hat{\theta}_E$ is efficient relative to $\hat{\theta}_I$, while under $H_1$: $\hat{\theta}_I$ remains consistent while $\hat{\theta}_E$ is inconsistent, then we can form a test of the


    hypothesis by referring the "Hausman statistic,"

    $$H = (\hat{\theta}_I - \hat{\theta}_E)'\left\{\text{Est.Asy. Var}[\hat{\theta}_I] - \text{Est.Asy. Var}[\hat{\theta}_E]\right\}^{-1}(\hat{\theta}_I - \hat{\theta}_E) \xrightarrow{d} \chi^2[J],$$

    to the appropriate critical value for the chi-squared distribution. The appropriate degrees of freedom for the test, J, will depend on the context. Moreover, some sort of generalized inverse matrix may be needed for the matrix, although in at least one common case, the random effects regression model (see Chapter 13), the appropriate approach is to extract some rows and columns from the matrix instead. The short rank issue is not general. Many applications can be handled directly in this form with a full rank quadratic form. Moreover, the Wu approach is specific to this application. The other applications that we will consider, fixed and random effects for panel data and the independence from irrelevant alternatives test for the multinomial logit model, do not lend themselves to the regression approach and are typically handled using the Wald statistic and the full rank quadratic form. As a final note, observe that the short rank of the matrix in the Wald statistic is an algebraic result. The failure of the matrix in the Wald statistic to be positive definite, however, is sometimes a finite sample problem that is not part of the model structure. In such a case, forcing a solution by using a generalized inverse may be misleading. Hausman suggests that in this instance, the appropriate conclusion might be simply to take the result as zero and, by implication, not reject the null hypothesis.
    Example 5.3 Hausman Test for a Consumption Function

    Quarterly data for 1950.1 to 2000.4 on a number of macroeconomic variables appear in Table F5.1. A consumption function of the form $C_t = \alpha + \beta Y_t + \varepsilon_t$ is estimated using the 204 observations on aggregate U.S. consumption and disposable personal income. In Example 5.2, this model is suggested as a candidate for the possibility of bias due to correlation between $Y_t$ and $\varepsilon_t$. Consider instrumental variables estimation using $Y_{t-1}$ and $C_{t-1}$ as the instruments for $Y_t$, and, of course, the constant term is its own instrument. One observation is lost because of the lagged values, so the results are based on 203 quarterly observations. The Hausman statistic can be computed in two ways:

    1. Use the Wald statistic in (5-23) with the Moore-Penrose generalized inverse. The common $s^2$ is the one computed by least squares under the null hypothesis of no correlation. With this computation, H = 22.111. There is $K^* = 1$ degree of freedom. The 95 percent critical value from the chi-squared table is 3.84. Therefore, we reject the null hypothesis of no correlation between $Y_t$ and $\varepsilon_t$.
    2. Using the Wu statistic based on (5-24), we regress $C_t$ on a constant, $Y_t$, and the predicted value in a regression of $Y_t$ on a constant, $Y_{t-1}$ and $C_{t-1}$. The t ratio on the prediction is 4.945, so the F statistic with 1 and 201 degrees of freedom is 24.453. The critical value for this F distribution is 4.15, so, again, the null hypothesis is rejected.
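The two computations in the example can be sketched as follows. Table F5.1 is not reproduced here, so this sketch runs on simulated data and will not reproduce the values 22.111 or 24.453; the function name `hausman_and_wu`, its arguments, the explicit `rcond` cutoff, and the data-generating process are all assumptions made for illustration.

```python
import numpy as np

def hausman_and_wu(y, X, Z, endog_cols):
    """Return (H, F): the Wald form (5-23) with a Moore-Penrose inverse,
    and the Wu F statistic from the augmented regression (5-24).
    endog_cols lists the columns of X suspected of correlation with the disturbance."""
    n, K = X.shape
    K_star = len(endog_cols)
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]

    b_ls = np.linalg.solve(X.T @ X, X.T @ y)
    b_iv = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)
    d = b_iv - b_ls

    e = y - X @ b_ls
    s2 = e @ e / (n - K)                     # common s^2 from least squares

    # The matrix in (5-23) has rank K* < K, so a generalized inverse is used.
    # An explicit cutoff keeps rounding noise from being treated as a nonzero singular value.
    V = np.linalg.inv(X_hat.T @ X_hat) - np.linalg.inv(X.T @ X)
    H = d @ np.linalg.pinv(V, rcond=1e-8) @ d / s2

    # Wu: add the first-stage fitted values of the suspect columns and test them jointly.
    W = np.column_stack([X, X_hat[:, endog_cols]])
    e_aug = y - W @ np.linalg.lstsq(W, y, rcond=None)[0]
    F = ((e @ e - e_aug @ e_aug) / K_star) / (e_aug @ e_aug / (n - K - K_star))
    return H, F

# Simulated data with a deliberately endogenous regressor (hypothetical design)
rng = np.random.default_rng(0)
n = 5_000
z1, z2, eps = rng.normal(size=(3, n))
x = 0.7 * z1 + 0.7 * z2 + 0.6 * eps + rng.normal(size=n)
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z1, z2])
H, F = hausman_and_wu(y, X, Z, endog_cols=[1])
```

With one suspect column, H would be referred to the chi-squared distribution with one degree of freedom and F to the F distribution with 1 and $n - K - 1$ degrees of freedom. The two statistics need not coincide numerically because they use different estimates of the disturbance variance.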

    5.6 MEASUREMENT ERROR

    Thus far, it has been assumed (at least implicitly) that the data used to estimate the parameters of our models are true measurements on their theoretical counterparts. In practice, this situation happens only in the best of circumstances. All sorts of measurement problems creep into the data that must be used in our analyses. Even carefully constructed survey data do not always conform exactly to the variables the analysts have in mind for their regressions. Aggregate statistics such as GDP are only estimates


    of their theoretical counterparts, and some variables, such as depreciation, the services of capital, and "the interest rate," do not even exist in an agreed-upon theory. At worst, there may be no physical measure corresponding to the variable in our model; intelligence, education, and permanent income are but a few examples. Nonetheless, they all have appeared in very precisely defined regression models.

    5.6.1 LEAST SQUARES ATTENUATION

    In this section, we examine some of the received results on regression analysis with badly measured data. The general assessment of the problem is not particularly optimistic. The biases introduced by measurement error can be rather severe. There are almost no known finite-sample results for the models of measurement error; nearly all the results that have been developed are asymptotic.¹⁰ The following presentation will use a few simple asymptotic results for the classical regression model.

    The simplest case to analyze is that of a regression model with a single regressor and no constant term. Although this case is admittedly unrealistic, it illustrates the essential concepts, and we shall generalize it presently. Assume that the model

    $$y^* = \beta x^* + \varepsilon \tag{5-25}$$

    conforms to all the assumptions of the classical normal regression model. If data on $y^*$ and $x^*$ were available, then $\beta$ would be estimable by least squares. Suppose, however, that the observed data are only imperfectly measured versions of $y^*$ and $x^*$. In the context of an example, suppose that $y^*$ is ln(output/labor) and $x^*$ is ln(capital/labor). Neither factor input can be measured with precision, so the observed y and x contain errors of measurement. We assume that

    $$y = y^* + v \quad \text{with } v \sim N[0, \sigma_v^2], \tag{5-26a}$$
    $$x = x^* + u \quad \text{with } u \sim N[0, \sigma_u^2]. \tag{5-26b}$$

    Assume, as well, that u and v are independent of each other and of $y^*$ and $x^*$. (As we shall see, adding these restrictions is not sufficient to rescue a bad situation.)

    As a first step, insert (5-26a) into (5-25), assuming for the moment that only $y^*$ is measured with error:

    $$y = \beta x^* + \varepsilon + v = \beta x^* + \varepsilon'.$$

    This result conforms to the assumptions of the classical regression model. As long as the regressor is measured properly, measurement error on the dependent variable can be absorbed in the disturbance of the regression and ignored. To save some cumbersome notation, therefore, we shall henceforth assume that the measurement error problems concern only the independent variables in the model.

    Consider, then, the regression of y on the observed x. By substituting (5-26b) into (5-25), we obtain

    $$y = \beta x + [\varepsilon - \beta u] = \beta x + w. \tag{5-27}$$

    ¹⁰ See, for example, Imbens and Hyslop (2001).


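    Since x equals $x^* + u$, the regressor in (5-27) is correlated with the disturbance:

    $$\operatorname{Cov}[x, w] = \operatorname{Cov}[x^* + u, \varepsilon - \beta u] = -\beta\sigma_u^2. \tag{5-28}$$

    This result violates one of the central assumptions of the classical model, so we can expect the least squares estimator,

    $$b = \frac{(1/n)\sum_{i=1}^{n} x_i y_i}{(1/n)\sum_{i=1}^{n} x_i^2},$$

    to be inconsistent. To find the probability limits, insert (5-25) and (5-26b) and use the Slutsky theorem:

    $$\operatorname{plim} b = \frac{\operatorname{plim}(1/n)\sum_{i=1}^{n}(x_i^* + u_i)(\beta x_i^* + \varepsilon_i)}{\operatorname{plim}(1/n)\sum_{i=1}^{n}(x_i^* + u_i)^2}.$$

    Since $x^*$, $\varepsilon$, and u are mutually independent, this equation reduces to

    $$\operatorname{plim} b = \frac{\beta Q^*}{Q^* + \sigma_u^2} = \frac{\beta}{1 + \sigma_u^2/Q^*}, \tag{5-29}$$

    where $Q^* = \operatorname{plim}(1/n)\sum_i x_i^{*2}$. As long as $\sigma_u^2$ is positive, b is inconsistent, with a persistent bias toward zero. Clearly, the greater the variability in the measurement error, the worse the bias. The effect of biasing the coefficient toward zero is called attenuation.

    In a multiple regression model, matters only get worse. Suppose, to begin, we assume that $y = X^*\beta + \varepsilon$ and $X = X^* + U$, allowing every observation on every variable to be measured with error. The extension of the earlier result is

    $$\operatorname{plim}\left(\frac{X'X}{n}\right) = Q^* + \Sigma_{uu}, \quad \text{and} \quad \operatorname{plim}\left(\frac{X'y}{n}\right) = Q^*\beta.$$

    Hence,

    $$\operatorname{plim} b = [Q^* + \Sigma_{uu}]^{-1} Q^*\beta = \beta - [Q^* + \Sigma_{uu}]^{-1}\Sigma_{uu}\beta. \tag{5-30}$$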
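The attenuation result in (5-29) is easy to see in a simulation of the single-regressor, no-constant model. The parameter values below ($\beta = 1.5$, $Q^* = 4$, $\sigma_u^2 = 1$) are hypothetical choices for illustration and are not from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
beta, q_star, sigma_u2 = 1.5, 4.0, 1.0       # hypothetical values

x_star = rng.normal(scale=np.sqrt(q_star), size=n)   # true regressor, E[x*^2] = Q*
eps = rng.normal(size=n)
u = rng.normal(scale=np.sqrt(sigma_u2), size=n)      # measurement error in x

y = beta * x_star + eps                      # (5-25), y measured without error
x = x_star + u                               # (5-26b)

# Least squares slope in the regression of y on the observed x (no constant)
b = (x @ y) / (x @ x)

# Probability limit implied by (5-29)
plim_b = beta / (1.0 + sigma_u2 / q_star)
```

With these values, (5-29) gives $1.5/(1 + 1/4) = 1.2$, and the simulated `b` should lie close to that figure rather than to the true 1.5.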

    This probability limit is a mixture of all the parameters in the model. In the same fashion as before, bringing in outside information could lead to identification. The amount of information necessary is extremely large, however, and this approach is not particularly promising.

    It is common for only a single variable to be measured with error. One might speculate that the problems would be isolated to the single coefficient. Unfortunately, this situation is not the case. For a single bad variable (assume that it is the first), the matrix $\Sigma_{uu}$ is of the form

    $$\Sigma_{uu} = \begin{bmatrix} \sigma_u^2 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}.$$

    It can be shown that for this special case,

    $$\operatorname{plim} b_1 = \frac{\beta_1}{1 + \sigma_u^2 q^{*11}} \tag{5-31a}$$


    (note the similarity of this result to the earlier one), and, for $k \neq 1$,

    $$\operatorname{plim} b_k = \beta_k - \beta_1\left[\frac{\sigma_u^2 q^{*k1}}{1 + \sigma_u^2 q^{*11}}\right], \tag{5-31b}$$

    where $q^{*k1}$ is the (k, 1)th element in $(Q^*)^{-1}$.¹¹ This result depends on several unknowns and cannot be estimated. The coefficient on the badly measured variable is still biased toward zero. The other coefficients are all biased as well, although in unknown directions. A badly measured variable contaminates all the least squares estimates.¹² If more than one variable is measured with error, there is very little that can be said.¹³ Although expressions can be derived for the biases in a few of these cases, they generally depend on numerous parameters whose signs and magnitudes are unknown and, presumably, unknowable.
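The closed forms (5-31a) and (5-31b) can be checked against the general expression (5-30) for any particular set of population moments. The sketch below does this for a hypothetical positive definite $Q^*$, coefficient vector, and $\sigma_u^2$; these numbers are assumptions for illustration only.

```python
import numpy as np

# Hypothetical population quantities (not from the text)
Q_star = np.array([[2.0, 0.5, 0.3],
                   [0.5, 1.0, 0.2],
                   [0.3, 0.2, 1.5]])          # positive definite moment matrix of X*
beta = np.array([1.0, -0.5, 2.0])
sigma_u2 = 0.8

# Only the first variable is measured with error
Sigma_uu = np.zeros((3, 3))
Sigma_uu[0, 0] = sigma_u2

# General result (5-30): plim b = [Q* + Sigma_uu]^{-1} Q* beta
plim_b_general = np.linalg.solve(Q_star + Sigma_uu, Q_star @ beta)

# Special-case results (5-31a) and (5-31b), written as one vector expression:
# element 0 reduces to beta_1 / (1 + sigma_u^2 q*^{11})
Q_inv = np.linalg.inv(Q_star)
denom = 1.0 + sigma_u2 * Q_inv[0, 0]
plim_b_closed = beta - beta[0] * sigma_u2 * Q_inv[:, 0] / denom
```

The two vectors agree, the first element is attenuated toward zero, and the remaining elements differ from their true values whenever the corresponding $q^{*k1}$ is nonzero.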
    ¹¹ Use (A-66) to invert $[Q^* + \Sigma_{uu}] = [Q^* + (\sigma_u e_1)(\sigma_u e_1)']$, where $e_1$ is the first column of a $K \times K$ identity matrix. The remaining results are then straightforward.

    ¹² This point is important to remember when the presence of measurement error is suspected.

    ¹³ Some firm analytic results have been obtained by Levi (1973), Theil (1961), Klepper and Leamer (1983), Garber and Klepper (1980), and Griliches (1986) and Cragg (1997).

    5.6.2 INSTRUMENTAL VARIABLES ESTIMATION

    An alternative set of results for estimation in this model (and numerous others) is built around the method of instrumental variables. Consider once again the errors in variables model in (5-25) and (5-26a,b). The parameters, $\beta$, $\sigma_\varepsilon^2$, $q^*$, and $\sigma_u^2$ are not identified in terms of the moments of x and y. Suppose, however, that there exists a variable z such that z is correlated with $x^*$ but not with u. For example, in surveys of families, income is notoriously badly reported, partly deliberately and partly because respondents often neglect some minor sources. Suppose, however, that one could determine the total amount of checks written by the head(s) of the household. It is quite likely that this z would be highly correlated with income, but perhaps not significantly correlated with the errors of measurement. If $\operatorname{Cov}[x^*, z]$ is not zero, then the parameters of the model become estimable, as

    $$\operatorname{plim} \frac{(1/n)\sum_i y_i z_i}{(1/n)\sum_i x_i z_i} = \frac{\beta \operatorname{Cov}[x^*, z]}{\operatorname{Cov}[x^*, z]} = \beta. \tag{5-32}$$

    In a multiple regression framework, if only a single variable is measured with error, then the preceding can be applied to that variable and the remaining variables can serve as their own instruments. If more than one variable is measured with error, then the first preceding proposal will be cumbersome at best, whereas the second can be applied to each.

    For the general case, $y = X^*\beta + \varepsilon$, $X = X^* + U$, suppose that there exists a matrix of variables Z that is not correlated with the disturbances or the measurement error but is correlated with regressors, X. Then the instrumental variables estimator based on Z, $b_{IV} = (Z'X)^{-1}Z'y$, is consistent and asymptotically normally distributed with asymptotic covariance matrix that is estimated with

    $$\text{Est.Asy. Var}[b_{IV}] = \hat{\sigma}^2 [Z'X]^{-1}[Z'Z][X'Z]^{-1}. \tag{5-33}$$

    For more general cases, Theorem 5.3 and the results in Section 5.4 apply.
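The ratio in (5-32) can be illustrated in the single-regressor errors in variables model. This is a sketch on simulated data; the instrument equation $z = 0.7x^* + \text{noise}$ and all parameter values are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
beta = 1.5                                   # hypothetical true coefficient

x_star = rng.normal(scale=2.0, size=n)       # true regressor, Q* = 4
z = 0.7 * x_star + rng.normal(size=n)        # correlated with x*, unrelated to u and eps
u = rng.normal(size=n)                       # measurement error, sigma_u^2 = 1
eps = rng.normal(size=n)

x = x_star + u                               # observed, badly measured regressor
y = beta * x_star + eps

# Least squares on the observed x is attenuated, as in (5-29)
b_ls = (x @ y) / (x @ x)

# Instrumental variable estimator: sample analog of the ratio in (5-32)
b_iv = (z @ y) / (z @ x)
```

Here `b_ls` should settle near $1.5/(1 + 1/4) = 1.2$, while `b_iv` should be close to the true 1.5.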

    5.6.3 PROXY VARIABLES

    In some situations, a variable in a model simply has no observable counterpart. Education, intelligence, ability, and like factors are perhaps the most common examples. In this instance, unless there is some observable indicator for the variable, the model will have to be treated in the framework of missing variables. Usually, however, such an indicator can be obtained; for the factors just given, years of schooling and test scores of various sorts are familiar examples. The usual treatment of such variables is in the measurement error framework. If, for example,

    $$\text{income} = \beta_1 + \beta_2\,\text{education} + \varepsilon$$

    and

    $$\text{years of schooling} = \text{education} + u,$$

    then the model of Section 5.6.1 applies. The only difference here is that the true variable in the model is "latent." No amount of improvement in reporting or measurement would bring the proxy closer to the variable for which it is proxying.

    The preceding is a pessimistic assessment, perhaps more so than necessary. Consider a structural model,

    $$\text{Earnings} = \beta_1 + \beta_2\,\text{Experience} + \beta_3\,\text{Industry} + \beta_4\,\text{Ability} + \varepsilon.$$

    Ability is unobserved, but suppose that an indicator, say IQ, is observed. If we suppose that IQ is related to Ability through a relationship such as

    $$\text{IQ} = \alpha_1 + \alpha_2\,\text{Ability} + v,$$

    then we may solve the second equation for Ability and insert it in the first to obtain the reduced form equation

    $$\text{Earnings} = (\beta_1 - \beta_4\alpha_1/\alpha_2) + \beta_2\,\text{Experience} + \beta_3\,\text{Industry} + (\beta_4/\alpha_2)\,\text{IQ} + (\varepsilon - v\beta_4/\alpha_2).$$

    This equation is intrinsically linear and can be estimated by least squares. We do not have a consistent estimator of $\beta_1$ or $\beta_4$, but we do have one of the coefficients of interest. This would appear to "solve" the problem. We should note the essential ingredients; we require that the indicator, IQ, not be related to the other variables in the model, and we also require that v not be correlated with any of the variables. In this instance, some of the parameters of the structural model are identified in terms of observable data. Note, though, that IQ is not a proxy variable; it is an indicator of the latent variable, Ability. This form of modeling has figured prominently in the education and educational psychology literature. Consider, in the preceding small model, how one might proceed with not just a single indicator, but say with a battery of test scores, all of which are indicators of the same latent ability variable.

    It is to be emphasized that a proxy variable is not an instrument (or the reverse). Thus, in the instrumental variables framework, it is implied that we do not regress y on Z to obtain the estimates. To take an extreme example, suppose that the full model was

    $$y = X^*\beta + \varepsilon,$$
    $$X = X^* + U,$$
    $$Z = X^* + W.$$


    That is, we happen to have two badly measured estimates of $X^*$. The parameters of this model can be estimated without difficulty if $W$ is uncorrelated with $U$ and $X^*$, but not by regressing $y$ on $Z$. The instrumental variables technique is called for.

    When the model contains a variable such as education or ability, the question that naturally arises is: if interest centers on the other coefficients in the model, why not just discard the problem variable?¹⁴ This method produces the familiar problem of an omitted variable, compounded by the least squares estimator in the full model being inconsistent anyway. Which estimator is worse? McCallum (1972) and Wickens (1972) show that the asymptotic bias (actually, degree of inconsistency) is worse if the proxy is omitted, even if it is a bad one (has a high proportion of measurement error). This proposition neglects, however, the precision of the estimates. Aigner (1974) analyzed this aspect of the problem and found, as might be expected, that it could go either way. He concluded, however, that "there is evidence to broadly support use of the proxy."
    5.6.4 APPLICATION: INCOME AND EDUCATION AND A STUDY OF TWINS

    The traditional model used in labor economics to study the effect of education on income is an equation of the form

    $$y_i = \beta_1 + \beta_2\,\text{age}_i + \beta_3\,\text{age}_i^2 + \beta_4\,\text{education}_i + x_i'\beta_5 + \varepsilon_i,$$

    where $y_i$ is typically a wage or yearly income (perhaps in log form) and $x_i$ contains other variables, such as an indicator for sex, region of the country, and industry. The literature contains discussion of many possible problems in estimation of such an equation by least squares using measured data. Two of them are of interest here:

    1. Although "education" is the variable that appears in the equation, the data available to researchers usually include only "years of schooling." This variable is a proxy for education, so an equation fit in this form will be tainted by this problem of measurement error. Perhaps surprisingly, researchers also find that reported data on years of schooling are themselves subject to error, so there is a second source of measurement error. For the present, we will not consider the first (much more difficult) problem.

    2. Other variables, such as "ability" (we denote these $\mu_i$), will also affect income and are surely correlated with education. If the earnings equation is estimated in the form shown above, then the estimates will be further biased by the absence of this "omitted variable." For reasons we will explore in Chapter 22, this bias has been called the selectivity effect in recent studies.

    Simple cross-section studies will be considerably hampered by these problems. But, in a recent study, Ashenfelter and Krueger (1994) analyzed a data set that allowed them, with a few simple assumptions, to ameliorate these problems. Annual "twins festivals" are held at many places in the United States. The largest is held in Twinsburg, Ohio. The authors interviewed about 500 individuals over the age of 18 at the August 1991 festival. Using pairs of twins as their observations enabled them to modify their model as follows: Let $(y_{ij}, A_{ij})$ denote the earnings and age for
    14 This discussion applies to the measurement error and latent variable problems equally.


    twin $j$, $j = 1, 2$, for pair $i$. For the education variable, only self-reported "schooling" data, $S_{ij}$, are available. The authors approached the measurement problem in the schooling variable, $S_{ij}$, by asking each twin how much schooling they had and how much schooling their sibling had. Denote schooling reported by sibling $m$ of sibling $j$ by $S_{ij}(m)$. So, the self-reported years of schooling of twin 1 is $S_{i1}(1)$. When asked how much schooling twin 1 has, twin 2 reports $S_{i1}(2)$. The measurement error model for the schooling variable is

    $$S_{ij}(m) = S_{ij} + u_{ij}(m), \qquad j, m = 1, 2,$$

    where $S_{ij}$ is the "true" schooling for twin $j$ of pair $i$. We assume that the two sources of measurement error, $u_{ij}(m)$, are uncorrelated and have zero means. Now, consider a simple bivariate model such as the one in (5-25):

    $$y_{ij} = \beta S_{ij} + \varepsilon_{ij}.$$

    As we saw earlier, a least squares estimate of $\beta$ using the reported data will be attenuated:

    $$\operatorname{plim} b = \beta \times \frac{\operatorname{Var}[S_{ij}]}{\operatorname{Var}[S_{ij}] + \operatorname{Var}[u_{ij}(j)]} = \beta q.$$

    (Since there is no natural distinction between twin 1 and twin 2, the assumption that the variances of the two measurement errors are equal is innocuous.) The factor $q$ is sometimes called the reliability ratio. In this simple model, if the reliability ratio were known, then $\beta$ could be consistently estimated. In fact, the construction of this model allows just that. Since the two measurement errors are uncorrelated,

    $$\operatorname{Corr}[S_{i1}(1), S_{i1}(2)] = \operatorname{Corr}[S_{i2}(2), S_{i2}(1)] = \frac{\operatorname{Var}[S_{i1}]}{\left\{\left(\operatorname{Var}[S_{i1}] + \operatorname{Var}[u_{i1}(1)]\right) \times \left(\operatorname{Var}[S_{i1}] + \operatorname{Var}[u_{i1}(2)]\right)\right\}^{1/2}} = q.$$
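    The reliability-ratio argument can be checked with a small simulation. The sketch below is only an illustration under assumed values: the sample size, the coefficient `beta`, and the variances `var_s` and `var_u` are hypothetical and are not taken from the Ashenfelter and Krueger data. It generates two error-ridden reports of the same true schooling, estimates $q$ by their correlation, and compares the attenuated least squares slope with $\beta q$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000                     # number of simulated observations (hypothetical)
beta = 0.10                     # hypothetical coefficient on true schooling
var_s, var_u = 4.0, 0.5         # hypothetical Var[S] and Var[u]

s_true = rng.normal(0.0, np.sqrt(var_s), n)             # true schooling S_i1
s_self = s_true + rng.normal(0.0, np.sqrt(var_u), n)    # own report, S_i1(1)
s_sib = s_true + rng.normal(0.0, np.sqrt(var_u), n)     # sibling's report, S_i1(2)
y = beta * s_true + rng.normal(0.0, 0.3, n)             # bivariate model as in (5-25)

q_true = var_s / (var_s + var_u)            # reliability ratio implied by the design
q_hat = np.corrcoef(s_self, s_sib)[0, 1]    # correlation of the two reports

# Least squares slope of y on the own report (no constant, as in the text).
b_ls = (s_self @ y) / (s_self @ s_self)
b_corrected = b_ls / q_hat                  # undo the attenuation using q_hat

print(q_true, q_hat, b_ls, b_corrected)
```

    In this design the correlation between the two reports should land close to `q_true`, and dividing the attenuated slope by it should roughly recover `beta`. As the text notes next, this simple correction does not carry over directly to a multiple regression.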

    In words, the correlation between the two reported education attainments measures the reliability ratio. The authors obtained values of 0.920 and 0.877 for 298 pairs of identical twins and 0.869 and 0.951 for 92 pairs of fraternal twins, thus providing a quick assessment of the extent of measurement error in their schooling data.

    Since the earnings equation is a multiple regression, this result is useful for an overall assessment of the problem, but the numerical values are not sufficient to undo the overall biases in the least squares regression coefficients. An instrumental variables estimator was used for that purpose. The estimating equation for $y_{ij} = \ln \text{Wage}_{ij}$, with the least squares (LS) and instrumental variable (IV) estimates, is as follows:

    $$y_{ij} = \beta_1 + \beta_2\,\text{age}_i + \beta_3\,\text{age}_i^2 + \beta_4 S_{ij}(j) + \beta_5 S_{im}(m) + \beta_6\,\text{sex}_i + \beta_7\,\text{race}_i + \varepsilon_{ij}$$

          β2       β3        β4       β5        β6       β7
    LS    0.088    -0.087    0.084              0.204    -0.410
    IV    0.088    -0.087    0.116    -0.037    0.206    -0.428

    In the equation, $S_{ij}(j)$ is the person's report of his or her own years of schooling and $S_{im}(m)$ is the sibling's report of the sibling's own years of schooling. The problem variable is schooling. To obtain consistent estimates, the method of instrumental variables was used, using each sibling's report of the other sibling's years of schooling as a pair of instrumental variables. The estimates reported by the authors are shown below the equation. (The constant term was not reported, and for reasons not given, the second schooling variable was not included in the equation when estimated by LS.) This


    preliminary set of results is presented to give a comparison to other results in the literature. The age, schooling, and gender effects are comparable with other received results, whereas the effect of race is vastly different, -40 percent here compared with a typical value of +9 percent in other studies. The effect of using the instrumental variable estimator on the estimates of $\beta_4$ is of particular interest. Recall that the reliability ratio was estimated at about 0.9, which suggests that the IV estimate would be roughly 11 percent higher (1/0.9). Since this result is a multiple regression, that estimate is only a crude guide. The estimated effect shown above is closer to 38 percent.

    The authors also used a different estimation approach. Recall the issue of selection bias caused by unmeasured effects. The authors reformulated their model as

    $$y_{ij} = \beta_1 + \beta_2\,\text{age}_i + \beta_3\,\text{age}_i^2 + \beta_4 S_{ij}(j) + \beta_6\,\text{sex}_i + \beta_7\,\text{race}_i + \mu_i + \varepsilon_{ij}.$$

    Unmeasured latent effects, such as "ability," are contained in $\mu_i$. Since $\mu_i$ is not observable but is, it is assumed, correlated with other variables in the equation, the least squares regression of $y_{ij}$ on the other variables produces a biased set of coefficient estimates. The difference between the two earnings equations is

    $$y_{i1} - y_{i2} = \beta_4\,[S_{i1}(1) - S_{i2}(2)] + \varepsilon_{i1} - \varepsilon_{i2}.$$

    This equation removes the latent effect but, it turns out, worsens the measurement error problem. As before, $\beta_4$ can be estimated by instrumental variables. There are two instrumental variables available, $S_{i2}(1)$ and $S_{i1}(2)$. (It is not clear in the paper whether the authors used the two separately or the difference of the two.) The least squares estimate is 0.092, which is comparable to the earlier estimate. The instrumental variable estimate is 0.167, which is nearly 82 percent higher. The two reported standard errors are 0.024 and 0.043, respectively. With these figures, it is possible to carry out Hausman's test:

    $$H = \frac{(0.167 - 0.092)^2}{0.043^2 - 0.024^2} = 4.418.$$

    The 95 percent critical value from the chi-squared distribution with one degree of freedom is 3.84, so the hypothesis that the LS estimator is consistent would be rejected. (The square root of $H$, 2.102, would be treated as a value from the standard normal distribution, from which the critical value would be 1.96. The authors reported a $t$ statistic for this regression of 1.97. The source of the difference is unclear.)
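    The Hausman statistic above is simple arithmetic on the two estimates and their standard errors, so it can be reproduced directly. The following sketch uses only the four figures quoted in the text and the 3.84 critical value given there; the variable names are illustrative.

```python
import math

# Estimates and standard errors quoted in the text (differenced twins regression).
b_ls, se_ls = 0.092, 0.024     # least squares
b_iv, se_iv = 0.167, 0.043     # instrumental variables

# Hausman statistic for a single coefficient:
# squared difference over the difference of the estimated variances.
H = (b_iv - b_ls) ** 2 / (se_iv ** 2 - se_ls ** 2)
z = math.sqrt(H)               # treated as a standard normal value

CHI2_1_CRIT_95 = 3.84          # 95 percent chi-squared[1] critical value, from the text
reject_ls_consistency = H > CHI2_1_CRIT_95

print(H, z, reject_ls_consistency)
```

    The computed values should agree with the text's 4.418 and 2.102 up to rounding in the last digit.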

    5.7 SUMMARY AND CONCLUSIONS

    This chapter has completed the description begun in Chapter 4 by obtaining the large sample properties of the least squares estimator. The main result is that in large samples, the estimator behaves according to a normal distribution and converges in probability to the true coefficient vector. We examined several data types, with one of the end results being that consistency and asymptotic normality would persist under a variety of broad assumptions about the data. We then considered a class of estimators, the instrumental variable estimators, which will retain the important large sample properties we found earlier, consistency and asymptotic normality, in cases in which the least squares estimator is inconsistent. Two common applications include dynamic models, including panel data models, and models of measurement error.

    Key Terms and Concepts
    • Asymptotic distribution
    • Asymptotic efficiency
    • Asymptotic normality
    • Asymptotic covariance matrix
    • Asymptotic properties
    • Attenuation
    • Consistency
    • Dynamic regression
    • Efficient scale
    • Ergodic
    • Finite sample properties
    • Grenander conditions
    • Hausman's specification test
    • Identification
    • Indicator
    • Instrumental variable
    • Lindeberg-Feller central limit theorem
    • Maximum likelihood estimator
    • Mean square convergence
    • Measurement error
    • Panel data
    • Probability limit
    • Reduced form equation
    • Reliability ratio
    • Specification test
    • Stationary process
    • Stochastic regressors
    • Structural model
    • Two stage least squares

    Exercises

    1. For the classical normal regression model $y = X\beta + \varepsilon$ with no constant term and $K$ regressors, what is
    $$\operatorname{plim} F[K, n-K] = \operatorname{plim} \frac{R^2/K}{(1 - R^2)/(n - K)},$$
    assuming that the true value of $\beta$ is zero?

    2. Let $e_i$ be the $i$th residual in the ordinary least squares regression of $y$ on $X$ in the classical regression model, and let $\varepsilon_i$ be the corresponding true disturbance. Prove that $\operatorname{plim}(e_i - \varepsilon_i) = 0$.

    3. For the simple regression model $y_i = \mu + \varepsilon_i$, $\varepsilon_i \sim N[0, \sigma^2]$, prove that the sample mean is consistent and asymptotically normally distributed. Now consider the alternative estimator $\hat{\mu} = \sum_i w_i y_i$, $w_i = \dfrac{i}{n(n+1)/2} = \dfrac{i}{\sum_i i}$. Note that $\sum_i w_i = 1$. Prove that this is a consistent estimator of $\mu$ and obtain its asymptotic variance. [Hint: $\sum_i i^2 = n(n+1)(2n+1)/6$.]

    4. In the discussion of the instrumental variables estimator, we showed that the least squares estimator $b$ is biased and inconsistent. Nonetheless, $b$ does estimate something: $\operatorname{plim} b = \theta = \beta + Q^{-1}\gamma$. Derive the asymptotic covariance matrix of $b$, and show that $b$ is asymptotically normally distributed.

    5. For the model in (5-25) and (5-26), prove that when only $x^*$ is measured with error, the squared correlation between $y$ and $x$ is less than that between $y^*$ and $x^*$. (Note the assumption that $y^* = y$.) Does the same hold true if $y^*$ is also measured with error?

    6. Christensen and Greene (1976) estimated a generalized Cobb-Douglas cost function of the form
    $$\ln(C/P_f) = \alpha + \beta \ln Q + \gamma(\ln^2 Q)/2 + \delta_k \ln(P_k/P_f) + \delta_l \ln(P_l/P_f) + \varepsilon.$$
    $P_k$, $P_l$, and $P_f$ indicate unit prices of capital, labor, and fuel, respectively, $Q$ is output, and $C$ is total cost. The purpose of the generalization was to produce a U-shaped average total cost curve. (See Example 7.3 for discussion of Nerlove's (1963) predecessor to this study.) We are interested in the output at which the cost curve reaches its minimum. That is the point at which $(\partial \ln C/\partial \ln Q)|_{Q = Q^*} = 1$, or $Q^* = \exp[(1 - \beta)/\gamma]$. The estimated regression model using the Christensen and Greene 1970 data is as follows, where estimated standard errors are given in parentheses:

    ln(C/Pf) = -7.294  + 0.39091 ln Q + 0.062413 (ln² Q)/2 + 0.07479 ln(Pk/Pf) + 0.2608 ln(Pl/Pf) + e
               (0.34427) (0.036988)     (0.0051548)          (0.061645)          (0.068109)

    The estimated asymptotic covariance of the estimators of $\beta$ and $\gamma$ is $-0.000187067$, $R^2 = 0.991538$, and $e'e = 2.443509$. Using the estimates given above, compute the estimate of this efficient scale. Compute an estimate of the asymptotic standard error for this estimate, then form a confidence interval for the estimated efficient scale. The data for this study are given in Table F5.2. Examine the raw data and determine where in the sample the efficient scale lies. That is, how many firms in the sample have reached this scale, and is this scale large in relation to the sizes of firms in the sample?

    7. The consumption function used in Example 5.3 is a very simple specification. One might wonder if the meager specification of the model could help explain the finding in the Hausman test. The data set used for the example is given in Table F5.1. Use these data to carry out the test in a more elaborate specification,
    $$c_t = \beta_1 + \beta_2 y_t + \beta_3 i_t + \beta_4 c_{t-1} + \varepsilon_t,$$
    where $c_t$ is the log of real consumption, $y_t$ is the log of real disposable income, and $i_t$ is the interest rate (90-day T-bill rate).

    8. Suppose we change the assumptions of the model to AS5: $(x_i, \varepsilon_i)$ are an independent and identically distributed sequence of random vectors such that $x_i$ has a finite mean vector, $\mu_x$, finite positive definite covariance matrix $\Sigma_{xx}$, and finite fourth moments $E[x_j x_k x_l x_m] = \phi_{jklm}$, for all variables. How does the proof of consistency and asymptotic normality of $b$ change? Are these assumptions weaker or stronger than the ones made in Section 5.2?

    9. Now, assume only finite second moments of $x$; $E[x_i^2]$ is finite. Is this sufficient to establish consistency of $b$? (Hint: the Cauchy-Schwarz inequality (Theorem D.13), $E[|xy|] \le \{E[x^2]\}^{1/2}\{E[y^2]\}^{1/2}$, will be helpful.) Is this assumption sufficient to establish asymptotic normality?


    6 INFERENCE AND PREDICTION

    6.1 INTRODUCTION

    The linear regression model is used for three major functions: estimation, which was the subject of the previous three chapters (and most of the rest of this book), hypothesis testing, and prediction or forecasting. In this chapter, we will examine some applications of hypothesis tests using the classical model. The basic statistical theory was developed in Chapters 4, 5, and Appendix C, so the methods discussed here will use tools that are already familiar. After the theory is developed in Sections 6.2-6.4, we will examine some applications in Sections 6.4 and 6.5. We will be primarily concerned with linear restrictions in this chapter, and will turn to nonlinear restrictions near the end of the chapter, in Section 6.5. Section 6.6 discusses the third major use of the regression model, prediction.

    6.2 RESTRICTIONS AND NESTED MODELS

    One common approach to testing a hypothesis is to formulate a statistical model that contains the hypothesis as a restriction on its parameters. A theory is said to have testable implications if it implies some testable restrictions on the model. Consider, for example, a simple model of investment, $I_t$, suggested by Section 3.3.2,

    $$\ln I_t = \beta_1 + \beta_2 i_t + \beta_3 \Delta p_t + \beta_4 \ln Y_t + \beta_5 t + \varepsilon_t, \tag{6-1}$$

    which states that investors are sensitive to nominal interest rates, $i_t$, the rate of inflation, $\Delta p_t$, (the log of) real output, $\ln Y_t$, and other factors which trend upward through time, embodied in the time trend, $t$. An alternative theory states that "investors care about real interest rates." The alternative model is

    $$\ln I_t = \beta_1 + \beta_2 (i_t - \Delta p_t) + \beta_3 \Delta p_t + \beta_4 \ln Y_t + \beta_5 t + \varepsilon_t. \tag{6-2}$$

    Although this new model does embody the theory, the equation still contains both nominal interest and inflation. The theory has no testable implication for our model. But, consider the stronger hypothesis, "investors care only about real interest rates." The resulting equation,

    $$\ln I_t = \beta_1 + \beta_2 (i_t - \Delta p_t) + \beta_4 \ln Y_t + \beta_5 t + \varepsilon_t, \tag{6-3}$$

    is now restricted; in the context of the first model, the implication is that $\beta_2 + \beta_3 = 0$. The stronger statement implies something specific about the parameters in the equation that may or may not be supported by the empirical evidence.

    The description of testable implications in the preceding paragraph suggests (correctly) that testable restrictions will imply that only some of the possible models contained in the original specification will be "valid;" that is, consistent with the theory. In the example given earlier, equation (6-1) specifies a model in which there are five unrestricted parameters $(\beta_1, \beta_2, \beta_3, \beta_4, \beta_5)$. But, equation (6-3) shows that only some values are consistent with the theory, that is, those for which $\beta_3 = -\beta_2$. This subset of values is contained within the unrestricted set. In this way, the models are said to be nested. Consider a different hypothesis, "investors do not care about inflation." In this case, the smaller set of coefficients is $(\beta_1, \beta_2, 0, \beta_4, \beta_5)$. Once again, the restrictions imply a valid parameter space that is "smaller" (has fewer dimensions) than the unrestricted one. The general result is that the hypothesis specified by the restricted model is contained within the unrestricted model.

    Now, consider an alternative pair of models: Model0: "Investors care only about inflation;" Model1: "Investors care only about the nominal interest rate." In this case, the two parameter vectors are $(\beta_1, 0, \beta_3, \beta_4, \beta_5)$ by Model0 and $(\beta_1, \beta_2, 0, \beta_4, \beta_5)$ by Model1. The two specifications are both subsets of the unrestricted model, but neither model is obtained as a restriction on the other. They have the same number of parameters; they just contain different variables. These two models are nonnested. We are concerned only with nested models in this chapter. Nonnested models are considered in Section 8.3.

    Beginning with the linear regression model

    $$y = X\beta + \varepsilon,$$

    we consider a set of linear restrictions of the form

    $$\begin{aligned} r_{11}\beta_1 + r_{12}\beta_2 + \cdots + r_{1K}\beta_K &= q_1 \\ r_{21}\beta_1 + r_{22}\beta_2 + \cdots + r_{2K}\beta_K &= q_2 \\ &\ \ \vdots \\ r_{J1}\beta_1 + r_{J2}\beta_2 + \cdots + r_{JK}\beta_K &= q_J. \end{aligned}$$

    These can be combined into the single equation

    $$R\beta = q.$$

    Each row of $R$ is the coefficients in one of the restrictions. The matrix $R$ has $K$ columns to be conformable with $\beta$, $J$ rows for a total of $J$ restrictions, and full row rank, so $J$ must be less than or equal to $K$. The rows of $R$ must be linearly independent. Although it does not violate the condition, the case of $J = K$ must also be ruled out.¹ The restriction $R\beta = q$ imposes $J$ restrictions on $K$ otherwise free parameters. Hence, with the restrictions imposed, there are, in principle, only $K - J$ free parameters remaining. One way to view this situation is to partition $R$ into two groups of columns, one with $J$ and one with $K - J$, so that the first set are linearly independent. (There are many ways to do so; any one will do for the present.) Then, with $\beta$ likewise partitioned and its elements
    1 If the $K$ slopes satisfy $J = K$ restrictions, then $R$ is square and nonsingular and $\beta = R^{-1}q$. There is no estimation or inference problem.


    reordered in whatever way is needed, we may write

    $$R\beta = R_1\beta_1 + R_2\beta_2 = q.$$

    If the $J$ columns of $R_1$ are independent, then

    $$\beta_1 = R_1^{-1}[q - R_2\beta_2]. \tag{6-4}$$

    The implication is that although $\beta_2$ is free to vary, once $\beta_2$ is determined, $\beta_1$ is determined by (6-4). Thus, only the $K - J$ elements of $\beta_2$ are free parameters in the restricted model.

    6.3 TWO APPROACHES TO TESTING HYPOTHESES

    Hypothesis testing of the sort suggested above can be approached from two viewpoints. First, having computed a set of parameter estimates, we can ask whether the estimates come reasonably close to satisfying the restrictions implied by the hypothesis. More formally, we can ascertain whether the failure of the estimates to satisfy the restrictions is simply the result of sampling error or is instead systematic. An alternative approach might proceed as follows. Suppose that we impose the restrictions implied by the theory. Since unrestricted least squares is, by definition, "least squares," this imposition must lead to a loss of fit. We can then ascertain whether this loss of fit results merely from sampling error or whether it is so large as to cast doubt on the validity of the restrictions. We will consider these two approaches in turn, then show that (as one might hope) within the framework of the linear regression model, the two approaches are equivalent.

    AN IMPORTANT ASSUMPTION

    To develop the test statistics in this section, we will assume normally distributed disturbances. As we saw in Chapter 4, with this assumption, we will be able to obtain the exact distributions of the test statistics. In the next section, we will consider the implications of relaxing this assumption and develop an alternative set of results that allows us to proceed without it.

    6.3.1 THE F STATISTIC AND THE LEAST SQUARES DISCREPANCY

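    We now consider testing a set of $J$ linear restrictions stated in the null hypothesis,

    $$H_0: R\beta - q = 0,$$

    against the alternative hypothesis,

    $$H_1: R\beta - q \neq 0.$$

    Each row of $R$ is the coefficients in a linear restriction on the coefficient vector. Typically, $R$ will have only a few rows and numerous zeros in each row. Some examples would be as follows:

    1. One of the coefficients is zero, $\beta_j = 0$:
    $$R = [0\ \ 0\ \cdots\ 1\ \ 0\ \cdots\ 0] \quad\text{and}\quad q = 0.$$

    2. Two of the coefficients are equal, $\beta_k = \beta_j$:
    $$R = [0\ \ 0\ \ 1\ \cdots\ -1\ \cdots\ 0] \quad\text{and}\quad q = 0.$$

    3. A set of the coefficients sum to one, $\beta_2 + \beta_3 + \beta_4 = 1$:
    $$R = [0\ \ 1\ \ 1\ \ 1\ \ 0\ \cdots] \quad\text{and}\quad q = 1.$$

    4. A subset of the coefficients are all zero, $\beta_1 = 0$, $\beta_2 = 0$, and $\beta_3 = 0$:
    $$R = \begin{bmatrix} 1 & 0 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 1 & 0 & \cdots & 0 \end{bmatrix} = [I : 0] \quad\text{and}\quad q = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}.$$

    5. Several linear restrictions, $\beta_2 + \beta_3 = 1$, $\beta_4 + \beta_6 = 0$, and $\beta_5 + \beta_6 = 0$:
    $$R = \begin{bmatrix} 0 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 & 1 \end{bmatrix} \quad\text{and}\quad q = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}.$$

    6. All the coefficients in the model except the constant term are zero [see (4-15) and Section 4.7.4]:
    $$R = [0 : I_{K-1}] \quad\text{and}\quad q = 0.$$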

    Given the least squares estimator $b$, our interest centers on the discrepancy vector $Rb - q = m$. It is unlikely that $m$ will be exactly 0. The statistical question is whether the deviation of $m$ from 0 can be attributed to sampling error or whether it is significant. Since $b$ is normally distributed [see (4-8)] and $m$ is a linear function of $b$, $m$ is also normally distributed. If the null hypothesis is true, then $R\beta - q = 0$ and $m$ has mean vector

    $$E[m \mid X] = R\,E[b \mid X] - q = R\beta - q = 0$$

    and covariance matrix

    $$\operatorname{Var}[m \mid X] = \operatorname{Var}[Rb - q \mid X] = R\{\operatorname{Var}[b \mid X]\}R' = \sigma^2 R(X'X)^{-1}R'.$$

    We can base a test of $H_0$ on the Wald criterion:

    $$W = m'\{\operatorname{Var}[m \mid X]\}^{-1}m = (Rb - q)'[\sigma^2 R(X'X)^{-1}R']^{-1}(Rb - q) = \frac{(Rb - q)'[R(X'X)^{-1}R']^{-1}(Rb - q)}{\sigma^2} \sim \chi^2[J]. \tag{6-5}$$

    The statistic $W$ has a chi-squared distribution with $J$ degrees of freedom if the hypothesis is correct.² Intuitively, the larger $m$ is (that is, the worse the failure of least squares to satisfy the restrictions), the larger the chi-squared statistic. Therefore, a large chi-squared value will weigh against the hypothesis.
    2 This calculation is an application of the "full rank quadratic form" of Section B.10.5.


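    The chi-squared statistic in (6-5) is not usable because of the unknown $\sigma^2$. By using $s^2$ instead of $\sigma^2$ and dividing the result by $J$, we obtain a usable $F$ statistic with $J$ and $n - K$ degrees of freedom. Making the substitution in (6-5), dividing by $J$, and multiplying and dividing by $n - K$, we obtain

    $$F = \frac{W}{J}\,\frac{\sigma^2}{s^2} = \left(\frac{(Rb - q)'[R(X'X)^{-1}R']^{-1}(Rb - q)}{\sigma^2}\right)\left(\frac{1}{J}\right)\left(\frac{\sigma^2}{s^2}\right)\left(\frac{n - K}{n - K}\right) = \frac{(Rb - q)'[\sigma^2 R(X'X)^{-1}R']^{-1}(Rb - q)/J}{[(n - K)s^2/\sigma^2]/(n - K)}. \tag{6-6}$$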

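    If $R\beta = q$, that is, if the null hypothesis is true, then $Rb - q = Rb - R\beta = R(b - \beta) = R(X'X)^{-1}X'\varepsilon$. [See (4-4).] Let $C = [R(X'X)^{-1}R']$. Since

    $$\frac{R(b - \beta)}{\sigma} = R(X'X)^{-1}X'\left(\frac{\varepsilon}{\sigma}\right) = D\left(\frac{\varepsilon}{\sigma}\right),$$

    the numerator of $F$ equals $[(\varepsilon/\sigma)'T(\varepsilon/\sigma)]/J$, where $T = D'C^{-1}D$. The numerator is $W/J$ from (6-5) and is distributed as $1/J$ times a chi-squared[$J$], as we showed earlier. We found in (4-6) that $s^2 = e'e/(n - K) = \varepsilon'M\varepsilon/(n - K)$, where $M$ is an idempotent matrix. Therefore, the denominator of $F$ equals $[(\varepsilon/\sigma)'M(\varepsilon/\sigma)]/(n - K)$. This statistic is distributed as $1/(n - K)$ times a chi-squared[$n - K$]. [See (4-11).] Therefore, the $F$ statistic is the ratio of two chi-squared variables, each divided by its degrees of freedom. Since $M(\varepsilon/\sigma)$ and $T(\varepsilon/\sigma)$ are both normally distributed and their covariance $TM$ is 0, the vectors of the quadratic forms are independent. The numerator and denominator of $F$ are functions of independent random vectors and are therefore independent. This completes the proof of the $F$ distribution. [See (B-35).] Canceling the two appearances of $\sigma^2$ in (6-6) leaves the $F$ statistic for testing a linear hypothesis:

    $$F[J, n - K] = \frac{(Rb - q)'\{R[s^2(X'X)^{-1}]R'\}^{-1}(Rb - q)}{J}.$$

    For testing one linear restriction of the form

    $$H_0: r_1\beta_1 + r_2\beta_2 + \cdots + r_K\beta_K = r'\beta = q$$

    (usually, some of the $r$s will be zero), the $F$ statistic is

    $$F[1, n - K] = \frac{\left(\sum_j r_j b_j - q\right)^2}{\sum_j \sum_k r_j r_k\,\text{Est. Cov}[b_j, b_k]}. \tag{6-7}$$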

    If the hypothesis is that the $j$th coefficient is equal to a particular value, then $R$ has a single row with a 1 in the $j$th position, $R(X'X)^{-1}R'$ is the $j$th diagonal element of the inverse matrix, and $Rb - q$ is $(b_j - q)$. The $F$ statistic is then

    $$F[1, n - K] = \frac{(b_j - q)^2}{\text{Est. Var}[b_j]}.$$

    Consider an alternative approach. The sample estimate of $r'\beta$ is

    $$r_1 b_1 + r_2 b_2 + \cdots + r_K b_K = r'b = \hat{q}.$$


    If $\hat{q}$ differs significantly from $q$, then we conclude that the sample data are not consistent with the hypothesis. It is natural to base the test on

    $$t = \frac{\hat{q} - q}{\text{se}(\hat{q})}. \tag{6-8}$$

    We require an estimate of the standard error of $\hat{q}$. Since $\hat{q}$ is a linear function of $b$ and we have an estimate of the covariance matrix of $b$, $s^2(X'X)^{-1}$, we can estimate the variance of $\hat{q}$ with

    $$\text{Est. Var}[\hat{q} \mid X] = r'[s^2(X'X)^{-1}]r.$$

    The denominator of $t$ is the square root of this quantity. In words, $t$ is the distance in standard error units between the hypothesized function of the true coefficients and the same function of our estimates of them. If the hypothesis is true, then our estimates should reflect that, at least within the range of sampling variability. Thus, if the absolute value of the preceding $t$ ratio is larger than the appropriate critical value, then doubt is cast on the hypothesis.

    There is a useful relationship between the statistics in (6-7) and (6-8). We can write the square of the $t$ statistic as

    $$t^2 = \frac{(\hat{q} - q)^2}{\operatorname{Var}(\hat{q} - q \mid X)} = \frac{(r'b - q)\{r'[s^2(X'X)^{-1}]r\}^{-1}(r'b - q)}{1}.$$



    It follows, therefore, that for testing a single restriction, the $t$ statistic is the square root of the $F$ statistic that would be used to test that hypothesis.
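    The equivalence of the two statistics for a single restriction can be confirmed numerically. The sketch below is an illustration on simulated data, not a computation from the text: the design, the coefficient vector, and the helper name `f_stat` are all hypothetical. It computes the $F$ statistic for $R\beta = q$ using $s^2(X'X)^{-1}$ and compares it with the square of the $t$ ratio in (6-8) for the same restriction.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta = np.array([1.0, 0.5, -0.5, 0.2])    # hypothetical; satisfies beta_2 + beta_3 = 0
y = X @ beta + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                     # unrestricted least squares
e = y - X @ b
s2 = e @ e / (n - K)                      # s^2 = e'e / (n - K)

def f_stat(R, q):
    """F[J, n-K] = (Rb - q)' {R [s^2 (X'X)^-1] R'}^-1 (Rb - q) / J."""
    m = R @ b - q                         # discrepancy vector
    V = R @ (s2 * XtX_inv) @ R.T          # estimated Var[m | X]
    return float(m @ np.linalg.solve(V, m) / R.shape[0])

# Single restriction r'beta = q with r = (0, 1, 1, 0) and q = 0.
r = np.array([0.0, 1.0, 1.0, 0.0])
q = 0.0
t = (r @ b - q) / np.sqrt(r @ (s2 * XtX_inv) @ r)    # t ratio as in (6-8)
F = f_stat(r.reshape(1, -1), np.array([q]))          # F statistic with J = 1

print(t ** 2, F)
```

    The two printed numbers should coincide up to floating point error, whatever the simulated sample happens to be.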
    Example 6.1 Restricted Investment Equation

    Section 6.2 suggested a theory about the behavior of investors: that they care only about real interest rates. If investors were only interested in the real rate of interest, then equal increases in interest rates and the rate of inflation would have no independent effect on investment. The null hypothesis is

    $$H_0: \beta_2 + \beta_3 = 0.$$

    Estimates of the parameters of equations (6-1) and (6-3) using 1950.1 to 2000.4 quarterly data on real investment, real GDP, an interest rate (the 90-day T-bill rate), and inflation measured by the change in the log of the CPI (see Appendix Table F5.1) are given in Table 6.1. (One observation is lost in computing the change in the CPI.)

    TABLE 6.1 Estimated Investment Equations (Estimated standard errors in parentheses)

                    β1         β2           β3           β4         β5
    Model (6-1)     -9.135     -0.00860     0.00331      1.930      -0.00566
                    (1.366)    (0.00319)    (0.00234)    (0.183)    (0.00149)
                    s = 0.08618, R² = 0.979753, e'e = 1.47052, Est. Cov[b2, b3] = -3.718e-6

    Model (6-3)     -7.907     -0.00443     0.00443      1.764      -0.00440
                    (1.201)    (0.00227)    (0.00227)    (0.161)    (0.00133)
                    s = 0.08670, R² = 0.979405, e'e = 1.49578


    To form the appropriate test statistic, we require the standard error of $\hat{q} = b_2 + b_3$, which is

    $$\text{se}(\hat{q}) = [0.00319^2 + 0.00234^2 + 2(-3.718 \times 10^{-6})]^{1/2} = 0.002866.$$

    The $t$ ratio for the test is therefore

    $$t = \frac{-0.00860 + 0.00331}{0.002866} = -1.845.$$
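    The arithmetic of this test can be reproduced from the Table 6.1 entries alone. The sketch below takes the coefficient estimates, standard errors, and covariance for Model (6-1) as reported there, with minus signs on $b_2$ and on the covariance; the variable names are illustrative.

```python
import math

# Entries for Model (6-1) from Table 6.1.
b2, b3 = -0.00860, 0.00331          # coefficients on i_t and on inflation
se_b2, se_b3 = 0.00319, 0.00234     # their estimated standard errors
cov_b2_b3 = -3.718e-6               # Est. Cov[b2, b3]

q_hat = b2 + b3                     # sample estimate of beta_2 + beta_3
# Var[b2 + b3] = Var[b2] + Var[b3] + 2 Cov[b2, b3]
se_q = math.sqrt(se_b2 ** 2 + se_b3 ** 2 + 2.0 * cov_b2_b3)
t = q_hat / se_q                    # hypothesized value is q = 0

print(se_q, t)
```

    The results should match the 0.002866 and -1.845 shown above to the reported precision.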

    Using the 95 percent critical value from $t[203 - 5] = 1.96$ (the standard normal value), we conclude that the sum of the two coefficients is not significantly different from zero, so the hypothesis should not be rejected.

    There will usually be more than one way to formulate a restriction in a regression model. One convenient way to parameterize a constraint is to set it up in such a way that the standard test statistics produced by the regression can be used without further computation to test the hypothesis. In the preceding example, we could write the regression model as specified in (6-2). Then an equivalent way to test $H_0$ would be to fit the investment equation with both the real interest rate and the rate of inflation as regressors and to test our theory by simply testing the hypothesis that $\beta_3$ equals zero, using the standard $t$ statistic that is routinely computed. When the regression is computed this way, $b_3 = -0.00529$ and the estimated standard error is 0.00287, resulting in a $t$ ratio of $-1.844$(!). (Exercise: Suppose that the nominal interest rate, rather than the rate of inflation, were included as the extra regressor. What do you think the coefficient and its standard error would be?)

    Finally, consider a test of the joint hypothesis

    $\beta_2 + \beta_3 = 0$ (investors consider the real interest rate),
    $\beta_4 = 1$ (the marginal propensity to invest equals 1),
    $\beta_5 = 0$ (there is no time trend).

    Then,

    $$R = \begin{bmatrix} 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix}, \quad q = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \quad\text{and}\quad Rb - q = \begin{bmatrix} -0.0053 \\ 0.9302 \\ -0.0057 \end{bmatrix}.$$

    Inserting these values in $F$ yields $F = 109.84$. The 5 percent critical value for $F[3, 199]$ from the table is 2.60. We conclude, therefore, that these data are not consistent with the hypothesis. The result gives no indication as to which of the restrictions is most influential in the rejection of the hypothesis. If the three restrictions are tested one at a time, the $t$ statistics in (6-8) are $-1.844$, $5.076$, and $-3.803$. Based on the individual test statistics, therefore, we would expect both the second and third hypotheses to be rejected.
    6.3.2 THE RESTRICTED LEAST SQUARES ESTIMATOR

    A different approach to hypothesis testing focuses on the fit of the regression. Recall that the least squares vector $b$ was chosen to minimize the sum of squared deviations, $e'e$. Since $R^2$ equals $1 - e'e/y'M^0y$ and $y'M^0y$ is a constant that does not involve $b$, it follows that $b$ is chosen to maximize $R^2$. One might ask whether choosing some other value for the slopes of the regression leads to a significant loss of fit. For example, in the investment equation in Example 6.1, one might be interested in whether assuming the hypothesis (that investors care only about real interest rates) leads to a substantially worse fit than leaving the model unrestricted. To develop the test statistic, we first examine the computation of the least squares estimator subject to a set of restrictions.


    Suppose that we explicitly impose the restrictions of the general linear hypothesis in the regression. The restricted least squares estimator is obtained as the solution to

    $$\text{Minimize}_{b_0}\ S(b_0) = (y - Xb_0)'(y - Xb_0) \quad\text{subject to}\quad Rb_0 = q. \tag{6-9}$$

    A Lagrangean function for this problem can be written

    $$L^*(b_0, \lambda) = (y - Xb_0)'(y - Xb_0) + 2\lambda'(Rb_0 - q).^3 \tag{6-10}$$

    The solutions $b_*$ and $\lambda_*$ will satisfy the necessary conditions

    $$\frac{\partial L^*}{\partial b_*} = -2X'(y - Xb_*) + 2R'\lambda_* = 0, \qquad \frac{\partial L^*}{\partial \lambda_*} = 2(Rb_* - q) = 0. \tag{6-11}$$

    Dividing through by 2 and expanding terms produces the partitioned matrix equation

    $$\begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix}\begin{bmatrix} b_* \\ \lambda_* \end{bmatrix} = \begin{bmatrix} X'y \\ q \end{bmatrix} \tag{6-12}$$

    or

    $$Ad_* = v.$$

    Assuming that the partitioned matrix in brackets is nonsingular, the restricted least squares estimator is the upper part of the solution

    $$d_* = A^{-1}v. \tag{6-13}$$

    If, in addition, $X'X$ is nonsingular, then explicit solutions for $b_*$ and $\lambda_*$ may be obtained by using the formula for the partitioned inverse (A-74),⁴

    $$b_* = b - (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(Rb - q) = b - Cm$$

    and

    $$\lambda_* = [R(X'X)^{-1}R']^{-1}(Rb - q). \tag{6-14}$$

    Greene and Seaks (1991) show that the covariance matrix for $b_*$ is simply $\sigma^2$ times the upper left block of $A^{-1}$. Once again, in the usual case in which $X'X$ is nonsingular, an explicit formulation may be obtained:

    $$\operatorname{Var}[b_* \mid X] = \sigma^2(X'X)^{-1} - \sigma^2(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1}. \tag{6-15}$$

    Thus,

    $$\operatorname{Var}[b_* \mid X] = \operatorname{Var}[b \mid X] - \text{a nonnegative definite matrix}.$$

    3 Since $\lambda$ is not restricted, we can formulate the constraints in terms of $2\lambda$. Why this scaling is convenient will be clear shortly.

    4 The general solution given for $d_*$ may be usable even if $X'X$ is singular. Suppose, for example, that $X'X$ is $4 \times 4$ with rank 3. Then $X'X$ is singular. But if there is a parametric restriction on $\beta$, then the $5 \times 5$ matrix in brackets may still have rank 5. This formulation and a number of related results are given in Greene and Seaks (1991).


    One way to interpret this reduction in variance is as the value of the information contained in the restrictions. Note that the explicit solution for $\boldsymbol{\lambda}_*$ involves the discrepancy vector $\mathbf{R}\mathbf{b} - \mathbf{q}$. If the unrestricted least squares estimator satisfies the restriction, the Lagrangean multipliers will equal zero and $\mathbf{b}_*$ will equal $\mathbf{b}$. Of course, this is unlikely. The constrained solution $\mathbf{b}_*$ is equal to the unconstrained solution $\mathbf{b}$ plus a term that accounts for the failure of the unrestricted solution to satisfy the constraints.
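    The two routes to the restricted estimator described above can be compared numerically. The following is a minimal sketch, not part of the original text: the data are simulated, and the restriction (two slopes summing to 1) and all variable names are assumptions chosen purely for illustration. It solves the partitioned system of (6-12) directly and also evaluates the explicit solution of (6-14).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n observations on a constant and three regressors.
n, K = 50, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 0.5, 0.5, -0.2]) + rng.normal(scale=0.3, size=n)

# One illustrative restriction (J = 1): the first two slopes sum to 1.
R = np.array([[0.0, 1.0, 1.0, 0.0]])
q = np.array([1.0])
J = R.shape[0]

XtX = X.T @ X
Xty = X.T @ y
b = np.linalg.solve(XtX, Xty)            # unrestricted least squares

# Route 1: solve the partitioned system A d = v of (6-12).
A = np.block([[XtX, R.T], [R, np.zeros((J, J))]])
v = np.concatenate([Xty, q])
d = np.linalg.solve(A, v)
b_star_system, lam_system = d[:K], d[K:]

# Route 2: explicit solution (6-14), which requires X'X to be nonsingular.
XtX_inv = np.linalg.inv(XtX)
m = R @ b - q                            # discrepancy vector Rb - q
middle = np.linalg.inv(R @ XtX_inv @ R.T)
lam_star = middle @ m                    # Lagrange multipliers
b_star = b - XtX_inv @ R.T @ lam_star    # restricted estimator

# Matrix subtracted in (6-15), apart from the factor sigma^2.
reduction = XtX_inv @ R.T @ middle @ R @ XtX_inv

print("unrestricted b:", b)
print("restricted  b*:", b_star)
```

    Both routes should return the same vector, and the restricted estimate satisfies the constraint exactly. The matrix `reduction` is nonnegative definite, which is the sense in which imposing the restriction cannot increase the variance.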
    6.3.3 THE LOSS OF FIT FROM RESTRICTED LEAST SQUARES

    To develop a test based on the restricted least squares estimator, we consider a single coefficient first, then turn to the general case of $J$ linear restrictions. Consider the change in the fit of a multiple regression when a variable $\mathbf{z}$ is added to a model that already contains $K - 1$ variables, $\mathbf{x}$. We showed in Section 3.5 (Theorem 3.6), (3-29), that the effect on the fit would be given by

    $$R^2_{Xz} = R^2_X + (1 - R^2_X)\, r^{*2}_{yz}, \tag{6-16}$$

    where $R^2_{Xz}$ is the new $R^2$ after $\mathbf{z}$ is added, $R^2_X$ is the original $R^2$, and $r^*_{yz}$ is the partial correlation between $\mathbf{y}$ and $\mathbf{z}$, controlling for $\mathbf{x}$. So, as we knew, the fit improves (or, at the least, does not deteriorate). In deriving the partial correlation coefficient between $\mathbf{y}$ and $\mathbf{z}$ in (3-23) we obtained the convenient result

    $$r^{*2}_{yz} = \frac{t_z^2}{t_z^2 + (n - K)}, \tag{6-17}$$

    where $t_z^2$ is the square of the t ratio for testing the hypothesis that the coefficient on $\mathbf{z}$ is zero in the multiple regression of $\mathbf{y}$ on $\mathbf{X}$ and $\mathbf{z}$. If we solve (6-16) for $r^{*2}_{yz}$ and (6-17) for $t_z^2$ and then insert the first solution in the second, then we obtain the result

    $$t_z^2 = \frac{R^2_{Xz} - R^2_X}{(1 - R^2_{Xz})/(n - K)}. \tag{6-18}$$

    We saw at the end of Section 6.3.1 that for a single restriction, such as $\beta_z = 0$,

    $$F[1, n - K] = t^2[n - K],$$

    which gives us our result. That is, in (6-18), we see that the squared t statistic (i.e., the F statistic) is computed using the change in the $R^2$. By interpreting the preceding as the result of removing $\mathbf{z}$ from the regression, we see that we have proved a result for the case of testing whether a single slope is zero. But the preceding result is general. The test statistic for a single linear restriction is the square of the t ratio in (6-8). By this construction, we see that for a single restriction, F is a measure of the loss of fit that results from imposing that restriction. To obtain this result, we will proceed to the general case of $J$ linear restrictions, which will include one restriction as a special case.

    The fit of the restricted least squares coefficients cannot be better than that of the unrestricted solution. Let $\mathbf{e}_*$ equal $\mathbf{y} - \mathbf{X}\mathbf{b}_*$. Then, using a familiar device,

    $$\mathbf{e}_* = \mathbf{y} - \mathbf{X}\mathbf{b} - \mathbf{X}(\mathbf{b}_* - \mathbf{b}) = \mathbf{e} - \mathbf{X}(\mathbf{b}_* - \mathbf{b}).$$

    The new sum of squared deviations is

    $$\mathbf{e}_*'\mathbf{e}_* = \mathbf{e}'\mathbf{e} + (\mathbf{b}_* - \mathbf{b})'\mathbf{X}'\mathbf{X}(\mathbf{b}_* - \mathbf{b}) \geq \mathbf{e}'\mathbf{e}.$$


    (The middle term in the expression involves $\mathbf{X}'\mathbf{e}$, which is zero.) The loss of fit is

    $$\mathbf{e}_*'\mathbf{e}_* - \mathbf{e}'\mathbf{e} = (\mathbf{R}\mathbf{b} - \mathbf{q})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b} - \mathbf{q}). \tag{6-19}$$

    This expression appears in the numerator of the F statistic in (6-7). Inserting the remaining parts, we obtain

    $$F[J, n - K] = \frac{(\mathbf{e}_*'\mathbf{e}_* - \mathbf{e}'\mathbf{e})/J}{\mathbf{e}'\mathbf{e}/(n - K)}. \tag{6-20}$$

    Finally, by dividing both numerator and denominator of F by $\sum_i (y_i - \bar{y})^2$, we obtain the general result:

    $$F[J, n - K] = \frac{(R^2 - R_*^2)/J}{(1 - R^2)/(n - K)}. \tag{6-21}$$

    This form has some intuitive appeal in that the difference in the fits of the two models is directly incorporated in the test statistic. As an example of this approach, consider the earlier joint test that all of the slopes in the model are zero. This is the overall F ratio discussed in Section 4.7.4 (4-15), where $R_*^2 = 0$.

    For imposing a set of exclusion restrictions such as $\beta_k = 0$ for one or more coefficients, the obvious approach is simply to omit the variables from the regression and base the test on the sums of squared residuals for the restricted and unrestricted regressions. The F statistic for testing the hypothesis that a subset, say $\boldsymbol{\beta}_2$, of the coefficients are all zero is constructed using $\mathbf{R} = (\mathbf{0} : \mathbf{I})$, $\mathbf{q} = \mathbf{0}$, and $J = K_2$, the number of elements in $\boldsymbol{\beta}_2$. The matrix $\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'$ is the $K_2 \times K_2$ lower right block of the full inverse matrix. Using our earlier results for partitioned inverses and the results of Section 3.3, we have

    $$\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}' = (\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2)^{-1} \quad \text{and} \quad \mathbf{R}\mathbf{b} - \mathbf{q} = \mathbf{b}_2.$$

    Inserting these in (6-19) gives the loss of fit that results when we drop a subset of the variables from the regression:

    $$\mathbf{e}_*'\mathbf{e}_* - \mathbf{e}'\mathbf{e} = \mathbf{b}_2'\mathbf{X}_2'\mathbf{M}_1\mathbf{X}_2\mathbf{b}_2.$$

    The procedure for computing the appropriate F statistic amounts simply to comparing the sums of squared deviations from the "short" and "long" regressions, which we saw earlier.
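    The equivalence of the three expressions (6-19), (6-20), and (6-21) for a set of exclusion restrictions can be checked on simulated data. This is a sketch under stated assumptions: the data-generating process, the helper `ssr`, and all variable names are hypothetical and serve only to illustrate the algebra.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "long" regression: X = [X1, X2]; the "short" one uses X1 only.
n, K1, K2 = 60, 3, 2
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, K1 - 1))])
X2 = rng.normal(size=(n, K2))
X = np.column_stack([X1, X2])
K = K1 + K2
y = X1 @ np.array([1.0, 0.8, -0.5]) + 0.1 * X2[:, 0] + rng.normal(size=n)

def ssr(Xmat, yvec):
    """Return least squares coefficients and the sum of squared residuals."""
    coef, *_ = np.linalg.lstsq(Xmat, yvec, rcond=None)
    resid = yvec - Xmat @ coef
    return coef, resid @ resid

b, ee = ssr(X, y)             # unrestricted (long) regression
_, ee_star = ssr(X1, y)       # restricted (short) regression, beta_2 = 0

J = K2
# F statistic from the sums of squared residuals, as in (6-20).
F_ssr = ((ee_star - ee) / J) / (ee / (n - K))

# F statistic from the two R^2 values, as in (6-21); both use the same y.
tss = np.sum((y - y.mean()) ** 2)
R2, R2_star = 1 - ee / tss, 1 - ee_star / tss
F_r2 = ((R2 - R2_star) / J) / ((1 - R2) / (n - K))

# Loss of fit from the quadratic form in (6-19) with R = [0 : I], q = 0.
R = np.hstack([np.zeros((K2, K1)), np.eye(K2)])
m = R @ b
loss = m @ np.linalg.inv(R @ np.linalg.inv(X.T @ X) @ R.T) @ m

print(F_ssr, F_r2, loss, ee_star - ee)
```

    The two F values agree because both regressions share the same dependent variable, so the total sum of squares cancels.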
    Example 6.2 Production Function

    The data in Appendix Table F6.1 have been used in several studies of production functions.[5] Least squares regression of log output (value added) on a constant and the logs of labor and capital produces the estimates of a Cobb-Douglas production function shown in Table 6.2. We will construct several hypothesis tests based on these results. A generalization of the Cobb-Douglas model is the translog model,[6] which is

    $$\ln Y = \beta_1 + \beta_2 \ln L + \beta_3 \ln K + \beta_4\left(\tfrac{1}{2}\ln^2 L\right) + \beta_5\left(\tfrac{1}{2}\ln^2 K\right) + \beta_6 \ln L \ln K + \varepsilon.$$

    [5] The data are statewide observations on SIC 33, the primary metals industry. They were originally constructed by Hildebrand and Liu (1957) and have subsequently been used by a number of authors, notably Aigner, Lovell, and Schmidt (1977). The 28th data point used in the original study is incomplete; we have used only the remaining 27.

    [6] Berndt and Christensen (1973). See Example 2.5 for discussion.

    TABLE 6.2  Estimated Production Functions

                                       Translog    Cobb-Douglas
    Sum of squared residuals           0.67993     0.85163
    Standard error of regression       0.17994     0.18840
    R-squared                          0.95486     0.94346
    Adjusted R-squared                 0.94411     0.93875
    Number of observations             27          27

                      Translog                               Cobb-Douglas
    Variable          Coefficient  Std. Error  t Ratio       Coefficient  Std. Error  t Ratio
    Constant           0.944196    2.911        0.324        1.171        0.3268      3.583
    ln L               3.61363     1.548        2.334        0.6030       0.1260      4.787
    ln K              -1.89311     1.016       -1.863        0.3757       0.0853      4.402
    (1/2) ln^2 L      -0.96406     0.7074      -1.363
    (1/2) ln^2 K       0.08529     0.2926       0.291
    ln L x ln K        0.31239     0.4389       0.712

    Estimated Covariance Matrix for Translog (Cobb-Douglas) Coefficient Estimates

                      Constant             ln L                 ln K               (1/2)ln^2 L  (1/2)ln^2 K  ln L ln K
    Constant           8.472 (0.1068)
    ln L              -2.388 (-0.01984)     2.397 (0.01586)
    ln K              -0.3313 (0.00189)    -1.231 (-0.00961)    1.033 (0.00728)
    (1/2) ln^2 L      -0.08760             -0.6658              0.5231             0.5004
    (1/2) ln^2 K      -0.2332               0.03477             0.02637            0.1467       0.08562
    ln L ln K          0.3635               0.1831             -0.2255            -0.2880      -0.1160       0.1927

    As we shall analyze further in Chapter 14, this model differs from the Cobb-Douglas model in that it relaxes the Cobb-Douglas's assumption of a unitary elasticity of substitution. The Cobb-Douglas model is obtained by the restriction $\beta_4 = \beta_5 = \beta_6 = 0$. The results for the two regressions are given in Table 6.2. The F statistic for the hypothesis of a Cobb-Douglas model is

    $$F[3, 21] = \frac{(0.85163 - 0.67993)/3}{0.67993/21} = 1.768.$$

    The critical value from the F table is 3.07, so we would not reject the hypothesis that a Cobb-Douglas model is appropriate.

    The hypothesis of constant returns to scale is often tested in studies of production. This hypothesis is equivalent to a restriction that the two coefficients of the Cobb-Douglas production function sum to 1. For the preceding data,

    $$F[1, 24] = \frac{(0.6030 + 0.3757 - 1)^2}{0.01586 + 0.00728 - 2(0.00961)} = 0.1157,$$


    which is substantially less than the critical value given earlier. We would not reject the hypothesis; the data are consistent with the hypothesis of constant returns to scale. The equivalent test for the translog model would be $\beta_2 + \beta_3 = 1$ and $\beta_4 + \beta_5 + 2\beta_6 = 0$. The F statistic with 2 and 21 degrees of freedom is 1.8891, which is less than the critical value of 3.49. Once again, the hypothesis is not rejected.

    In most cases encountered in practice, it is possible to incorporate the restrictions of a hypothesis directly on the regression and estimate a restricted model.[7] For example, to impose the constraint $\beta_2 = 1$ on the Cobb-Douglas model, we would write

    $$\ln Y = \beta_1 + 1.0 \ln L + \beta_3 \ln K + \varepsilon$$

    or

    $$\ln Y - \ln L = \beta_1 + \beta_3 \ln K + \varepsilon.$$

    Thus, the restricted model is estimated by regressing $\ln Y - \ln L$ on a constant and $\ln K$. Some care is needed if this regression is to be used to compute an F statistic. If the F statistic is computed using the sum of squared residuals [see (6-20)], then no problem will arise. If (6-21) is used instead, however, then it may be necessary to account for the restricted regression having a different dependent variable from the unrestricted one. In the preceding regression, the dependent variable in the unrestricted regression is $\ln Y$, whereas in the restricted regression, it is $\ln Y - \ln L$. The $R^2$ from the restricted regression is only 0.26979, which would imply an F statistic of 285.96, whereas the correct value is 9.375. If we compute the appropriate $R_*^2$ using the correct denominator, however, then its value is 0.94339 and the correct F value results.

    Note that the coefficient on $\ln K$ is negative in the translog model. We might conclude that the estimated output elasticity with respect to capital now has the wrong sign. This conclusion would be incorrect, however; in the translog model, the capital elasticity of output is

    $$\frac{\partial \ln Y}{\partial \ln K} = \beta_3 + \beta_5 \ln K + \beta_6 \ln L.$$

    If we insert the coefficient estimates and the mean values for $\ln K$ and $\ln L$ (not the logs of the means) of 7.44592 and 5.7637, respectively, then the result is 0.5425, which is quite in line with our expectations and is fairly close to the value of 0.3757 obtained for the Cobb-Douglas model. The estimated standard error for this linear combination of the least squares estimates is computed as the square root of

    $$\text{Est.Var}[b_3 + b_5 \overline{\ln K} + b_6 \overline{\ln L}] = \mathbf{w}'(\text{Est.Var}[\mathbf{b}])\mathbf{w},$$

    where $\mathbf{w} = (0, 0, 1, 0, \overline{\ln K}, \overline{\ln L})'$ and $\mathbf{b}$ is the full $6 \times 1$ least squares coefficient vector. This value is 0.1122, which is reasonably close to the earlier estimate of 0.0853.
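    The arithmetic in Example 6.2 can be reproduced directly from the rounded figures quoted in the text and in Table 6.2. This is a worked sketch only; the variable names are hypothetical, and the negative sign on the covariance between the labor and capital coefficients is taken from the "$- 2(0.00961)$" term in the constant returns F statistic.

```python
# Values quoted in Table 6.2 and the text of Example 6.2 (rounded as printed).
ssr_translog, ssr_cd = 0.67993, 0.85163
n = 27

# Cobb-Douglas restriction beta4 = beta5 = beta6 = 0: J = 3, n - K = 27 - 6.
J, df = 3, n - 6
f_cobb_douglas = ((ssr_cd - ssr_translog) / J) / (ssr_translog / df)

# Constant returns to scale in the Cobb-Douglas model: b_L + b_K = 1.
b_l, b_k = 0.6030, 0.3757
var_l, var_k, cov_lk = 0.01586, 0.00728, -0.00961
f_crs = (b_l + b_k - 1.0) ** 2 / (var_l + var_k + 2.0 * cov_lk)

# The pitfall with (6-21): plugging in an R^2 from a regression whose
# dependent variable is ln Y - ln L rather than ln Y.
r2_unrestricted, r2_other_depvar = 0.94346, 0.26979
f_misleading = (r2_unrestricted - r2_other_depvar) / ((1 - r2_unrestricted) / 24)

# Capital elasticity in the translog model at the means of ln K and ln L.
b3, b5, b6 = -1.89311, 0.08529, 0.31239
mean_ln_k, mean_ln_l = 7.44592, 5.7637
elasticity_k = b3 + b5 * mean_ln_k + b6 * mean_ln_l

print(f"F[3,21] for Cobb-Douglas : {f_cobb_douglas:.3f}")
print(f"F[1,24] for constant RTS : {f_crs:.4f}")
print(f"misleading F from (6-21) : {f_misleading:.2f}")
print(f"capital elasticity       : {elasticity_k:.4f}")
```

    Small differences in the last digit relative to the printed values are to be expected, since the inputs are themselves rounded.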

    [7] This case is not true when the restrictions are nonlinear. We consider this issue in Chapter 9.

    6.4 NONNORMAL DISTURBANCES AND LARGE-SAMPLE TESTS

    The distributions of the F, t, and chi-squared statistics that we used in the previous section rely on the assumption of normally distributed disturbances. Without this assumption, the exact distributions of these statistics depend on the data and the parameters and are not F, t, and chi-squared. At least at first blush, it would seem that we need either a new set of critical values for the tests or perhaps a new set of test statistics. In this section, we will examine results that will generalize the familiar procedures. These large-sample results suggest that although the usual t and F statistics are still usable, in the more general case without the special assumption of normality, they are viewed as approximations whose quality improves as the sample size increases. By using the results of Section D.3 (on asymptotic distributions) and some large-sample results for the least squares estimator, we can construct a set of usable inference procedures based on already familiar computations.

    Assuming the data are well behaved, the asymptotic distribution of the least squares coefficient estimator, $\mathbf{b}$, is given by

    $$\mathbf{b} \overset{a}{\sim} N\left[\boldsymbol{\beta}, \frac{\sigma^2}{n}\mathbf{Q}^{-1}\right], \quad \text{where } \mathbf{Q} = \text{plim}\left(\frac{\mathbf{X}'\mathbf{X}}{n}\right). \tag{6-22}$$

    The interpretation is that, absent normality of $\boldsymbol{\varepsilon}$, as the sample size, $n$, grows, the normal distribution becomes an increasingly better approximation to the true, though at this point unknown, distribution of $\mathbf{b}$. As $n$ increases, the distribution of $\sqrt{n}(\mathbf{b} - \boldsymbol{\beta})$ converges exactly to a normal distribution, which is how we obtain the finite sample approximation above. This result is based on the central limit theorem and does not require normally distributed disturbances. The second result we will need concerns the estimator of $\sigma^2$:

    $$\text{plim}\, s^2 = \sigma^2, \quad \text{where } s^2 = \mathbf{e}'\mathbf{e}/(n - K).$$

    With these in place, we can obtain some large-sample results for our test statistics that suggest how to proceed in a finite sample with nonnormal disturbances.

    The sample statistic for testing the hypothesis that one of the coefficients, $\beta_k$, equals a particular value, $\beta_k^0$, is

    $$t_k = \frac{\sqrt{n}\,(b_k - \beta_k^0)}{\sqrt{s^2 (\mathbf{X}'\mathbf{X}/n)^{-1}_{kk}}}.$$

    (Note that two occurrences of $\sqrt{n}$ cancel to produce our familiar result.) Under the null hypothesis, with normally distributed disturbances, $t_k$ is exactly distributed as t with $n - K$ degrees of freedom. [See Theorem 4.4 and (4-13).] The exact distribution of this statistic is unknown, however, if $\boldsymbol{\varepsilon}$ is not normally distributed. From the results above, we find that the denominator of $t_k$ converges to $\sqrt{\sigma^2 \mathbf{Q}^{-1}_{kk}}$. Hence, if $t_k$ has a limiting distribution, then it is the same as that of the statistic that has this latter quantity in the denominator. That is, the large-sample distribution of $t_k$ is the same as that of

    $$\tau_k = \frac{\sqrt{n}\,(b_k - \beta_k^0)}{\sqrt{\sigma^2 \mathbf{Q}^{-1}_{kk}}}.$$

    But $\tau_k = (b_k - E[b_k])/(\text{Asy.Var}[b_k])^{1/2}$ from the asymptotic normal distribution (under the hypothesis $\beta_k = \beta_k^0$), so it follows that $\tau_k$ has a standard normal asymptotic distribution, and this result is the large-sample distribution of our t statistic. Thus, as a large-sample approximation, we will use the standard normal distribution to approximate


    the true distribution of the test statistic $t_k$ and use the critical values from the standard normal distribution for testing hypotheses.

    The result in the preceding paragraph is valid only in large samples. For moderately sized samples, it provides only a suggestion that the t distribution may be a reasonable approximation. The appropriate critical values only converge to those from the standard normal, and generally from above, although we cannot be sure of this. In the interest of conservatism, that is, in controlling the probability of a type I error, one should generally use the critical value from the t distribution even in the absence of normality. Consider, for example, using the standard normal critical value of 1.96 for a two-tailed test of a hypothesis based on 25 degrees of freedom. The nominal size of this test is 0.05. The actual size of the test, however, is the true, but unknown, probability that $|t_k| > 1.96$, which is 0.0612 if the $t[25]$ distribution is correct, and some other value if the disturbances are not normally distributed. The end result is that the standard t test retains a large-sample validity. Little can be said about the true size of a test based on the t distribution unless one makes some other equally narrow assumption about $\boldsymbol{\varepsilon}$, but the t distribution is generally used as a reliable approximation.

    We will use the same approach to analyze the F statistic for testing a set of $J$ linear restrictions. Step 1 will be to show that with normally distributed disturbances, $JF$ converges to a chi-squared variable as the sample size increases. We will then show that this result is actually independent of the normality of the disturbances; it relies on the central limit theorem. Finally, we consider, as above, the appropriate critical values to use for this test statistic, which only has large-sample validity.

    The F statistic for testing the validity of $J$ linear restrictions, $\mathbf{R}\boldsymbol{\beta} - \mathbf{q} = \mathbf{0}$, is given in (6-6). With normally distributed disturbances and under the null hypothesis, the exact distribution of this statistic is $F[J, n - K]$. To see how F behaves more generally, divide the numerator and denominator in (6-6) by $\sigma^2$ and rearrange the fraction slightly, so

    $$F = \frac{(\mathbf{R}\mathbf{b} - \mathbf{q})'\{\mathbf{R}[\sigma^2(\mathbf{X}'\mathbf{X})^{-1}]\mathbf{R}'\}^{-1}(\mathbf{R}\mathbf{b} - \mathbf{q})}{J(s^2/\sigma^2)}. \tag{6-23}$$

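    Since $\text{plim}\, s^2 = \sigma^2$, and $\text{plim}(\mathbf{X}'\mathbf{X}/n) = \mathbf{Q}$, the denominator of F converges to $J$ and the bracketed term in the numerator will behave the same as $(\sigma^2/n)\mathbf{R}\mathbf{Q}^{-1}\mathbf{R}'$. Hence, regardless of what this distribution is, if F has a limiting distribution, then it is the same as the limiting distribution of

    $$W^* = \frac{1}{J}(\mathbf{R}\mathbf{b} - \mathbf{q})'[\mathbf{R}(\sigma^2/n)\mathbf{Q}^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b} - \mathbf{q}) = \frac{1}{J}(\mathbf{R}\mathbf{b} - \mathbf{q})'\{\text{Asy.Var}[\mathbf{R}\mathbf{b} - \mathbf{q}]\}^{-1}(\mathbf{R}\mathbf{b} - \mathbf{q}).$$

    This expression is $(1/J)$ times a Wald statistic, based on the asymptotic distribution. The large-sample distribution of $W^*$ will be that of $(1/J)$ times a chi-squared with $J$ degrees of freedom. It follows that with normally distributed disturbances, $JF$ converges to a chi-squared variate with $J$ degrees of freedom. The proof is instructive. See White (2001, 9.76).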


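    THEOREM 6.1 Limiting Distribution of the Wald Statistic

    If $\sqrt{n}(\mathbf{b} - \boldsymbol{\beta}) \overset{d}{\to} N[\mathbf{0}, \sigma^2\mathbf{Q}^{-1}]$ and if $H_0: \mathbf{R}\boldsymbol{\beta} - \mathbf{q} = \mathbf{0}$ is true, then

    $$W = (\mathbf{R}\mathbf{b} - \mathbf{q})'\{\mathbf{R}s^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'\}^{-1}(\mathbf{R}\mathbf{b} - \mathbf{q}) = JF \overset{d}{\to} \chi^2[J].$$

    Proof: Since $\mathbf{R}$ is a matrix of constants and $\mathbf{R}\boldsymbol{\beta} = \mathbf{q}$,

    $$\sqrt{n}\,\mathbf{R}(\mathbf{b} - \boldsymbol{\beta}) = \sqrt{n}(\mathbf{R}\mathbf{b} - \mathbf{q}) \overset{d}{\to} N[\mathbf{0}, \mathbf{R}(\sigma^2\mathbf{Q}^{-1})\mathbf{R}']. \tag{1}$$

    For convenience, write this equation as

    $$\mathbf{z} \overset{d}{\to} N[\mathbf{0}, \mathbf{P}]. \tag{2}$$

    In Section A.6.11, we define the inverse square root of a positive definite matrix $\mathbf{P}$ as another matrix, say $\mathbf{T}$, such that $\mathbf{T}^2 = \mathbf{P}^{-1}$, and denote $\mathbf{T}$ as $\mathbf{P}^{-1/2}$. Let $\mathbf{T}$ be the inverse square root of $\mathbf{P}$. Then, by the same reasoning as in (1) and (2),

    $$\text{if } \mathbf{z} \overset{d}{\to} N[\mathbf{0}, \mathbf{P}], \text{ then } \mathbf{P}^{-1/2}\mathbf{z} \overset{d}{\to} N[\mathbf{0}, \mathbf{P}^{-1/2}\mathbf{P}\mathbf{P}^{-1/2}] = N[\mathbf{0}, \mathbf{I}]. \tag{3}$$

    We now invoke Theorem D.21 for the limiting distribution of a function of a random variable. The sum of squares of uncorrelated (i.e., independent) standard normal variables is distributed as chi-squared. Thus, the limiting distribution of

    $$(\mathbf{P}^{-1/2}\mathbf{z})'(\mathbf{P}^{-1/2}\mathbf{z}) = \mathbf{z}'\mathbf{P}^{-1}\mathbf{z} \overset{d}{\to} \chi^2(J). \tag{4}$$

    Reassembling the parts from before, we have shown that the limiting distribution of

    $$n(\mathbf{R}\mathbf{b} - \mathbf{q})'[\mathbf{R}(\sigma^2\mathbf{Q}^{-1})\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b} - \mathbf{q}) \tag{5}$$

    is chi-squared, with $J$ degrees of freedom. Note the similarity of this result to the results of Section B.11.6. Finally, if

    $$\text{plim}\; s^2\left(\frac{1}{n}\mathbf{X}'\mathbf{X}\right)^{-1} = \sigma^2\mathbf{Q}^{-1}, \tag{6}$$

    then the statistic obtained by replacing $\sigma^2\mathbf{Q}^{-1}$ by $s^2(\mathbf{X}'\mathbf{X}/n)^{-1}$ in (5) has the same limiting distribution. The $n$'s cancel, and we are left with the same Wald statistic we looked at before. This step completes the proof.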

    The appropriate critical values for the F test of the restrictions $\mathbf{R}\boldsymbol{\beta} - \mathbf{q} = \mathbf{0}$ converge from above to $1/J$ times those for a chi-squared test based on the Wald statistic (see the Appendix tables). For example, for testing $J = 5$ restrictions, the critical value from the chi-squared table (Appendix Table G.4) for 95 percent significance is 11.07. The critical values from the F table (Appendix Table G.5) are 3.33 = 16.65/5 for $n - K = 10$, 2.60 = 13.00/5 for $n - K = 25$, 2.40 = 12.00/5 for $n - K = 50$, 2.31 = 11.55/5 for $n - K = 100$, and 2.214 = 11.07/5 for large $n - K$. Thus, with normally distributed disturbances, as $n$ gets large, the F test can be carried out by referring $JF$ to the critical values from the chi-squared table.
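    The convergence "from above" can be seen by scaling the tabulated F critical values by $J$. The short sketch below uses only the table values quoted in the paragraph above; the dictionary and variable names are hypothetical.

```python
# Critical values quoted in the text for J = 5 restrictions at the 5 percent level.
J = 5
chi2_critical = 11.07                                  # chi-squared with 5 d.f.
f_critical = {10: 3.33, 25: 2.60, 50: 2.40, 100: 2.31}  # keyed by n - K

# J times each F critical value is directly comparable with the chi-squared value.
scaled = {df: J * f for df, f in f_critical.items()}

for df, value in scaled.items():
    print(f"n - K = {df:3d}: J x F critical = {value:5.2f}  (chi-squared: {chi2_critical})")
```

    Each scaled value exceeds 11.07, and the gap narrows as $n - K$ grows, which is why the F table gives the more conservative test in small samples.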


    The crucial result for our purposes here is that the distribution of the Wald statistic is built up from the distribution of $\mathbf{b}$, which is asymptotically normal even without normally distributed disturbances. The implication is that an appropriate large-sample test statistic is chi-squared $= JF$. Once again, this implication relies on the central limit theorem, not on normally distributed disturbances. Now, what is the appropriate approach for a small or moderately sized sample? As we saw earlier, the critical values for the F distribution converge from above to $(1/J)$ times those for the preceding chi-squared distribution. As before, one cannot say that this will always be true in every case for every possible configuration of the data and parameters. Without some special configuration of the data and parameters, however, one can expect it to occur generally. The implication is that absent some additional firm characterization of the model, the F statistic, with the critical values from the F table, remains a conservative approach that becomes more accurate as the sample size increases.

    Exercise 7 at the end of this chapter suggests another approach to testing that has validity in large samples, a Lagrange multiplier test. The vector of Lagrange multipliers in (6-14) is $[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b} - \mathbf{q})$, that is, a multiple of the least squares discrepancy vector. In principle, a test of the hypothesis that $\boldsymbol{\lambda}$ equals zero should be equivalent to a test of the null hypothesis. Since the leading matrix has full rank, this can only equal zero if the discrepancy equals zero. A Wald test of the hypothesis that $\boldsymbol{\lambda} = \mathbf{0}$ is indeed a valid way to proceed. The large-sample distribution of the Wald statistic would be chi-squared with $J$ degrees of freedom. (The procedure is considered in Exercise 7.) For a set of exclusion restrictions, $\boldsymbol{\beta}_2 = \mathbf{0}$, there is a simple way to carry out this test. The chi-squared statistic, in this case with $K_2$ degrees of freedom, can be computed as $nR^2$ in the regression of $\mathbf{e}_*$ (the residuals in the short regression) on the full set of independent variables.

    6.5 TESTING NONLINEAR RESTRICTIONS

    The preceding discussion has relied heavily on the linearity of the regression model. When we analyze nonlinear functions of the parameters and nonlinear regression models, most of these exact distributional results no longer hold.

    The general problem is that of testing a hypothesis that involves a nonlinear function of the regression coefficients:

    $$H_0: c(\boldsymbol{\beta}) = q.$$

    We shall look first at the case of a single restriction. The more general one, in which $\mathbf{c}(\boldsymbol{\beta}) = \mathbf{q}$ is a set of restrictions, is a simple extension. The counterpart to the test statistic we used earlier would be

    $$z = \frac{c(\hat{\boldsymbol{\beta}}) - q}{\text{estimated standard error}} \tag{6-24}$$

    or its square, which in the preceding were distributed as $t[n - K]$ and $F[1, n - K]$, respectively. The discrepancy in the numerator presents no difficulty. Obtaining an estimate of the sampling variance of $c(\hat{\boldsymbol{\beta}}) - q$, however, involves the variance of a nonlinear function of $\hat{\boldsymbol{\beta}}$.


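    The results we need for this computation are presented in Sections B.10.3 and D.3.1. A linear Taylor series approximation to $c(\hat{\boldsymbol{\beta}})$ around the true parameter vector $\boldsymbol{\beta}$ is

    $$c(\hat{\boldsymbol{\beta}}) \approx c(\boldsymbol{\beta}) + \left(\frac{\partial c(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\right)'(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}). \tag{6-25}$$

    We must rely on consistency rather than unbiasedness here, since, in general, the expected value of a nonlinear function is not equal to the function of the expected value. If $\text{plim}\, \hat{\boldsymbol{\beta}} = \boldsymbol{\beta}$, then we are justified in using $c(\hat{\boldsymbol{\beta}})$ as an estimate of $c(\boldsymbol{\beta})$. (The relevant result is the Slutsky theorem.) Assuming that our use of this approximation is appropriate, the variance of the nonlinear function is approximately equal to the variance of the right-hand side, which is, then,

    $$\text{Var}[c(\hat{\boldsymbol{\beta}})] \approx \left(\frac{\partial c(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\right)' \text{Var}[\hat{\boldsymbol{\beta}}] \left(\frac{\partial c(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\right). \tag{6-26}$$

    The derivatives in the expression for the variance are functions of the unknown parameters. Since these are being estimated, we use our sample estimates in computing the derivatives. To estimate the variance of the estimator, we can use $s^2(\mathbf{X}'\mathbf{X})^{-1}$. Finally, we rely on Theorem D.22 in Section D.3.1 and use the standard normal distribution instead of the t distribution for the test statistic. Using $\mathbf{g}(\hat{\boldsymbol{\beta}})$ to estimate $\mathbf{g}(\boldsymbol{\beta}) = \partial c(\boldsymbol{\beta})/\partial \boldsymbol{\beta}$, we can now test a hypothesis in the same fashion we did earlier.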
    Example 6.3 A Long-Run Marginal Propensity to Consume

    A consumption function that has different short- and long-run marginal propensities to consume can be written in the form

    $$\ln C_t = \alpha + \beta \ln Y_t + \gamma \ln C_{t-1} + \varepsilon_t,$$

    which is a distributed lag model. In this model, the short-run marginal propensity to consume (MPC) (elasticity, since the variables are in logs) is $\beta$, and the long-run MPC is $\delta = \beta/(1 - \gamma)$. Consider testing the hypothesis that $\delta = 1$.

    Quarterly data on aggregate U.S. consumption and disposable personal income for the years 1950 to 2000 are given in Appendix Table F5.1. The estimated equation based on these data is

    $$\ln C_t = \underset{(0.01055)}{0.003142} + \underset{(0.02873)}{0.07495} \ln Y_t + \underset{(0.02859)}{0.9246} \ln C_{t-1} + e_t, \quad R^2 = 0.999712, \quad s = 0.00874.$$

    Estimated standard errors are shown in parentheses. We will also require $\text{Est.Asy.Cov}[b, c] = -0.0003298$. The estimate of the long-run MPC is $d = b/(1 - c) = 0.07495/(1 - 0.9246) = 0.99403$. To compute the estimated variance of $d$, we will require

    $$g_b = \frac{\partial d}{\partial b} = \frac{1}{1 - c} = 13.2626, \qquad g_c = \frac{\partial d}{\partial c} = \frac{b}{(1 - c)^2} = 13.1834.$$

    The estimated asymptotic variance of $d$ is

    $$\text{Est.Asy.Var}[d] = g_b^2\, \text{Est.Asy.Var}[b] + g_c^2\, \text{Est.Asy.Var}[c] + 2 g_b g_c\, \text{Est.Asy.Cov}[b, c]$$

    $$= 13.2626^2 \times 0.02873^2 + 13.1834^2 \times 0.02859^2 + 2(13.2626)(13.1834)(-0.0003298) = 0.17192.$$


    The square root is 0.41464. To test the hypothesis that the long-run MPC is greater than or equal to 1, we would use

    $$z = \frac{0.99403 - 1}{0.41464} = -0.0144.$$

    Because we are using a large-sample approximation, we refer to a standard normal table instead of the t distribution. The hypothesis that $\delta = 1$ is not rejected.

    You may have noticed that we could have tested this hypothesis with a linear restriction instead; if $\delta = 1$, then $\beta = 1 - \gamma$, or $\beta + \gamma = 1$. The estimate is $q = b + c - 1 = -0.00045$. The estimated standard error of this linear function is $[0.02873^2 + 0.02859^2 - 2(0.0003298)]^{1/2} = 0.03136$. The t ratio for this test is $-0.01435$, which is the same as before. Since the sample used here is fairly large, this is to be expected. However, there is nothing in the computations that assures this outcome. In a smaller sample, we might have obtained a different answer. For example, using the last 11 years of the data, the t statistics for the two hypotheses are 7.652 and 5.681. The Wald test is not invariant to how the hypothesis is formulated. In a borderline case, we could have reached a different conclusion. This lack of invariance does not occur with the likelihood ratio or Lagrange multiplier tests discussed in Chapter 17. On the other hand, both of these tests require an assumption of normality, whereas the Wald statistic does not. This illustrates one of the trade-offs between a more detailed specification and the power of the test procedures that are implied.
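    The delta method calculation in Example 6.3, together with the alternative linear formulation, can be reproduced from the rounded estimates reported in the example. This is a worked sketch only; the variable names are hypothetical, and the covariance between $b$ and $c$ is entered as $-0.0003298$, the value consistent with the reported variance of 0.17192.

```python
import math

# Rounded estimates reported in Example 6.3.
b, c = 0.07495, 0.9246            # coefficients on ln Y_t and ln C_{t-1}
se_b, se_c = 0.02873, 0.02859     # their estimated standard errors
cov_bc = -0.0003298               # estimated asymptotic covariance of b and c

# Nonlinear formulation: d = b / (1 - c), tested against delta = 1.
d = b / (1.0 - c)
g_b = 1.0 / (1.0 - c)             # derivative of d with respect to b
g_c = b / (1.0 - c) ** 2          # derivative of d with respect to c
var_d = g_b**2 * se_b**2 + g_c**2 * se_c**2 + 2.0 * g_b * g_c * cov_bc
se_d = math.sqrt(var_d)
z = (d - 1.0) / se_d

# Linear formulation of the same hypothesis: b + c - 1 = 0.
q_hat = b + c - 1.0
se_q = math.sqrt(se_b**2 + se_c**2 + 2.0 * cov_bc)
t_linear = q_hat / se_q

print(f"d = {d:.5f}, se(d) = {se_d:.5f}, z = {z:.4f}")
print(f"q = {q_hat:.5f}, se(q) = {se_q:.5f}, t = {t_linear:.5f}")
```

    The two statistics agree closely here, but nothing in the algebra forces them to; the gradients $g_b$ and $g_c$ depend on where they are evaluated, which is the source of the Wald test's lack of invariance.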

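    The generalization to more than one function of the parameters proceeds along similar lines. Let $\mathbf{c}(\hat{\boldsymbol{\beta}})$ be a set of $J$ functions of the estimated parameter vector and let the $J \times K$ matrix of derivatives of $\mathbf{c}(\hat{\boldsymbol{\beta}})$ be

    $$\hat{\mathbf{G}} = \frac{\partial \mathbf{c}(\hat{\boldsymbol{\beta}})}{\partial \hat{\boldsymbol{\beta}}'}. \tag{6-27}$$

    The estimate of the asymptotic covariance matrix of these functions is

    $$\text{Est.Asy.Var}[\hat{\mathbf{c}}] = \hat{\mathbf{G}}\{\text{Est.Asy.Var}[\hat{\boldsymbol{\beta}}]\}\hat{\mathbf{G}}'. \tag{6-28}$$

    The $j$th row of $\mathbf{G}$ is $K$ derivatives of $c_j$ with respect to the $K$ elements of $\hat{\boldsymbol{\beta}}$. For example, the covariance matrix for estimates of the short- and long-run marginal propensities to consume would be obtained using

    $$\mathbf{G} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 1/(1 - \gamma) & \beta/(1 - \gamma)^2 \end{bmatrix}.$$

    The statistic for testing the $J$ hypotheses $\mathbf{c}(\boldsymbol{\beta}) = \mathbf{q}$ is

    $$W = (\hat{\mathbf{c}} - \mathbf{q})'\{\text{Est.Asy.Var}[\hat{\mathbf{c}}]\}^{-1}(\hat{\mathbf{c}} - \mathbf{q}). \tag{6-29}$$

    In large samples, $W$ has a chi-squared distribution with degrees of freedom equal to the number of restrictions. Note that for a single restriction, this value is the square of the statistic in (6-24).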
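    As an illustration of (6-28), the joint covariance matrix of the short- and long-run MPC estimates can be computed from the figures in Example 6.3. The sketch below makes one simplifying assumption that is not in the text: because the first column of $\mathbf{G}$ (the derivatives with respect to the constant) is zero, the constant is dropped and only the $2 \times 2$ block for $(b, c)$ is used. The helper functions are hypothetical and written in plain Python.

```python
# Rounded estimates from Example 6.3; cov(b, c) is entered with a negative sign.
b, c = 0.07495, 0.9246
V = [[0.02873**2, -0.0003298],
     [-0.0003298, 0.02859**2]]          # Est.Asy.Var of (b, c)

# Rows of G: derivatives of the short-run MPC (b) and long-run MPC (b/(1-c)).
G = [[1.0, 0.0],
     [1.0 / (1.0 - c), b / (1.0 - c) ** 2]]

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

# Est.Asy.Var[c_hat] = G V G', as in (6-28).
cov_mpc = matmul(matmul(G, V), transpose(G))

for row in cov_mpc:
    print([round(x, 6) for x in row])
```

    The lower right element reproduces the variance of the long-run MPC found in Example 6.3, and the off-diagonal element is the estimated covariance between the two propensities.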


    6.6 PREDICTION

    After the estimation of parameters, a common use of regression is for prediction.[8] Suppose that we wish to predict the value of $y^0$ associated with a regressor vector $\mathbf{x}^0$. This value would be $y^0 = \mathbf{x}^{0\prime}\boldsymbol{\beta} + \varepsilon^0$. It follows from the Gauss-Markov theorem that

    $$\hat{y}^0 = \mathbf{x}^{0\prime}\mathbf{b} \tag{6-30}$$

    is the minimum variance linear unbiased estimator of $E[y^0 \mid \mathbf{x}^0]$. The forecast error is

    $$e^0 = y^0 - \hat{y}^0 = (\boldsymbol{\beta} - \mathbf{b})'\mathbf{x}^0 + \varepsilon^0.$$

    The prediction variance to be applied to this estimate is

    $$\text{Var}[e^0 \mid \mathbf{X}, \mathbf{x}^0] = \sigma^2 + \text{Var}[(\boldsymbol{\beta} - \mathbf{b})'\mathbf{x}^0 \mid \mathbf{X}, \mathbf{x}^0] = \sigma^2 + \mathbf{x}^{0\prime}[\sigma^2(\mathbf{X}'\mathbf{X})^{-1}]\mathbf{x}^0. \tag{6-31}$$

    If the regression contains a constant term, then an equivalent expression is

    $$\text{Var}[e^0] = \sigma^2\left[1 + \frac{1}{n} + \sum_{j=1}^{K-1}\sum_{k=1}^{K-1}(x_j^0 - \bar{x}_j)(x_k^0 - \bar{x}_k)(\mathbf{Z}'\mathbf{M}^0\mathbf{Z})^{jk}\right],$$

    where $\mathbf{Z}$ is the $K - 1$ columns of $\mathbf{X}$ not including the constant. This result shows that the width of the interval depends on the distance of the elements of $\mathbf{x}^0$ from the center of the data. Intuitively, this idea makes sense; the farther the forecasted point is from the center of our experience, the greater is the degree of uncertainty.

    The prediction variance can be estimated by using $s^2$ in place of $\sigma^2$. A confidence interval for $y^0$ would be formed using a

    $$\text{prediction interval} = \hat{y}^0 \pm t_{\lambda/2}\, \text{se}(e^0).$$

    Figure 6.1 shows the effect for the bivariate case. Note that the prediction variance is composed of three parts. The second and third become progressively smaller as we accumulate more data (i.e., as $n$ increases). But the first term $\sigma^2$ is constant, which implies that no matter how much data we have, we can never predict perfectly.
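    The two expressions for the prediction variance can be checked against each other on simulated data. The following is a sketch under stated assumptions: the data, the point `x0`, and all names are hypothetical, and the interval uses the normal critical value 1.96 as a stand-in for the $t_{\lambda/2}$ value that would be read from a table.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical regression with a constant and two regressors.
n, K = 40, 3
Z = rng.normal(loc=[2.0, -1.0], scale=[1.0, 0.5], size=(n, K - 1))
X = np.column_stack([np.ones(n), Z])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.2, size=n)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (n - K)                      # estimator of sigma^2

x0 = np.array([1.0, 3.5, -0.2])           # point at which to predict
y0_hat = x0 @ b                           # prediction, as in (6-30)

# Estimated prediction variance from (6-31), with s^2 in place of sigma^2.
var_direct = s2 * (1.0 + x0 @ np.linalg.inv(X.T @ X) @ x0)

# Equivalent form for a model with a constant: deviations from the means.
zbar = Z.mean(axis=0)
Zc = Z - zbar                             # M0 Z, the centered regressors
dev = x0[1:] - zbar
var_centered = s2 * (1.0 + 1.0 / n + dev @ np.linalg.inv(Zc.T @ Zc) @ dev)

se = np.sqrt(var_direct)
interval = (y0_hat - 1.96 * se, y0_hat + 1.96 * se)
print(y0_hat, var_direct, var_centered, interval)
```

    Moving `x0` farther from the sample means increases the quadratic form in `dev` and widens the interval, while the leading term `s2` remains no matter how large `n` becomes.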
    [8] It is necessary at this point to make a largely semantic distinction between "prediction" and "forecasting." We will use the term "prediction" to mean using the regression model to compute fitted values of the dependent variable, either within the sample or for observations outside the sample. The same set of results will apply to cross sections, time series, or panels. These are the methods considered in this section. It is helpful at this point to reserve the term "forecasting" for usage of the time series models discussed in Chapter 20. One of the distinguishing features of the models in that setting will be the explicit role of "time" and the presence of lagged variables and disturbances in the equations and correlation of variables with past values.

    FIGURE 6.1 Prediction Intervals. [Plot of y against x showing the fitted line and the prediction interval around it.]

    Example 6.4 Prediction for Investment

    Suppose that we wish to "predict" the first quarter 2001 value of real investment. The average rate (secondary market) for the 90-day T-bill was 4.48% (down from 6.03 at the end of 2000); real GDP was 9316.8; the CPI_U was 528.0, and the time trend would equal 204. (We dropped one observation to compute the rate of inflation. Data were obtained from www.economagic.com.) The rate of inflation on a yearly basis would be $100\% \times 4 \times \ln(528.0/521.1) = 5.26\%$. The data vector for predicting $\ln I_{2001.1}$ would be $\mathbf{x}^0 = [1, 4.48, 5.26, 9.1396, 204]'$. Using the regression results in Example 6.1,

    $$\mathbf{x}^{0\prime}\mathbf{b} = [1, 4.48, 5.26, 9.1396, 204] \times [-9.1345, -0.008601, 0.003308, 1.9302, -0.005659]' = 7.3312.$$

    The estimated variance of this prediction is

    $$s^2[1 + \mathbf{x}^{0\prime}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}^0] = 0.0076912. \tag{6-32}$$

    The square root, 0.087699, gives the prediction standard deviation. Using this value, we obtain the prediction interval:

    $$7.3312 \pm 1.96(0.087699) = [7.1593, 7.5031].$$

    The yearly rate of real investment in the first quarter of 2001 was 1721. The log is 7.4507, so our forecast interval contains the actual value.

    We have forecasted the log of real investment with our regression model. If it is desired to forecast the level, the natural estimator would be $\hat{I} = \exp(\widehat{\ln I})$. Assuming that the estimator, itself, is at least asymptotically normally distributed, this should systematically underestimate the level by a factor of $\exp(\hat{\sigma}^2/2)$ based on the mean of the lognormal distribution. [See Wooldridge (2000, p. 203) and Section B.4.4.] It remains to determine what to use for $\hat{\sigma}^2$. In (6-32), the second part of the expression will vanish in large samples, leaving (as Wooldridge suggests) $s^2 = 0.007427$.[9] Using this scaling, we obtain a prediction of 1532.9, which is still 11 percent below the actual value. Evidently, this model based on an extremely long time series does not do a very good job of predicting at the end of the sample period. One might surmise various reasons, including some related to the model specification that we will address in Chapter 20, but as a first guess, it seems optimistic to apply an equation this simple to more than 50 years of data while expecting the underlying structure to be unchanging through the entire period. To investigate this possibility, we redid all the preceding calculations using only the data from 1990 to 2000 for the estimation. The prediction for the level of investment in 2001.1 is now 1885.2 (using the suggested scaling), which is an overestimate of 9.54 percent. But, this is more easily explained. The first quarter of 2001 began the first recession in the U.S. economy in nearly 10 years, and one of the early symptoms of a recession is a rapid decline in business investment.

    [9] Wooldridge suggests an alternative not necessarily based on an assumption of normality. Use as the scale factor the single coefficient in a within-sample regression of $y_i$ on the exponents of the fitted logs.
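    The numbers in Example 6.4 for the full-sample regression can be reproduced from the rounded values given in the text. This is a worked sketch only; the variable names are hypothetical, and the coefficient signs are those needed to reproduce the reported prediction of 7.3312.

```python
import math

# Annualized inflation from the quarterly CPI figures quoted in the example.
inflation = 100.0 * 4.0 * math.log(528.0 / 521.1)

# Data vector: constant, T-bill rate, inflation, ln(real GDP), time trend.
x0 = [1.0, 4.48, round(inflation, 2), 9.1396, 204.0]
# Coefficient estimates from Example 6.1, as quoted in Example 6.4.
b = [-9.1345, -0.008601, 0.003308, 1.9302, -0.005659]

log_pred = sum(x * coef for x, coef in zip(x0, b))   # prediction of ln I

# Prediction interval using the estimated variance reported in (6-32).
se_pred = math.sqrt(0.0076912)
lower, upper = log_pred - 1.96 * se_pred, log_pred + 1.96 * se_pred

# Level prediction with the lognormal scale factor exp(s^2 / 2).
s2 = 0.007427
level_pred = math.exp(log_pred) * math.exp(s2 / 2.0)
actual = 1721.0
shortfall_pct = 100.0 * (actual - level_pred) / actual

print(f"inflation = {inflation:.2f}%")
print(f"ln I prediction = {log_pred:.4f}, interval = [{lower:.4f}, {upper:.4f}]")
print(f"level prediction = {level_pred:.1f} ({shortfall_pct:.1f}% below actual)")
```

    The log of the actual value, about 7.4507, lies inside the interval even though the level prediction is roughly 11 percent too low, which shows how wide the interval is in level terms.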

    All the preceding assumes that $x^0$ is either known with certainty, ex post, or forecasted perfectly. If $x^0$ must itself be forecasted (an ex ante forecast), then the formula for the forecast variance in (6-31) would have to be modified to include the variation in $x^0$, which greatly complicates the computation. Most authors view it as simply intractable. Since Feldstein (1971) took up the problem, firm analytical results for the correct forecast variance in this case remain to be derived, except for simple special cases. The one qualitative result that seems certain is that (6-31) will understate the true variance. McCullough (1996) presents an alternative approach to computing appropriate forecast standard errors based on the method of bootstrapping. (See the end of Section 16.3.2.)

    Various measures have been proposed for assessing the predictive accuracy of forecasting models.[10] Most of these measures are designed to evaluate ex post forecasts, that is, forecasts for which the independent variables do not themselves have to be forecasted. Two measures that are based on the residuals from the forecasts are the root mean squared error,

    $$\text{RMSE} = \sqrt{\frac{1}{n^0}\sum_i (y_i - \hat{y}_i)^2},$$

    and the mean absolute error,

    $$\text{MAE} = \frac{1}{n^0}\sum_i |y_i - \hat{y}_i|,$$
    where $n^0$ is the number of periods being forecasted. (Note that both of these, as well as the measures below, are backward looking in that they are computed using the observed data on the independent variable.) These statistics have an obvious scaling problem: multiplying values of the dependent variable by any scalar multiplies the measure by that scalar as well. Several measures that are scale free are based on the Theil U statistic:[11]

    $$U = \sqrt{\frac{(1/n^0)\sum_i (y_i - \hat{y}_i)^2}{(1/n^0)\sum_i y_i^2}}.$$

    This measure is related to $R^2$ but is not bounded by zero and one. Large values indicate a poor forecasting performance. An alternative is to compute the measure in terms of the changes in y:

    $$U_\Delta = \sqrt{\frac{(1/n^0)\sum_i (\Delta y_i - \Delta\hat{y}_i)^2}{(1/n^0)\sum_i (\Delta y_i)^2}},$$

    [10] See Theil (1961) and Fair (1984).

    [11] Theil (1961).

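    where $\Delta y_i = y_i - y_{i-1}$ and $\Delta\hat{y}_i = \hat{y}_i - y_{i-1}$, or, in percentage changes, $\Delta y_i = (y_i - y_{i-1})/y_{i-1}$ and $\Delta\hat{y}_i = (\hat{y}_i - y_{i-1})/y_{i-1}$. These measures will reflect the model's ability to track turning points in the data.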
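    The accuracy measures above are simple to compute once the ex post forecasts are in hand. The sketch below is one possible implementation; the function names and the small data set are hypothetical and serve only to illustrate the scaling property discussed in the text.

```python
import math

def rmse(actual, forecast):
    """Root mean squared error over the n0 forecast periods."""
    n0 = len(actual)
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(actual, forecast)) / n0)

def mae(actual, forecast):
    """Mean absolute error over the n0 forecast periods."""
    n0 = len(actual)
    return sum(abs(y - f) for y, f in zip(actual, forecast)) / n0

def theil_u(actual, forecast):
    """Theil U statistic; the 1/n0 factors cancel in the ratio."""
    num = sum((y - f) ** 2 for y, f in zip(actual, forecast))
    den = sum(y ** 2 for y in actual)
    return math.sqrt(num / den)

def theil_u_changes(actual, forecast, previous):
    """Theil U in changes; previous[i] holds the lagged actual value y_{i-1}."""
    dy = [y - p for y, p in zip(actual, previous)]
    dy_hat = [f - p for f, p in zip(forecast, previous)]
    return theil_u(dy, dy_hat)

# Hypothetical data: four forecast periods.
actual = [102.0, 105.0, 103.0, 108.0]
forecast = [101.0, 104.0, 105.0, 107.0]
previous = [100.0, 102.0, 105.0, 103.0]

# Rescaling y by 10 multiplies RMSE and MAE by 10 but leaves U unchanged.
actual10 = [10.0 * y for y in actual]
forecast10 = [10.0 * f for f in forecast]

print(rmse(actual, forecast), rmse(actual10, forecast10))
print(mae(actual, forecast), mae(actual10, forecast10))
print(theil_u(actual, forecast), theil_u(actual10, forecast10))
print(theil_u_changes(actual, forecast, previous))
```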

    6.7 SUMMARY AND CONCLUSIONS

    This chapter has focused on two uses of the linear regression model: hypothesis testing and basic prediction. The central result for testing hypotheses is the F statistic. The F ratio can be produced in two equivalent ways: first, by measuring the extent to which the unrestricted least squares estimate differs from what a hypothesis would predict, and second, by measuring the loss of fit that results from assuming that a hypothesis is correct. We then extended the F statistic to more general settings by examining its large-sample properties, which allow us to discard the assumption of normally distributed disturbances, and by extending it to nonlinear restrictions.

    Key Terms and Concepts

    • Alternative hypothesis
    • Distributed lag
    • Discrepancy vector
    • Exclusion restrictions
    • Ex post forecast
    • Lagrange multiplier test
    • Limiting distribution
    • Linear restrictions
    • Nested models
    • Nonlinear restriction
    • Nonnested models
    • Noninvariance of Wald test
    • Nonnormality
    • Null hypothesis
    • Parameter space
    • Prediction interval
    • Prediction variance
    • Restricted least squares
    • Root mean squared error
    • Testable implications
    • Theil U statistic
    • Wald criterion

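    Exercises

    1. A multiple regression of y on a constant, $x_1$, and $x_2$ produces the following results: $\hat{y} = 4 + 0.4x_1 + 0.9x_2$, $R^2 = 8/60$, $\mathbf{e}'\mathbf{e} = 520$, $n = 29$,

    $$\mathbf{X}'\mathbf{X} = \begin{bmatrix} 29 & 0 & 0 \\ 0 & 50 & 10 \\ 0 & 10 & 80 \end{bmatrix}.$$

    Test the hypothesis that the two slopes sum to 1.

    2. Using the results in Exercise 1, test the hypothesis that the slope on $x_1$ is 0 by running the restricted regression and comparing the two sums of squared deviations.

    3. The regression model to be analyzed is $\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}$, where $\mathbf{X}_1$ and $\mathbf{X}_2$ have $K_1$ and $K_2$ columns, respectively. The restriction is $\boldsymbol{\beta}_2 = \mathbf{0}$.
       a. Using (6-14), prove that the restricted estimator is simply $[\mathbf{b}_{1*}, \mathbf{0}]$, where $\mathbf{b}_{1*}$ is the least squares coefficient vector in the regression of $\mathbf{y}$ on $\mathbf{X}_1$.
       b. Prove that if the restriction is $\boldsymbol{\beta}_2 = \boldsymbol{\beta}_2^0$ for a nonzero $\boldsymbol{\beta}_2^0$, then the restricted estimator of $\boldsymbol{\beta}_1$ is $\mathbf{b}_{1*} = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'(\mathbf{y} - \mathbf{X}_2\boldsymbol{\beta}_2^0)$.

    4. The expression for the restricted coefficient vector in (6-14) may be written in the form $\mathbf{b}_* = [\mathbf{I} - \mathbf{C}\mathbf{R}]\mathbf{b} + \mathbf{w}$, where $\mathbf{w}$ does not involve $\mathbf{b}$. What is $\mathbf{C}$? Show that the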

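    covariance matrix of the restricted least squares estimator is

    $$\sigma^2(\mathbf{X}'\mathbf{X})^{-1} - \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}$$

    and that this matrix may be written as

    $$\text{Var}[\mathbf{b} \mid \mathbf{X}]\left\{[\text{Var}(\mathbf{b} \mid \mathbf{X})]^{-1} - \mathbf{R}'[\text{Var}(\mathbf{R}\mathbf{b} \mid \mathbf{X})]^{-1}\mathbf{R}\right\}\text{Var}[\mathbf{b} \mid \mathbf{X}].$$

    5. Prove the result that the restricted least squares estimator never has a larger covariance matrix than the unrestricted least squares estimator.

    6. Prove the result that the $R^2$ associated with a restricted least squares estimator is never larger than that associated with the unrestricted least squares estimator. Conclude that imposing restrictions never improves the fit of the regression.

    7. The Lagrange multiplier test of the hypothesis $\mathbf{R}\boldsymbol{\beta} - \mathbf{q} = \mathbf{0}$ is equivalent to a Wald test of the hypothesis that $\boldsymbol{\lambda} = \mathbf{0}$, where $\boldsymbol{\lambda}$ is defined in (6-14). Prove that

    $$\chi^2 = \boldsymbol{\lambda}'\left\{\text{Est.Var}[\boldsymbol{\lambda}]\right\}^{-1}\boldsymbol{\lambda} = (n - K)\left[\frac{\mathbf{e}_*'\mathbf{e}_*}{\mathbf{e}'\mathbf{e}} - 1\right].$$

    Note that the fraction in brackets is the ratio of two estimators of $\sigma^2$. By virtue of (6-19) and the preceding discussion, we know that this ratio is greater than 1. Finally, prove that the Lagrange multiplier statistic is equivalent to JF, where J is the number of restrictions being tested and F is the conventional F statistic given in (6-6).

    8. Use the Lagrange multiplier test to test the hypothesis in Exercise 1.

    9. Using the data and model of Example 2.3, carry out a test of the hypothesis that the three aggregate price indices are not significant determinants of the demand for gasoline.

    10. The full model of Example 2.3 may be written in logarithmic terms as

    $$\ln(G/pop) = \alpha + \beta_p \ln P_g + \beta_y \ln Y + \gamma_{nc} \ln P_{nc} + \gamma_{uc} \ln P_{uc} + \gamma_{pt} \ln P_{pt} + \beta\,\text{year} + \delta_d \ln P_d + \delta_n \ln P_n + \delta_s \ln P_s + \varepsilon.$$

    Consider the hypothesis that the microelasticities are a constant proportion of the elasticity with respect to their corresponding aggregate. Thus, for some positive $\theta$ (presumably between 0 and 1), $\gamma_{nc} = \theta\delta_d$, $\gamma_{uc} = \theta\delta_d$, $\gamma_{pt} = \theta\delta_s$. The first two imply the simple linear restriction $\gamma_{nc} = \gamma_{uc}$. By taking ratios, the first (or second) and third imply the nonlinear restriction

    $$\frac{\gamma_{nc}}{\gamma_{pt}} = \frac{\delta_d}{\delta_s}, \quad \text{or} \quad \gamma_{nc}\delta_s - \gamma_{pt}\delta_d = 0.$$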

        a. Describe in detail how you would test the validity of the restriction.
        b. Using the gasoline market data in Table F2.2, test the restrictions separately and jointly.

    11. Prove that under the hypothesis that $\mathbf{R}\boldsymbol{\beta} = \mathbf{q}$, the estimator

    $$s_*^2 = \frac{(\mathbf{y} - \mathbf{X}\mathbf{b}_*)'(\mathbf{y} - \mathbf{X}\mathbf{b}_*)}{n - K + J},$$

    where J is the number of restrictions, is unbiased for $\sigma^2$.

    12. Show that in the multiple regression of y on a constant, $x_1$, and $x_2$, imposing the restriction $\beta_1 + \beta_2 = 1$ leads to the regression of $y - x_1$ on a constant and $x_2 - x_1$.

    7

    FUNCTIONAL FORM AND STRUCTURAL CHANGE

    7.1 INTRODUCTION

    In this chapter, we are concerned with the functional form of the regression model. Many different types of functions are linear by the definition considered in Section 2.3.1. By using different transformations of the dependent and independent variables, dummy variables, and different arrangements of functions of variables, a wide variety of models can be constructed that are all estimable by linear least squares. Section 7.2 considers using binary variables to accommodate nonlinearities in the model. Section 7.3 broadens the class of models that are linear in the parameters. Sections 7.4 and 7.5 then examine the issue of specifying and testing for change in the underlying model that generates the data, under the heading of structural change.

    7.2 USING BINARY VARIABLES

    One of the most useful devices in regression analysis is the binary, or dummy, variable. A dummy variable takes the value one for some observations to indicate the presence of an effect or membership in a group and zero for the remaining observations. Binary variables are a convenient means of building discrete shifts of the function into a regression model.

    7.2.1 BINARY VARIABLES IN REGRESSION

    Dummy variables are usually used in regression equations that also contain other quantitative variables. In the earnings equation in Example 4.3, we included a variable Kids to indicate whether there were children in the household, under the assumption that for many married women, this fact is a significant consideration in labor supply behavior. The results shown in Example 7.1 appear to be consistent with this hypothesis.

    Example 7.1 Dummy Variable in an Earnings Equation

    Table 7.1 following reproduces the estimated earnings equation in Example 4.3. The variable Kids is a dummy variable, which equals one if there are children under 18 in the household and zero otherwise. Since this is a semilog equation, the value of −.35 for the coefficient is an extremely large effect, one that suggests that, all other things equal, the earnings of women with children are nearly a third less than those without. This is a large difference, and one that would certainly merit closer scrutiny. Whether this effect results from different labor market effects that affect wages and not hours, or the reverse, remains to be seen. In addition, since the sample was nonrandomly selected to include only those with positive earnings, it is unclear whether the sampling mechanism has itself induced a bias in this coefficient.

    TABLE 7.1  Estimated Earnings Equation

    $\ln \text{earnings} = \beta_1 + \beta_2\,\text{age} + \beta_3\,\text{age}^2 + \beta_4\,\text{education} + \beta_5\,\text{kids} + \varepsilon$

    Sum of squared residuals:             599.4582
    Standard error of the regression:     1.19044
    R^2 based on 428 observations:        0.040995

    Variable      Coefficient     Standard Error    t Ratio
    Constant       3.24009        1.7674             1.833
    Age            0.20056        0.08386            2.392
    Age^2         -0.0023147      0.00098688        -2.345
    Education      0.067472       0.025248           2.672
    Kids          -0.35119        0.14753           -2.380

    In recent applications, researchers in many fields have studied the effects of treatment on some kind of response. Examples include the effect of college on lifetime income, sex differences in labor supply behavior (as in Example 7.1) and in salary structures in industries, and pre- versus postregime shifts in macroeconomic models, to name but a few. These examples can all be formulated in regression models involving a single dummy variable:

    $$y_i = \mathbf{x}_i'\boldsymbol{\beta} + \delta d_i + \varepsilon_i.$$

    One of the important issues in policy analysis concerns measurement of such treatment effects when the dummy variable results from an individual participation decision. For example, in studies of the effect of job training programs on post-training earnings, the treatment dummy might be measuring the latent motivation and initiative of the participants rather than the effect of the program itself. We will revisit this subject in Section 22.4.

    It is common for researchers to include a dummy variable in a regression to account for something that applies only to a single observation. For example, in time-series analyses, an occasional study includes a dummy variable that is one only in a single unusual year, such as the year of a major strike or a major policy event. (See, for example, the application to the German money demand function in Section 20.6.5.) It is easy to show (we consider this in the exercises) the very useful implication of this:

    A dummy variable that takes the value one only for one observation has the effect of deleting that observation from computation of the least squares slopes and variance estimator (but not R-squared).
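    The single-observation result can be checked numerically. The following is a minimal sketch on simulated data (the data-generating values, the index of the unusual observation, and the helper name `ols` are all assumptions made for illustration): it compares a regression that includes a one-observation dummy with a regression that simply drops that observation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
y[7] += 10.0                       # one unusual observation (hypothetical)

def ols(X, y):
    """Return coefficients, s^2 = e'e/(n - K), and R^2."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    s2 = (e @ e) / (X.shape[0] - X.shape[1])
    r2 = 1.0 - (e @ e) / np.sum((y - y.mean()) ** 2)
    return b, s2, r2

const = np.ones(n)

# Regression with a dummy that equals one only for observation 7.
d = np.zeros(n)
d[7] = 1.0
b_dummy, s2_dummy, r2_dummy = ols(np.column_stack([const, x, d]), y)

# Regression that deletes observation 7 instead.
keep = np.arange(n) != 7
b_drop, s2_drop, r2_drop = ols(np.column_stack([const[keep], x[keep]]), y[keep])

print("constant and slope:", b_dummy[:2], b_drop)
print("s^2:", s2_dummy, s2_drop)
print("R^2:", r2_dummy, r2_drop)   # these differ
```

    The residual for the dummied observation is exactly zero, so the sum of squared residuals and the degrees of freedom match the regression without that observation, while the total sum of squares (and hence R-squared) still includes it.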

    7.2.2 SEVERAL CATEGORIES

    When there are several categories, a set of binary variables is necessary. Correcting for seasonal factors in macroeconomic data is a common application. We could write a consumption function for quarterly data as

    $$C_t = \beta_1 + \beta_2 x_t + \delta_1 D_{t1} + \delta_2 D_{t2} + \delta_3 D_{t3} + \varepsilon_t,$$

    where $x_t$ is disposable income. Note that only three of the four quarterly dummy variables are included in the model. If the fourth were included, then the four dummy variables would sum to one at every observation, which would reproduce the constant term, a case of perfect multicollinearity. This is known as the dummy variable trap. Thus, to avoid the dummy variable trap, we drop the dummy variable for the fourth quarter. (Depending on the application, it might be preferable to have four separate dummy variables and drop the overall constant.)[1] Any of the four quarters (or 12 months) can be used as the base period.

    The preceding is a means of deseasonalizing the data. Consider the alternative formulation:

    $$C_t = \beta x_t + \delta_1 D_{t1} + \delta_2 D_{t2} + \delta_3 D_{t3} + \delta_4 D_{t4} + \varepsilon_t. \qquad (7\text{-}1)$$

    Using the results from Chapter 3 on partitioned regression, we know that the preceding multiple regression is equivalent to first regressing C and x on the four dummy variables and then using the residuals from these regressions in the subsequent regression of deseasonalized consumption on deseasonalized income. Clearly, deseasonalizing in this fashion prior to computing the simple regression of consumption on income produces the same coefficient on income (and the same vector of residuals) as including the set of dummy variables in the regression.
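    Both points in this subsection, the dummy variable trap and the equivalence between including seasonal dummies and deseasonalizing first, can be illustrated on simulated quarterly data. This is a sketch under assumed parameter values; nothing in it comes from an actual consumption data set.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 40                                     # ten years of quarterly data
quarter = np.arange(T) % 4
D = np.eye(4)[quarter]                     # T x 4 matrix of quarterly dummies
x = 100.0 + rng.normal(scale=5.0, size=T) + 3.0 * (quarter == 3)
C = 0.8 * x + D @ np.array([5.0, 6.0, 4.0, 9.0]) + rng.normal(size=T)

def ols(X, y):
    """Return least squares coefficients and residuals."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b, y - X @ b

# (a) One regression: income plus all four dummies, no overall constant.
b_full, e_full = ols(np.column_stack([x, D]), C)

# (b) Two steps: remove quarterly means from C and x, then regress
#     deseasonalized consumption on deseasonalized income.
_, C_res = ols(D, C)
_, x_res = ols(D, x)
b_fw, e_fw = ols(x_res[:, None], C_res)

# Dummy variable trap: a constant plus four dummies is rank deficient.
X_trap = np.column_stack([np.ones(T), x, D])
rank_trap = np.linalg.matrix_rank(X_trap)

print("income coefficient:", b_full[0], b_fw[0])
print("columns:", X_trap.shape[1], "rank:", rank_trap)
```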
    7.2.3 SEVERAL GROUPINGS

    The case in which several sets of dummy variables are needed is much the same as those we have already considered, with one important exception. Consider a model of statewide per capita expenditure on education y as a function of statewide per capita income x. Suppose that we have observations on all n = 50 states for T = 10 years. A regression model that allows the expected expenditure to change over time as well as across states would be

    $$y_{it} = \alpha + \beta x_{it} + \delta_i + \theta_t + \varepsilon_{it}. \qquad (7\text{-}2)$$

    As before, it is necessary to drop one of the variables in each set of dummy variables to avoid the dummy variable trap. For our example, if a total of 50 state dummies and 10 time dummies is retained, a problem of perfect multicollinearity remains; the sums of the 50 state dummies and the 10 time dummies are the same, that is, 1. One of the variables in each of the sets (or the overall constant term and one of the variables in one of the sets) must be omitted.

    Example 7.2 Analysis of Covariance

    The data in Appendix Table F7.1 were used in a study of efficiency in production of airline services in Greene (1997b). The airline industry has been a favorite subject of study [e.g., Schmidt and Sickles (1984); Sickles, Good, and Johnson (1986)], partly because of interest in this rapidly changing market in a period of deregulation and partly because of an abundance of large, high-quality data sets collected by the (no longer existent) Civil Aeronautics Board. The original data set consisted of 25 firms observed yearly for 15 years (1970 to 1984), a balanced panel. Several of the firms merged during this period and several others experienced strikes, which reduced the number of complete observations substantially. Omitting these and others because of missing data on some of the variables left a group of 10 full

    [1] See Suits (1984) and Greene and Seaks (1991).

    [FIGURE 7.1 Estimated Year Dummy Variable Coefficients. Vertical axis: estimated year-specific effects; horizontal axis: year, 1969 to 1984.]

    observations, from which we have selected six for the examples to follow. We will fit a cost equation of the form

    $$\ln C_{i,t} = \beta_1 + \beta_2 \ln Q_{i,t} + \beta_3 \ln^2 Q_{i,t} + \beta_4 \ln P_{fuel\,i,t} + \beta_5\,\text{Loadfactor}_{i,t} + \sum_{t=1}^{14} \theta_t D_{i,t} + \sum_{i=1}^{5} \delta_i F_{i,t} + \varepsilon_{i,t}.$$

    The dummy variables are $D_{i,t}$, which is the year variable, and $F_{i,t}$, which is the firm variable. We have dropped the last one in each group. The estimated model for the full specification is

    $$\ln C_{i,t} = 13.56 + 0.8866 \ln Q_{i,t} + 0.01261 \ln^2 Q_{i,t} + 0.1281 \ln P_{f\,i,t} - 0.8855\,\text{LF}_{i,t} + \text{time effects} + \text{firm effects}.$$

    The year effects display a revealing pattern, as shown in Figure 7.1. This was a period of rapidly rising fuel prices, so the cost effects are to be expected. Since one year dummy variable is dropped, the effect shown is relative to this base year (1984).

    We are interested in whether the firm effects, the time effects, both, or neither are statistically significant. Table 7.2 presents the sums of squares from the four regressions. The F statistic for the hypothesis that there are no firm-specific effects is 65.94, which is highly significant. The statistic for the time effects is only 2.61, which is larger than the critical value
    TABLE 7.2  F tests for Firm and Year Effects

    Model           Sum of Squares    Parameters    F        Deg.Fr.
    Full Model      0.17257           24
    Time Effects    1.03470           19            65.94    [5, 66]
    Firm Effects    0.26815           10             2.61    [14, 66]
    No Effects      1.27492            5            22.19    [19, 66]

    of 1.84, but perhaps less so than Figure 7.1 might have suggested. In the absence of the year-specific dummy variables, the year-specific effects are probably largely absorbed by the price of fuel.
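    The F statistics in Table 7.2 follow from the sums of squares and parameter counts alone. The sketch below recomputes them with the usual loss-of-fit formula; the sample size of 90 is inferred here from six firms observed for 15 years (consistent with the 66 denominator degrees of freedom), and the function name `f_stat` is illustrative.

```python
def f_stat(ssr_restricted, ssr_full, k_restricted, k_full, n):
    """Loss-of-fit F statistic and its degrees of freedom (J, n - K)."""
    j = k_full - k_restricted
    df = n - k_full
    f = ((ssr_restricted - ssr_full) / j) / (ssr_full / df)
    return f, (j, df)

n = 90                       # assumed: 6 firms x 15 years
ssr_full, k_full = 0.17257, 24

# Each restricted model is compared with the full model.
f_firm, df_firm = f_stat(1.03470, ssr_full, 19, k_full, n)   # drop firm effects
f_time, df_time = f_stat(0.26815, ssr_full, 10, k_full, n)   # drop time effects
f_none, df_none = f_stat(1.27492, ssr_full, 5, k_full, n)    # drop both sets

print(f"no firm effects: F{df_firm} = {f_firm:.2f}")
print(f"no time effects: F{df_time} = {f_time:.2f}")
print(f"no effects:      F{df_none} = {f_none:.2f}")
```

    Note that the row labels in the table name the effects that remain in the restricted model, so the "Time Effects" row is the one that tests for the absence of firm effects.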
    7.2.4 THRESHOLD EFFECTS AND CATEGORICAL VARIABLES

    In most applications, we use dummy variables to account for purely qualitative factors, such as membership in a group, or to represent a particular time period. There are cases, however, in which the dummy variable(s) represents levels of some underlying factor that might have been measured directly if this were possible. For example, education is a case in which we typically observe certain thresholds rather than, say, years of education. Suppose, for example, that our interest is in a regression of the form

    $$\text{income} = \beta_1 + \beta_2\,\text{age} + \text{effect of education} + \varepsilon.$$

    The data on education might consist of the highest level of education attained, such as high school (HS), undergraduate (B), master's (M), or Ph.D. (P). An obviously unsatisfactory way to proceed is to use a variable E that is 0 for the first group, 1 for the second, 2 for the third, and 3 for the fourth. That is,

    $$\text{income} = \beta_1 + \beta_2\,\text{age} + \beta_3 E + \varepsilon.$$

    The difficulty with this approach is that it assumes that the increment in income at each threshold is the same; $\beta_3$ is the difference between income with a Ph.D. and a master's and between a master's and a bachelor's degree. This is unlikely and unduly restricts the regression. A more flexible model would use three (or four) binary variables, one for each level of education. Thus, we would write

    $$\text{income} = \beta_1 + \beta_2\,\text{age} + \delta_B B + \delta_M M + \delta_P P + \varepsilon.$$

    The correspondence between the coefficients and income for a given age is

    High school:   $E[\text{income} \mid \text{age}, HS] = \beta_1 + \beta_2\,\text{age}$,
    Bachelor's:    $E[\text{income} \mid \text{age}, B] = \beta_1 + \beta_2\,\text{age} + \delta_B$,
    Master's:      $E[\text{income} \mid \text{age}, M] = \beta_1 + \beta_2\,\text{age} + \delta_M$,
    Ph.D.:         $E[\text{income} \mid \text{age}, P] = \beta_1 + \beta_2\,\text{age} + \delta_P$.

    The differences between, say, $\delta_P$ and $\delta_M$ and between $\delta_M$ and $\delta_B$ are of interest. Obviously, these are simple to compute. An alternative way to formulate the equation that reveals these differences directly is to redefine the dummy variables to be 1 if the individual has the degree, rather than whether the degree is the highest degree obtained. Thus, for someone with a Ph.D., all three binary variables are 1, and so on. By defining the variables in this fashion, the regression is now

    High school:   $E[\text{income} \mid \text{age}, HS] = \beta_1 + \beta_2\,\text{age}$,
    Bachelor's:    $E[\text{income} \mid \text{age}, B] = \beta_1 + \beta_2\,\text{age} + \delta_B$,
    Master's:      $E[\text{income} \mid \text{age}, M] = \beta_1 + \beta_2\,\text{age} + \delta_B + \delta_M$,
    Ph.D.:         $E[\text{income} \mid \text{age}, P] = \beta_1 + \beta_2\,\text{age} + \delta_B + \delta_M + \delta_P$.

    Instead of the difference between a Ph.D. and the base case, in this model $\delta_P$ is the marginal value of the Ph.D. How equations with dummy variables are formulated is a matter of convenience. All the results can be obtained from a basic equation.
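    The equivalence of the two dummy-variable formulations can be seen by fitting both to the same data. The following sketch uses simulated incomes with assumed education premia; the variable names (`X_highest`, `X_has`) are hypothetical labels for the "highest degree" coding and the "has the degree" coding.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
age = rng.uniform(25.0, 60.0, size=n)
level = rng.integers(0, 4, size=n)            # 0 = HS, 1 = B, 2 = M, 3 = P
premium = np.array([0.0, 8.0, 12.0, 20.0])    # assumed education effects
income = 10.0 + 0.5 * age + premium[level] + rng.normal(size=n)

const = np.ones(n)

# Coding 1: dummy equals one if the degree is the highest one obtained.
X_highest = np.column_stack(
    [const, age, level == 1, level == 2, level == 3]).astype(float)

# Coding 2: dummy equals one if the individual has the degree at all.
X_has = np.column_stack(
    [const, age, level >= 1, level >= 2, level >= 3]).astype(float)

b_highest, *_ = np.linalg.lstsq(X_highest, income, rcond=None)
b_has, *_ = np.linalg.lstsq(X_has, income, rcond=None)

# Marginal values of each degree implied by the first coding.
marginal_from_highest = np.array([
    b_highest[2],
    b_highest[3] - b_highest[2],
    b_highest[4] - b_highest[3],
])

print("coding 2 coefficients:     ", b_has[2:])
print("differences from coding 1: ", marginal_from_highest)
```

    Because the two sets of regressors span the same column space, the fitted values are identical and only the interpretation of the coefficients changes.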

    [FIGURE 7.2 Spline Function. Vertical axis: Income; horizontal axis: Age, with knots marked at 18 and 22.]

    7.2.5 SPLINE REGRESSION

    If one is examining income data for a large cross section of individuals of varying ages in a population, then certain patterns with regard to some age thresholds will be clearly evident. In particular, throughout the range of values of age, income will be rising, but the slope might change at some distinct milestones, for example, at age 18, when the typical individual graduates from high school, and at age 22, when he or she graduates from college. The time profile of income for the typical individual in this population might appear as in Figure 7.2. Based on the discussion in the preceding paragraph, we could fit such a regression model just by dividing the sample into three subsamples. However, this would neglect the continuity of the proposed function. The result would appear more like the dotted figure than the continuous function we had in mind. Restricted regression and what is known as a spline function can be used to achieve the desired effect.[2]

    The function we wish to estimate is

    $$E[\text{income} \mid \text{age}] = \begin{cases} \alpha^0 + \beta^0\,\text{age} & \text{if age} < 18, \\ \alpha^1 + \beta^1\,\text{age} & \text{if age} \ge 18 \text{ and age} < 22, \\ \alpha^2 + \beta^2\,\text{age} & \text{if age} \ge 22. \end{cases}$$

    [2] An important reference on this subject is Poirier (1974). An often-cited application appears in Garber and Poirier (1974).

    The threshold values, 18 and 22, are called knots. Let

    $$d_1 = 1 \ \text{if age} \ge t_1^*, \qquad d_2 = 1 \ \text{if age} \ge t_2^*,$$

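    where $t_1^* = 18$ and $t_2^* = 22$. To combine all three equations, we use

    $$\text{income} = \beta_1 + \beta_2\,\text{age} + \gamma_1 d_1 + \delta_1 d_1\,\text{age} + \gamma_2 d_2 + \delta_2 d_2\,\text{age} + \varepsilon. \qquad (7\text{-}3)$$

    This relationship is the dashed function in Figure 7.2. The slopes in the three segments are $\beta_2$, $\beta_2 + \delta_1$, and $\beta_2 + \delta_1 + \delta_2$. To make the function piecewise continuous, we require that the segments join at the knots, that is,

    $$\beta_1 + \beta_2 t_1^* = (\beta_1 + \gamma_1) + (\beta_2 + \delta_1) t_1^*$$

    and

    $$(\beta_1 + \gamma_1) + (\beta_2 + \delta_1) t_2^* = (\beta_1 + \gamma_1 + \gamma_2) + (\beta_2 + \delta_1 + \delta_2) t_2^*.$$

    These are linear restrictions on the coefficients. Collecting terms, the first one is

    $$\gamma_1 + \delta_1 t_1^* = 0 \quad \text{or} \quad \gamma_1 = -\delta_1 t_1^*.$$

    Doing likewise for the second and inserting these in (7-3), we obtain

    $$\text{income} = \beta_1 + \beta_2\,\text{age} + \delta_1 d_1(\text{age} - t_1^*) + \delta_2 d_2(\text{age} - t_2^*) + \varepsilon.$$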
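    The restricted equation just obtained is an ordinary multiple regression on age and two "hinge" variables, which is easy to set up directly. The sketch below uses simulated data with assumed coefficient values; the helper names `hinge` and `fitted` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
age = rng.uniform(10.0, 40.0, size=n)
t1, t2 = 18.0, 22.0                         # the knots from the text

# Assumed "true" coefficients, used only to simulate the data.
beta1, beta2, delta1, delta2 = 5.0, 0.2, 1.5, -0.7

def hinge(a, knot):
    """d * (age - knot), where d = 1 if age >= knot and 0 otherwise."""
    return np.where(a >= knot, a - knot, 0.0)

income = (beta1 + beta2 * age + delta1 * hinge(age, t1)
          + delta2 * hinge(age, t2) + rng.normal(scale=0.2, size=n))

# Regress income on a constant, age, and the two hinge variables.
X = np.column_stack([np.ones(n), age, hinge(age, t1), hinge(age, t2)])
b, *_ = np.linalg.lstsq(X, income, rcond=None)

# Slopes in the three segments: b2, b2 + d1, b2 + d1 + d2.
slopes = np.array([b[1], b[1] + b[2], b[1] + b[2] + b[3]])

def fitted(a):
    """Fitted spline evaluated at age(s) a."""
    a = np.asarray(a, dtype=float)
    return b[0] + b[1] * a + b[2] * hinge(a, t1) + b[3] * hinge(a, t2)

print("segment slopes:", slopes)
print("fit on either side of the first knot:",
      fitted(t1 - 1e-9), fitted(t1 + 1e-9))
```

    Continuity at the knots holds by construction, since each hinge variable is zero at its own knot; only the slope changes there.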

    Constrained least squares estimates are obtainable by multiple regression, using a constant and the variables

    $x_1$ = age,
    $x_2$ = age − 18 if age ≥ 18 and 0 otherwise,

    and

    $x_3$ = age − 22 if age ≥ 22 and 0 otherwise.

    We can test the hypothesis that the slope of the function is constant with the joint test of the two restrictions $\delta_1 = 0$ and $\delta_2 = 0$.

    7.3 NONLINEARITY IN THE VARIABLES

    It is useful at this point to write the linear regression model in a very general form: Let $\mathbf{z} = z_1, z_2, \ldots, z_L$ be a set of L independent variables; let $f_1, f_2, \ldots, f_K$ be K linearly independent functions of $\mathbf{z}$; let $g(y)$ be an observable function of y; and retain the usual assumptions about the disturbance. The linear regression model is

    $$g(y) = \beta_1 f_1(\mathbf{z}) + \beta_2 f_2(\mathbf{z}) + \cdots + \beta_K f_K(\mathbf{z}) + \varepsilon = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K + \varepsilon = \mathbf{x}'\boldsymbol{\beta} + \varepsilon. \qquad (7\text{-}4)$$

    By using logarithms, exponentials, reciprocals, transcendental functions, polynomials, products, ratios, and so on, this linear model can be tailored to any number of situations.

    7.3.1 FUNCTIONAL FORMS

    A commonly used form of regression model is the loglinear model,

    $$\ln y = \ln \alpha + \sum_k \beta_k \ln X_k + \varepsilon = \beta_1 + \sum_k \beta_k x_k + \varepsilon.$$

    In this model, the coefficients are elasticities:

    $$\left(\frac{\partial y}{\partial x_k}\right)\left(\frac{x_k}{y}\right) = \frac{\partial \ln y}{\partial \ln x_k} = \beta_k. \qquad (7\text{-}5)$$

    In the loglinear equation, measured changes are in proportional or percentage terms; $\beta_k$ measures the percentage change in y associated with a one percent change in $x_k$. This removes the units of measurement of the variables from consideration in using the regression model. An alternative approach sometimes taken is to measure the variables and associated changes in standard deviation units. If the data are standardized before estimation using $x_{ik}^* = (x_{ik} - \bar{x}_k)/s_k$ and likewise for y, then the least squares regression coefficients measure changes in standard deviation units rather than natural or percentage terms. (Note that the constant term disappears from this regression.) It is not necessary actually to transform the data to produce these results; multiplying each least squares coefficient $b_k$ in the original regression by $s_k/s_y$ produces the same result.

    A hybrid of the linear and loglinear models is the semilog equation

    $$\ln y = \beta_1 + \beta_2 x + \varepsilon.$$

    We used this form in the investment equation in Section 6.2,

    $$\ln I_t = \beta_1 + \beta_2(i_t - \Delta p_t) + \beta_3 \Delta p_t + \beta_4 \ln Y_t + \beta_5 t + \varepsilon_t, \qquad (7\text{-}6)$$

    where the log of investment is modeled in the levels of the real interest rate, the price level, and a time trend. In a semilog equation with a time trend such as this one, $d \ln I/dt = \beta_5$ is the average rate of growth of I. The estimated value of −.005 in Table 6.1 suggests that over the full estimation period, after accounting for all other factors, the average rate of growth of investment was −.5 percent per year.

    The coefficients in the semilog model are partial- or semi-elasticities; in (7-6), $\beta_2$ is $\partial \ln I/\partial(i - \Delta p)$, the counterpart of $\partial \ln y/\partial x$ in the generic semilog equation. This is a natural form for models with dummy variables such as the earnings equation in Example 7.1. The coefficient on Kids of −.35 suggests that, all else equal, earnings are approximately 35 percent less when there are children in the household.

    The quadratic earnings equation in Example 7.1 shows another use of nonlinearities in the variables. Using the results in Example 7.1, we find that for a woman with 12 years of schooling and children in the household, the age-earnings profile appears as in Figure 7.3. This figure suggests an important question in this framework. It is tempting to conclude that Figure 7.3 shows the earnings trajectory of a person at different ages, but that is not what the data provide. The model is based on a cross section, and what it displays is the earnings of different people of different ages. How this profile relates to the expected earnings path of one individual is a different, and complicated, question.

    Another useful formulation of the regression model is one with interaction terms. For example, a model relating braking distance D to speed S and road wetness W might be

    $$D = \beta_1 + \beta_2 S + \beta_3 W + \beta_4 SW + \varepsilon.$$

    In this model,

    $$\frac{\partial E[D \mid S, W]}{\partial S} = \beta_2 + \beta_4 W,$$

    [FIGURE 7.3 Age-Earnings Profile. Panel title: Earnings Profile by Age; vertical axis: Earnings (500 to 3500); horizontal axis: Age (20 to 65).]

    which implies that the marginal effect of higher speed on braking distance is increased when the road is wetter (assuming that $\beta_4$ is positive). If it is desired to form confidence intervals or test hypotheses about these marginal effects, then the necessary standard error is computed from

    $$\text{Var}\left[\frac{\partial \hat{E}[D \mid S, W]}{\partial S}\right] = \text{Var}[\hat{\beta}_2] + W^2\,\text{Var}[\hat{\beta}_4] + 2W\,\text{Cov}[\hat{\beta}_2, \hat{\beta}_4],$$

    and similarly for $\partial E[D \mid S, W]/\partial W$. A value must be inserted for W. The sample mean is a natural choice, but for some purposes, a specific value, such as an extreme value of W in this example, might be preferred.
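    The variance formula for the marginal effect is a special case of the variance of a linear combination of the coefficient estimates. The sketch below applies it to simulated braking data (the parameter values and the function name `marginal_effect_of_speed` are assumptions) and evaluates the effect at the sample mean and at an extreme value of W.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 150
S = rng.uniform(20.0, 100.0, size=n)         # speed
W = rng.uniform(0.0, 1.0, size=n)            # road wetness index (hypothetical)
D = 5.0 + 0.8 * S + 10.0 * W + 0.4 * S * W + rng.normal(scale=5.0, size=n)

X = np.column_stack([np.ones(n), S, W, S * W])
b, *_ = np.linalg.lstsq(X, D, rcond=None)
e = D - X @ b
s2 = (e @ e) / (n - X.shape[1])
V = s2 * np.linalg.inv(X.T @ X)              # estimated covariance matrix of b

def marginal_effect_of_speed(w):
    """Estimate of b2 + b4*w and its standard error at wetness w."""
    effect = b[1] + b[3] * w
    var = V[1, 1] + w ** 2 * V[3, 3] + 2.0 * w * V[1, 3]
    return effect, np.sqrt(var)

effect_mean, se_mean = marginal_effect_of_speed(W.mean())
effect_max, se_max = marginal_effect_of_speed(W.max())

print(f"at mean W: effect = {effect_mean:.3f}, std. error = {se_mean:.3f}")
print(f"at max W:  effect = {effect_max:.3f}, std. error = {se_max:.3f}")
```

    The same number results from the quadratic form $\mathbf{g}'\mathbf{V}\mathbf{g}$ with $\mathbf{g} = (0, 1, 0, W)'$, which generalizes directly to marginal effects that involve more than two coefficients.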
    7.3.2 IDENTIFYING NONLINEARITY

    If the functional form is not known a priori, then there are a few approaches that may help at least to identify any nonlinearity and provide some information about it from the sample. For example, if the suspected nonlinearity is with respect to a single regressor in the equation, then fitting a quadratic or cubic polynomial rather than a linear function may capture some of the nonlinearity. By choosing several ranges for the regressor in question and allowing the slope of the function to be different in each range, a piecewise linear approximation to the nonlinear function can be fit.

    Example 7.3 Functional Form for a Nonlinear Cost Function

    In a celebrated study of economies of scale in the U.S. electric power industry, Nerlove (1963) analyzed the production costs of 145 American electric generating companies. This study

    produced several innovations in microeconometrics. It was among the first major applications of statistical cost analysis. The theoretical development in Nerlove's study was the first to show how the fundamental theory of duality between production and cost functions could be used to frame an econometric model. Finally, Nerlove employed several useful techniques to sharpen his basic model.

    The focus of the paper was economies of scale, typically modeled as a characteristic of the production function. He chose a Cobb-Douglas function to model output as a function of capital, K, labor, L, and fuel, F:

    $$Q = \alpha_0 K^{\alpha_K} L^{\alpha_L} F^{\alpha_F} e^{\varepsilon_i},$$

    where Q is output and $\varepsilon_i$ embodies the unmeasured differences across firms. The economies of scale parameter is $r = \alpha_K + \alpha_L + \alpha_F$. The value one indicates constant returns to scale. In this study, Nerlove investigated the widely accepted assumption that producers in this industry enjoyed substantial economies of scale. The production model is loglinear, so assuming that other conditions of the classical regression model are met, the four parameters could be estimated by least squares. However, he argued that the three factors could not be treated as exogenous variables. For a firm that optimizes by choosing its factors of production, the demand for fuel would be $F^* = F^*(Q, P_K, P_L, P_F)$ and likewise for labor and capital, so certainly the assumptions of the classical model are violated.

    In the regulatory framework in place at the time, state commissions set rates and firms met the demand forthcoming at the regulated prices. Thus, it was argued that output (as well as the factor prices) could be viewed as exogenous to the firm and, based on an argument by Zellner, Kmenta, and Dreze (1964), Nerlove argued that at equilibrium, the deviation of costs from the long-run optimum would be independent of output. (This has a testable implication, which we will explore in Chapter 14.) Thus, the firm's objective was cost minimization subject to the constraint of the production function. This can be formulated as a Lagrangean problem,

    $$\min_{K, L, F}\ P_K K + P_L L + P_F F + \lambda\left(Q - \alpha_0 K^{\alpha_K} L^{\alpha_L} F^{\alpha_F}\right).$$

    The solution to this minimization problem is the three factor demands and the multiplier (which measures marginal cost). Inserted back into total costs, this produces an (intrinsically linear) loglinear cost function,

    $$P_K K^* + P_L L^* + P_F F^* = C(Q, P_K, P_L, P_F) = r A Q^{1/r} P_K^{\alpha_K/r} P_L^{\alpha_L/r} P_F^{\alpha_F/r} e^{\varepsilon_i/r},$$

    or

    $$\ln C = \beta_1 + \beta_q \ln Q + \beta_K \ln P_K + \beta_L \ln P_L + \beta_F \ln P_F + u_i, \qquad (7\text{-}7)$$

    where $\beta_q = 1/(\alpha_K + \alpha_L + \alpha_F)$ is now the parameter of interest and $\beta_j = \alpha_j/r$, $j = K, L, F$.[3] Thus, the duality between production and cost functions has been used to derive the estimating equation from first principles.

    A complication remains. The cost parameters must sum to one, $\beta_K + \beta_L + \beta_F = 1$, so estimation must be done subject to this constraint.[4] This restriction can be imposed by regressing $\ln(C/P_F)$ on a constant, $\ln Q$, $\ln(P_K/P_F)$, and $\ln(P_L/P_F)$. This first set of results appears at the top of Table 7.3.
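    Imposing the adding-up restriction by dividing cost and the other prices by the fuel price gives the same estimates as applying the general restricted least squares formula to the untransformed regression. The sketch below verifies this on simulated data; it does not use Nerlove's data, and all coefficient values and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 145
lnQ, lnPK, lnPL, lnPF = (rng.normal(size=n) for _ in range(4))

# Simulated log cost; the assumed price coefficients sum to one.
lnC = (1.0 + 0.7 * lnQ + 0.2 * lnPK + 0.5 * lnPL + 0.3 * lnPF
       + rng.normal(scale=0.1, size=n))

# (a) Transformed regression: ln(C/PF) on constant, ln Q, ln(PK/PF), ln(PL/PF).
Xt = np.column_stack([np.ones(n), lnQ, lnPK - lnPF, lnPL - lnPF])
bt, *_ = np.linalg.lstsq(Xt, lnC - lnPF, rcond=None)
b_transformed = np.append(bt, 1.0 - bt[2] - bt[3])   # recover the fuel coefficient

# (b) Restricted least squares on the untransformed regression, R b = q.
X = np.column_stack([np.ones(n), lnQ, lnPK, lnPL, lnPF])
b, *_ = np.linalg.lstsq(X, lnC, rcond=None)
R = np.array([[0.0, 0.0, 1.0, 1.0, 1.0]])
q = np.array([1.0])
XtX_inv = np.linalg.inv(X.T @ X)
adjust = XtX_inv @ R.T @ np.linalg.solve(R @ XtX_inv @ R.T, R @ b - q)
b_restricted = b - adjust

print("transformed regression:  ", b_transformed)
print("restricted least squares:", b_restricted)
```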
    [3] Readers who attempt to replicate the original study should note that Nerlove used common (base 10) logs in his calculations, not natural logs. This change creates some numerical differences.

    [4] In the context of the econometric model, the restriction has a testable implication by the definition in Chapter 6. But the underlying economics require this restriction; it was used in deriving the cost function. Thus, it is unclear what is implied by a test of the restriction. Presumably, if the hypothesis of the restriction is rejected, the analysis should stop at that point, since without the restriction, the cost function is not a valid representation of the production function. We will encounter this conundrum again in another form in Chapter 14. Fortunately, in this instance, the hypothesis is not rejected. (It is in the application in Chapter 14.)

    TABLE 7.3  Cobb-Douglas Cost Functions (Standard Errors in Parentheses)

                  log Q              log PL − log PF     log PK − log PF      R^2
    All firms     0.721 (0.0174)     0.594 (0.205)       -0.0085 (0.191)      0.952
    Group 1       0.398              0.641               -0.093               0.512
    Group 2       0.668              0.105                0.364               0.635
    Group 3       0.931              0.408                0.249               0.571
    Group 4       0.915              0.472                0.133               0.871
    Group 5       1.045              0.604               -0.295               0.920

    Initial estimates of the parameters of the cost function are shown in the top row of Table 7.3. The hypothesis of constant returns to scale can be firmly rejected. The t ratio is (0.721 − 1)/0.0174 = −16.03, so we conclude that this estimate is significantly less than one or, by implication, r is significantly greater than one. Note that the coefficient on the capital price is negative. In theory, this should equal $\alpha_K/r$, which (unless the marginal product of capital is negative) should be positive. Nerlove attributed this to measurement error in the capital price variable. This seems plausible, but it carries with it the implication that the other coefficients are mismeasured as well. [See (5-31a,b). Christensen and Greene's (1976) estimator of this model with these data produced a positive estimate. See Section 14.3.1.]

    The striking pattern of the residuals shown in Figure 7.4⁵ and some thought about the implied form of the production function suggested that something was missing from the model.⁶ In theory, the estimated model implies a continually declining average cost curve, which in turn implies persistent economies of scale at all levels of output. This conflicts with the textbook notion of a U-shaped average cost curve and appears implausible for the data. Note the three clusters of residuals in the figure.

    Two approaches were used to analyze the model. By sorting the sample into five groups on the basis of output and fitting separate regressions to each group, Nerlove fit a piecewise loglinear model. The results are given in the lower rows of Table 7.3, where the firms in the successive groups are progressively larger. The results are persuasive that the (log-)linear cost function is inadequate. The output coefficient that rises toward and then crosses 1.0 is consistent with a U-shaped cost curve, as surmised earlier.

    A second approach was to expand the cost function to include a quadratic term in log output. This approach corresponds to a much more general model and produced the result given in Table 7.4. Again, a simple t test strongly suggests that increased generality is called for; t = 0.117/0.012 = 9.75. The output elasticity in this quadratic model is $\beta_q + 2\gamma_{qq} \log Q$.⁷ There are economies of scale when this value is less than one and constant returns to scale when it equals one. Using the two values given in the table (0.151 and 0.117, respectively), we find that this function does, indeed, produce a U-shaped average cost curve with minimum at $\log_{10} Q = (1 - 0.151)/(2 \times 0.117) = 3.628$, or Q = 4248, which was roughly in the middle of the range of outputs for Nerlove's sample of firms.
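    The arithmetic behind the minimum of the average cost curve can be checked with a few lines of Python. This is only a sketch: the two coefficients are the values quoted from Table 7.4, and the names `beta_q`, `gamma_qq`, and `output_elasticity` are illustrative labels, not notation from the text.

```python
# Coefficients quoted from Table 7.4 (log-quadratic cost function).
beta_q = 0.151    # coefficient on log Q
gamma_qq = 0.117  # coefficient on log^2 Q


def output_elasticity(log_q):
    """Elasticity of cost with respect to output: beta_q + 2*gamma_qq*log Q."""
    return beta_q + 2.0 * gamma_qq * log_q


# Average cost stops falling where the output elasticity reaches one.
log_q_star = (1.0 - beta_q) / (2.0 * gamma_qq)
q_star = 10.0 ** log_q_star  # logs are base 10, as in the text

print(round(log_q_star, 3))           # 3.628
print(output_elasticity(log_q_star))  # approximately 1
```

    Below `log_q_star` the elasticity is less than one (economies of scale); above it the elasticity exceeds one.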
    ⁵ The residuals are created as deviations of predicted total cost from actual, so they do not sum to zero.

    ⁶ A Durbin–Watson test of correlation among the residuals (see Section 12.5.1) revealed to the author a substantial autocorrelation. Although normally used with time series data, the Durbin–Watson statistic and a test for "autocorrelation" can be a useful tool for determining the appropriate functional form in a cross sectional model. To use this approach, it is necessary to sort the observations based on a variable of interest (output). Several clusters of residuals of the same sign suggested a need to reexamine the assumed functional form.

    ⁷ Nerlove inadvertently measured economies of scale from this function as $1/(\beta_q + \gamma_{qq} \log Q)$, where $\beta_q$ and $\gamma_{qq}$ are the coefficients on log Q and log² Q. The correct expression would have been $1/[\partial \log C/\partial \log Q] = 1/[\beta_q + 2\gamma_{qq} \log Q]$. This slip was periodically rediscovered in several later papers.

    FIGURE 7.4 Residuals from Predicted Cost. [Plot titled "Residuals from Total Cost": Residual (vertical axis) against LOGQ (horizontal axis).]

    This study was updated by Christensen and Greene (1976). Using the same data but a more elaborate (translog) functional form and by simultaneously estimating the factor demands and the cost function, they found results broadly similar to Nerlove's. Their preferred functional form did suggest that Nerlove's generalized model in Table 7.4 did somewhat underestimate the range of outputs in which unit costs of production would continue to decline. They also redid the study using a sample of 123 firms from 1970, and found similar results. In the latter sample, however, it appeared that many firms had expanded rapidly enough to exhaust the available economies of scale. We will revisit the 1970 data set in a study of efficiency in Section 17.6.4.

    The preceding example illustrates three useful tools in identifying and dealing with unspecified nonlinearity: analysis of residuals, the use of piecewise linear regression, and the use of polynomials to approximate the unknown regression function.

    7.3.3 INTRINSIC LINEARITY AND IDENTIFICATION

    The loglinear model illustrates an intermediate case of a nonlinear regression model. The equation is intrinsically linear by our definition; by taking logs of $Y_i = \alpha X_i^{\beta_2} e^{\varepsilon_i}$, we obtain

    $$\ln Y_i = \ln \alpha + \beta_2 \ln X_i + \varepsilon_i \tag{7-8}$$

    or

    $$y_i = \beta_1 + \beta_2 x_i + \varepsilon_i.$$

    Although this equation is linear in most respects, something has changed in that it is no longer linear in $\alpha$. Written in terms of $\beta_1$, we obtain a fully linear model. But that may not be the form of interest. Nothing is lost, of course, since $\beta_1$ is just $\ln \alpha$. If $\beta_1$ can be estimated, then an obvious estimate of $\alpha$ is suggested.

    This fact leads us to a second aspect of intrinsically linear models. Maximum likelihood estimators have an "invariance property." In the classical normal regression model, the maximum likelihood estimator of $\sigma$ is the square root of the maximum likelihood estimator of $\sigma^2$. Under some conditions, least squares estimators have the same property. By exploiting this, we can broaden the definition of linearity and include some additional cases that might otherwise be quite complex.

    TABLE 7.4 Log-Quadratic Cost Function (Standard Errors in Parentheses)

                 log Q      log² Q     log(PL/PF)   log(PK/PF)   R²
    All firms    0.151      0.117      0.498        0.062        0.95
                 (0.062)    (0.012)    (0.161)      (0.151)

    DEFINITION 7.1 Intrinsic Linearity
    In the classical linear regression model, if the K parameters $\beta_1, \beta_2, \ldots, \beta_K$ can be written as K one-to-one, possibly nonlinear functions of a set of K underlying parameters $\theta_1, \theta_2, \ldots, \theta_K$, then the model is intrinsically linear in $\theta$.

    Example 7.4 Intrinsically Linear Regression

    In Section 17.5.4, we will estimate the parameters of the model

    $$f(y \mid x, \beta, \rho) = \frac{(\beta + x)^{-\rho}}{\Gamma(\rho)}\, y^{\rho - 1} e^{-y/(\beta + x)}$$

    by maximum likelihood. In this model, $E[y \mid x] = \beta\rho + \rho x$, which suggests another way that we might estimate the two parameters. This function is an intrinsically linear regression model, $E[y \mid x] = \beta_1 + \beta_2 x$, in which $\beta_1 = \beta\rho$ and $\beta_2 = \rho$. We can estimate the parameters by least squares and then retrieve the estimate of $\beta$ using $b_1/b_2$. Since this value is a nonlinear function of the estimated parameters, we use the delta method to estimate the standard error. Using the data from that example, the least squares estimates of $\beta_1$ and $\beta_2$ (with standard errors in parentheses) are −4.1431 (23.734) and 2.4261 (1.5915). The estimated covariance is −36.979. The estimate of $\beta$ is −4.1431/2.4261 = −1.7077. We estimate the sampling variance of $\hat\beta$ with

    $$\text{Est.Var}[\hat\beta] = \left(\frac{\partial \hat\beta}{\partial b_1}\right)^2 \widehat{\text{Var}}[b_1] + \left(\frac{\partial \hat\beta}{\partial b_2}\right)^2 \widehat{\text{Var}}[b_2] + 2\left(\frac{\partial \hat\beta}{\partial b_1}\right)\left(\frac{\partial \hat\beta}{\partial b_2}\right) \widehat{\text{Cov}}[b_1, b_2] = 8.6889^2.$$

    Table 7.5 compares the least squares and maximum likelihood estimates of the parameters. The lower standard errors for the maximum likelihood estimates result from the inefficient (equal) weighting given to the observations by the least squares procedure. The gamma distribution is highly skewed. In addition, we know from our results in Appendix C that this distribution is an exponential family. We found for the gamma distribution that the sufficient statistics for this density were $\sum_i y_i$ and $\sum_i \ln y_i$. The least squares estimator does not use the second of these, whereas an efficient estimator will.
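    The delta method calculation in Example 7.4 can be reproduced numerically. The sketch below assumes that b1 and the covariance are both negative, a sign pattern consistent with the reported standard error of 8.6889; the variable names are illustrative.

```python
import numpy as np

# Least squares results quoted in Example 7.4 (signs are an assumption).
b1, b2 = -4.1431, 2.4261
se_b1, se_b2 = 23.734, 1.5915
cov_b1_b2 = -36.979

# Estimated covariance matrix of (b1, b2).
V = np.array([[se_b1**2, cov_b1_b2],
              [cov_b1_b2, se_b2**2]])

# beta_hat = b1 / b2 and its gradient with respect to (b1, b2).
beta_hat = b1 / b2
gradient = np.array([1.0 / b2, -b1 / b2**2])

# Delta method: Est.Var[beta_hat] = g' V g.
var_beta = gradient @ V @ gradient
se_beta = np.sqrt(var_beta)

print(round(beta_hat, 4), round(se_beta, 3))
```

    The quadratic form `gradient @ V @ gradient` expands to exactly the three terms in the variance expression: two squared-derivative terms and twice the cross term.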

    TABLE 7.5 Estimates of the Regression in a Gamma Model: Least Squares versus Maximum Likelihood

                          β                            ρ
                          Estimate   Standard Error    Estimate   Standard Error
    Least squares         −1.708     8.689             2.426      1.592
    Maximum likelihood    −4.719     2.403             3.151      0.663

    The emphasis in intrinsic linearity is on "one to one." If the conditions are met, then the model can be estimated in terms of the functions $\beta_1, \ldots, \beta_K$, and the underlying parameters derived after these are estimated. The one-to-one correspondence is an identification condition. If the condition is met, then the underlying parameters of the regression ($\theta$) are said to be exactly identified in terms of the parameters of the linear model $\beta$. An excellent example is provided by Kmenta (1986, p. 515).
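    Example 7.5 CES Production Function

    The constant elasticity of substitution production function may be written

    $$\ln y = \ln \gamma - \frac{\nu}{\rho} \ln\left[\delta K^{-\rho} + (1 - \delta) L^{-\rho}\right] + \varepsilon. \tag{7-9}$$

    A Taylor series approximation to this function around the point $\rho = 0$ is

    $$\ln y = \ln \gamma + \nu\delta \ln K + \nu(1 - \delta) \ln L + \rho\nu\delta(1 - \delta)\left\{-\tfrac{1}{2}[\ln K - \ln L]^2\right\} + \varepsilon' = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon', \tag{7-10}$$

    where $x_1 = 1$, $x_2 = \ln K$, $x_3 = \ln L$, $x_4 = -\tfrac{1}{2} \ln^2(K/L)$, and the transformations are

    $$\beta_1 = \ln \gamma, \quad \beta_2 = \nu\delta, \quad \beta_3 = \nu(1 - \delta), \quad \beta_4 = \rho\nu\delta(1 - \delta),$$
    $$\gamma = e^{\beta_1}, \quad \delta = \beta_2/(\beta_2 + \beta_3), \quad \nu = \beta_2 + \beta_3, \quad \rho = \beta_4(\beta_2 + \beta_3)/(\beta_2\beta_3). \tag{7-11}$$

    Estimates of $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$ can be computed by least squares. The estimates of $\gamma$, $\delta$, $\nu$, and $\rho$ obtained by the second row of (7-11) are the same as those we would obtain had we found the nonlinear least squares estimates of (7-10) directly. As Kmenta shows, however, they are not the same as the nonlinear least squares estimates of (7-9) due to the use of the Taylor series approximation to get to (7-10). We would use the delta method to construct the estimated asymptotic covariance matrix for the estimates of $\theta' = [\gamma, \delta, \nu, \rho]$. The derivatives matrix is

    $$C = \frac{\partial \theta}{\partial \beta'} = \begin{bmatrix} e^{\beta_1} & 0 & 0 & 0 \\ 0 & \beta_3/(\beta_2 + \beta_3)^2 & -\beta_2/(\beta_2 + \beta_3)^2 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -\beta_3\beta_4/(\beta_2^2\beta_3) & -\beta_2\beta_4/(\beta_2\beta_3^2) & (\beta_2 + \beta_3)/(\beta_2\beta_3) \end{bmatrix}.$$

    The estimated covariance matrix for $\hat\theta$ is $\hat{C}[s^2(X'X)^{-1}]\hat{C}'$.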
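    The mapping in (7-11) and its derivatives matrix can be sketched in Python. The function names and the numerical parameter values below are hypothetical, chosen only to show that the transformation round-trips and that the analytic derivatives agree with a finite-difference check.

```python
import numpy as np


def ces_params(beta):
    """Second row of (7-11): recover (gamma, delta, nu, rho) from beta_1..beta_4."""
    b1, b2, b3, b4 = beta
    return np.array([np.exp(b1),
                     b2 / (b2 + b3),
                     b2 + b3,
                     b4 * (b2 + b3) / (b2 * b3)])


def ces_jacobian(beta):
    """Derivatives matrix C = d(theta)/d(beta')."""
    b1, b2, b3, b4 = beta
    s = b2 + b3
    return np.array([
        [np.exp(b1), 0.0, 0.0, 0.0],
        [0.0, b3 / s**2, -b2 / s**2, 0.0],
        [0.0, 1.0, 1.0, 0.0],
        [0.0, -b4 / b2**2, -b4 / b3**2, s / (b2 * b3)],
    ])


def delta_method_cov(beta, cov_beta):
    """Estimated covariance of theta_hat: C [s^2 (X'X)^{-1}] C'."""
    C = ces_jacobian(beta)
    return C @ cov_beta @ C.T


# Hypothetical structural values, mapped to beta by the first row of (7-11).
gamma, delta, nu, rho = 1.5, 0.4, 0.9, 0.5
beta = np.array([np.log(gamma), nu * delta, nu * (1 - delta),
                 rho * nu * delta * (1 - delta)])
theta = ces_params(beta)
print(theta)  # recovers the hypothetical (gamma, delta, nu, rho)
```

    In practice `cov_beta` would be the least squares estimate $s^2(X'X)^{-1}$ from the regression on $x_1, \ldots, x_4$.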

    Not all models of the form

    $$y_i = \beta_1(\theta)x_{i1} + \beta_2(\theta)x_{i2} + \cdots + \beta_K(\theta)x_{iK} + \varepsilon_i \tag{7-12}$$

    are intrinsically linear. Recall that the condition that the functions be one to one (i.e., that the parameters be exactly identified) was required. For example,

    $$y_i = \alpha + \beta x_{i1} + \gamma x_{i2} + \beta\gamma x_{i3} + \varepsilon_i$$

    is nonlinear. The reason is that if we write it in the form of (7-12), we fail to account for the condition that $\beta_4$ equals $\beta_2\beta_3$, which is a nonlinear restriction. In this model, the three parameters $\alpha$, $\beta$, and $\gamma$ are overidentified in terms of the four parameters $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$. Unrestricted least squares estimates of $\beta_2$, $\beta_3$, and $\beta_4$ can be used to obtain two estimates of each of the underlying parameters, and there is no assurance that these will be the same.
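    A small simulation illustrates the overidentification problem. Everything in the sketch is hypothetical (the true values, sample size, and seed); it shows only that the two routes to each underlying parameter give different numbers when the restriction $\beta_4 = \beta_2\beta_3$ is not imposed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
alpha, beta, gamma = 1.0, 2.0, 3.0  # hypothetical true values

x = rng.normal(size=(n, 3))
y = (alpha + beta * x[:, 0] + gamma * x[:, 1]
     + beta * gamma * x[:, 2] + rng.normal(size=n))

# Unrestricted least squares: b[0..3] estimate beta_1..beta_4 in (7-12).
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]

# Two estimates of each underlying parameter.
beta_direct, beta_implied = b[1], b[3] / b[2]
gamma_direct, gamma_implied = b[2], b[3] / b[1]

print(beta_direct, beta_implied)
print(gamma_direct, gamma_implied)
```

    Both estimates of each parameter are close to the true value in a sample this large, but they do not coincide, which is the sense in which the unrestricted regression overidentifies $\beta$ and $\gamma$.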

    7.4 MODELING AND TESTING FOR A STRUCTURAL BREAK

    One of the more common applications of the F test is in tests of structural change.⁸ In specifying a regression model, we assume that its assumptions apply to all the observations in our sample. It is straightforward, however, to test the hypothesis that some of or all the regression coefficients are different in different subsets of the data. To analyze a number of examples, we will revisit the data on the U.S. gasoline market⁹ that we examined in Example 2.3. As Figure 7.5 following suggests, this market behaved in predictable, unremarkable fashion prior to the oil shock of 1973 and was quite volatile thereafter. The large jumps in price in 1973 and 1980 are clearly visible, as is the much greater variability in consumption. It seems unlikely that the same regression model would apply to both periods.

    7.4.1 DIFFERENT PARAMETER VECTORS

    The gasoline consumption data span two very different periods. Up to 1973, fuel was plentiful and world prices for gasoline had been stable or falling for at least two decades. The embargo of 1973 marked a transition in this market (at least for a decade or so), marked by shortages, rising prices, and intermittent turmoil. It is possible that the entire relationship described by our regression model changed in 1974. To test this as a hypothesis, we could proceed as follows: Denote the first 14 years of the data in y and X as $y_1$ and $X_1$ and the remaining years as $y_2$ and $X_2$. An unrestricted regression that allows the coefficients to be different in the two periods is

    $$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 & 0 \\ 0 & X_2 \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}. \tag{7-13}$$

    Denoting the data matrices as y and X, we find that the unrestricted least squares estimator is

    $$b = (X'X)^{-1}X'y = \begin{bmatrix} X_1'X_1 & 0 \\ 0 & X_2'X_2 \end{bmatrix}^{-1} \begin{bmatrix} X_1'y_1 \\ X_2'y_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}, \tag{7-14}$$

    which is least squares applied to the two equations separately. Therefore, the total sum of squared residuals from this regression will be the sum of the two residual sums of squares from the two separate regressions:

    $$e'e = e_1'e_1 + e_2'e_2.$$

    ⁸ This test is often labeled a Chow test, in reference to Chow (1960).

    ⁹ The data are listed in Appendix Table A6.1.

    FIGURE 7.5 Gasoline Price and Per Capita Consumption, 1960–1995. [Plot of PG (vertical axis) against G (horizontal axis).]

    The restricted coefficient vector can be obtained in two ways. Formally, the restriction $\beta_1 = \beta_2$ is $R\beta = q$, where $R = [I : -I]$ and $q = 0$. The general result given earlier can be applied directly. An easier way to proceed is to build the restriction directly into the model. If the two coefficient vectors are the same, then (7-13) may be written

    $$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \beta + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix},$$

    and the restricted estimator can be obtained simply by stacking the data and estimating a single regression. The residual sum of squares from this restricted regression, $e_*'e_*$, then forms the basis for the test. The test statistic is then given in (6-6), where J, the number of restrictions, is the number of columns in $X_2$ and the denominator degrees of freedom is $n_1 + n_2 - 2k$.
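    The F statistic just described depends only on three sums of squared residuals and the sample sizes, so it is easy to sketch in Python. The function name `chow_f` and its argument names are illustrative. The numbers are the sums of squares used in Example 7.6 for the gasoline market, with the pooled value 0.02518777 taken from the dummy variable version of the computation in that example.

```python
def chow_f(ssr_restricted, ssr_1, ssr_2, n1, n2, k, j=None):
    """F statistic for equality of coefficients across two subsamples.

    ssr_restricted : sum of squared residuals from the restricted regression
    ssr_1, ssr_2   : sums of squared residuals from the two separate regressions
    k              : number of coefficients in each separate regression
    j              : number of restrictions (defaults to k, all coefficients equal)
    """
    j = k if j is None else j
    ssr_unrestricted = ssr_1 + ssr_2
    numerator = (ssr_restricted - ssr_unrestricted) / j
    denominator = ssr_unrestricted / (n1 + n2 - 2 * k)
    return numerator / denominator


# All six coefficients equal in both periods: F[6, 24].
f_all = chow_f(0.02518777, 0.000652271, 0.004662163, n1=14, n2=22, k=6)

# Separate constants, common slopes: F[5, 24].
f_slopes = chow_f(0.0176034, 0.000652271, 0.004662163, n1=14, n2=22, k=6, j=5)

print(round(f_all, 3), round(f_slopes, 3))
```

    The same function covers the subset tests of Section 7.4.3: only the restricted sum of squares and the number of restrictions `j` change.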
    7.4.2 INSUFFICIENT OBSERVATIONS

    In some circumstances, the data series are not long enough to estimate one or the other of the separate regressions for a test of structural change. For example, one might surmise that consumers took a year or two to adjust to the turmoil of the two oil price shocks in 1973 and 1979, but that the market never actually fundamentally changed or that it only changed temporarily. We might consider the same test as before, but now only single out the four years 1974, 1975, 1980, and 1981 for special treatment. Since there are six coefficients to estimate but only four observations, it is not possible to fit the two separate models. Fisher (1970) has shown that in such a circumstance, a valid way to proceed is as follows:

    1. Estimate the regression, using the full data set, and compute the restricted sum of squared residuals, $e_*'e_*$.
    2. Use the longer (adequate) subperiod ($n_1$ observations) to estimate the regression, and compute the unrestricted sum of squares, $e_1'e_1$. This latter computation is done assuming that with only $n_2 < K$ observations, we could obtain a perfect fit and thus contribute zero to the sum of squares.
    3. The F statistic is then computed, using

    $$F[n_2, n_1 - K] = \frac{(e_*'e_* - e_1'e_1)/n_2}{e_1'e_1/(n_1 - K)}. \tag{7-15}$$

    Note that the numerator degrees of freedom is $n_2$, not K.¹⁰ This test has been labeled the Chow predictive test because it is equivalent to extending the restricted model to the shorter subperiod and basing the test on the prediction errors of the model in this latter period. We will have a closer look at that result in Section 7.5.3.
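    The predictive test in (7-15) can be sketched as a one-line computation. The function name is illustrative; the inputs are the two sums of squares reported for the gasoline market in Example 7.6, where the four years 1974, 1975, 1980, and 1981 are set aside.

```python
def chow_predictive_f(ssr_full, ssr_long, n1, n2, k):
    """Chow predictive F statistic of (7-15), with [n2, n1 - k] degrees of freedom.

    ssr_full : restricted sum of squares from the full sample (n1 + n2 observations)
    ssr_long : sum of squares from the longer subperiod (n1 observations)
    """
    return ((ssr_full - ssr_long) / n2) / (ssr_long / (n1 - k))


# Gasoline market: 36 observations in total, 4 singled out, 6 coefficients.
f_pred = chow_predictive_f(0.02518777, 0.01968599, n1=32, n2=4, k=6)
print(round(f_pred, 3))
```

    Note that the numerator is divided by `n2`, the number of observations set aside, rather than by the number of coefficients.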
    7.4.3 CHANGE IN A SUBSET OF COEFFICIENTS

    The general formulation previously suggested lends itself to many variations that allow a wide range of possible tests. Some important particular cases are suggested by our gasoline market data. One possible description of the market is that after the oil shock of 1973, Americans simply reduced their consumption of gasoline by a fixed proportion, but other relationships in the market, such as the income elasticity, remained unchanged. This case would translate to a simple shift downward of the log-linear regression model or a reduction only in the constant term. Thus, the unrestricted equation has separate coefficients in the two periods, while the restricted equation is a pooled regression with separate constant terms. The regressor matrices for these two cases would be of the form

    $$\text{(unrestricted)} \quad X_U = \begin{bmatrix} i & 0 & W_{pre73} & 0 \\ 0 & i & 0 & W_{post73} \end{bmatrix}$$

    and

    $$\text{(restricted)} \quad X_R = \begin{bmatrix} i & 0 & W_{pre73} \\ 0 & i & W_{post73} \end{bmatrix}.$$

    The first two columns of $X_U$ are dummy variables that indicate the subperiod in which the observation falls.

    Another possibility is that the constant and one or more of the slope coefficients changed, but the remaining parameters remained the same. The results in Table 7.6 suggest that the constant term and the price and income elasticities changed much more than the cross-price elasticities and the time trend. The Chow test for this type of restriction looks very much like the one for the change in the constant term alone. Let Z denote the variables whose coefficients are believed to have changed, and let W denote the variables whose coefficients are thought to have remained constant. Then, the regressor matrix in the constrained regression would appear as

    $$X = \begin{bmatrix} i_{pre} & 0 & Z_{pre} & 0 & W_{pre} \\ 0 & i_{post} & 0 & Z_{post} & W_{post} \end{bmatrix}. \tag{7-16}$$

    As before, the unrestricted coefficient vector is the combination of the two separate regressions.

    ¹⁰ One way to view this is that only $n_2 < K$ coefficients are needed to obtain this perfect fit.
    7.4.4 TESTS OF STRUCTURAL BREAK WITH UNEQUAL VARIANCES

    An important assumption made in using the Chow test is that the disturbance variance is the same in both (or all) regressions. In the restricted model, if this is not true, the first $n_1$ elements of $\varepsilon$ have variance $\sigma_1^2$, whereas the next $n_2$ have variance $\sigma_2^2$, and so on. The restricted model is, therefore, heteroscedastic, and our results for the classical regression model no longer apply. As analyzed by Schmidt and Sickles (1977), Ohtani and Toyoda (1985), and Toyoda and Ohtani (1986), it is quite likely that the actual probability of a type I error will be larger than the significance level we have chosen. (That is, we shall regard as large an F statistic that is actually less than the appropriate but unknown critical value.) Precisely how severe this effect is going to be will depend on the data and the extent to which the variances differ, in ways that are not likely to be obvious.

    If the sample size is reasonably large, then we have a test that is valid whether or not the disturbance variances are the same. Suppose that $\hat\theta_1$ and $\hat\theta_2$ are two consistent and asymptotically normally distributed estimators of a parameter based on independent samples,¹¹ with asymptotic covariance matrices $V_1$ and $V_2$. Then, under the null hypothesis that the true parameters are the same, $\hat\theta_1 - \hat\theta_2$ has mean 0 and asymptotic covariance matrix $V_1 + V_2$. Under the null hypothesis, the Wald statistic,

    $$W = (\hat\theta_1 - \hat\theta_2)'(\hat{V}_1 + \hat{V}_2)^{-1}(\hat\theta_1 - \hat\theta_2), \tag{7-17}$$

    has a limiting chi-squared distribution with K degrees of freedom. A test that the difference between the parameters is zero can be based on this statistic.¹² It is straightforward to apply this to our test of common parameter vectors in our regressions. Large values of the statistic lead us to reject the hypothesis.

    In a small or moderately sized sample, the Wald test has the unfortunate property that the probability of a type I error is persistently larger than the critical level we use to carry it out. (That is, we shall too frequently reject the null hypothesis that the parameters are the same in the subsamples.) We should be using a larger critical value.
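    A minimal sketch of the Wald statistic in (7-17) follows. The function name and the toy inputs are hypothetical; in an application, the two parameter vectors and covariance matrices would come from the two separately estimated regressions.

```python
import numpy as np


def wald_equal_params(theta1, theta2, V1, V2):
    """Wald statistic (7-17) for equality of two independently estimated vectors."""
    d = np.asarray(theta1, dtype=float) - np.asarray(theta2, dtype=float)
    V = np.asarray(V1, dtype=float) + np.asarray(V2, dtype=float)
    # Solve V z = d rather than forming the inverse explicitly.
    return float(d @ np.linalg.solve(V, d))


# Toy example with K = 2: the summed covariance matrix is the identity,
# so the statistic is just the squared length of the difference.
w = wald_equal_params([1.0, 2.0], [0.0, 0.0], 0.5 * np.eye(2), 0.5 * np.eye(2))
print(w)  # 5.0
```

    The statistic is compared with a chi-squared critical value with K degrees of freedom, for example 12.59 at the 5 percent level when K = 6.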
    ¹¹ Without the required independence, this test and several similar ones will fail completely. The problem becomes a variant of the famous Behrens–Fisher problem.

    ¹² See Andrews and Fair (1988). The true size of this suggested test is uncertain. It depends on the nature of the alternative. If the variances are radically different, the assumed critical values might be somewhat unreliable.

    Ohtani and Kobayashi (1986) have devised a "bounds" test that gives a partial remedy for the problem.¹³

    It has been observed that the size of the Wald test may differ from what we have assumed, and that the deviation would be a function of the alternative hypothesis. There are two general settings in which a test of this sort might be of interest. For comparing two possibly different populations, such as the labor supply equations for men versus women, not much more can be said about the suggested statistic in the absence of specific information about the alternative hypothesis. But a great deal of work on this type of statistic has been done in the time series context. In this instance, the nature of the alternative is rather more clearly defined. We will return to this analysis of structural breaks in time series models in Section 7.5.4.

    7.5 TESTS OF MODEL STABILITY

    The tests of structural change described in Section 7.4 assume that the process underlying the data is stable up to a known transition point, where it makes a discrete change to a new, but thereafter stable, structure. In our gasoline market, that might be a reasonable assumption. In many other settings, however, the change to a new regime might be more gradual and less obvious. In this section, we will examine two tests that are based on the idea that a regime change might take place slowly, and at an unknown point in time, or that the regime underlying the observed data might simply not be stable at all.
    7.5.1 HANSEN'S TEST

    Hansen's (1992) test of model stability is based on a cumulative sum of the least squares residuals. From the least squares normal equations, we have

    $$\sum_{t=1}^{T} x_t e_t = 0 \quad \text{and} \quad \sum_{t=1}^{T} \left(e_t^2 - \frac{e'e}{n}\right) = 0.$$

    Let the vector $f_t$ be the $(K + 1) \times 1$ $t$th observation in this pair of sums. Then, $\sum_{t=1}^{T} f_t = 0$. Let the sequence of partial sums be $s_t = \sum_{r=1}^{t} f_r$, so $s_T = 0$. Finally, let $F = T \sum_{t=1}^{T} f_t f_t'$ and $S = \sum_{t=1}^{T} s_t s_t'$. Hansen's test statistic can be computed simply as $H = \text{tr}(F^{-1}S)$. Large values of H give evidence against the hypothesis of model stability. The logic of Hansen's test is that if the model is stable through the T periods, then the cumulative sums in S will not differ greatly from those in F. Note that the statistic involves both the regression and the variance. The distribution theory underlying this nonstandard test statistic is much more complicated than the computation. Hansen provides asymptotic critical values for the test of model constancy which vary with the number of coefficients in the model. A few values for the 95 percent significance level are 1.01 for K = 2, 1.90 for K = 6, 3.75 for K = 15, and 4.52 for K = 19.
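    The computation of H can be sketched in a few lines of NumPy. The function name is illustrative and the data are simulated under a stable model with hypothetical coefficients, so the example shows the mechanics only, not a result from the text.

```python
import numpy as np


def hansen_statistic(X, y):
    """Hansen's stability statistic H = tr(F^{-1} S) for a least squares regression."""
    T = X.shape[0]
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    # f_t stacks x_t * e_t (K elements) and e_t^2 - e'e/T (one element).
    f = np.column_stack([X * e[:, None], e**2 - (e @ e) / T])
    s = np.cumsum(f, axis=0)   # partial sums s_t; the last row is zero
    F = T * (f.T @ f)          # F = T * sum of f_t f_t'
    S = s.T @ s                # S = sum of s_t s_t'
    return float(np.trace(np.linalg.solve(F, S)))


# Simulated stable regression with K = 3 (hypothetical coefficients).
rng = np.random.default_rng(0)
T = 100
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=T)

H = hansen_statistic(X, y)
print(H)
```

    The resulting value would be compared with Hansen's tabulated critical value for the relevant K.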
    ¹³ See also Kobayashi (1986). An alternative, somewhat more cumbersome test is proposed by Jayatissa (1977). Further discussion is given in Thursby (1982).

    7.5.2 RECURSIVE RESIDUALS AND THE CUSUMS TEST

    Example 7.6 shows a test of structural change based essentially on the model's ability to predict correctly outside the range of the observations used to estimate it. A similar logic underlies an alternative test of model stability proposed by Brown, Durbin, and Evans (1975) based on recursive residuals. The technique is appropriate for time-series data and might be used if one is uncertain about when a structural change might have taken place. The null hypothesis is that the coefficient vector $\beta$ is the same in every period; the alternative is simply that it (or the disturbance variance) is not. The test is quite general in that it does not require a prior specification of when the structural change takes place. The cost, however, is that the power of the test is rather limited compared with that of the Chow test.¹⁴

    Suppose that the sample contains a total of T observations.¹⁵ The $t$th recursive residual is the ex post prediction error for $y_t$ when the regression is estimated using only the first $t - 1$ observations. Since it is computed for the next observation beyond the sample period, it is also labeled a one step ahead prediction error;

    $$e_t = y_t - x_t'b_{t-1},$$

    where $x_t$ is the vector of regressors associated with observation $y_t$ and $b_{t-1}$ is the least squares coefficients computed using the first $t - 1$ observations. The forecast variance of this residual is

    $$\sigma_{ft}^2 = \sigma^2\left[1 + x_t'(X_{t-1}'X_{t-1})^{-1}x_t\right]. \tag{7-18}$$

    Let the $r$th scaled residual be

    $$w_r = \frac{e_r}{\sqrt{1 + x_r'(X_{r-1}'X_{r-1})^{-1}x_r}}. \tag{7-19}$$

    Under the hypothesis that the coefficients remain constant during the full sample period, $w_r \sim N[0, \sigma^2]$ and is independent of $w_s$ for all $s \neq r$. Evidence that the distribution of $w_r$ is changing over time weighs against the hypothesis of model stability.

    One way to examine the residuals for evidence of instability is to plot $w_r/\hat\sigma$ (see below) simply against the date. Under the hypothesis of the model, these residuals are uncorrelated and are approximately normally distributed with mean zero and standard deviation 1. Evidence that these residuals persistently stray outside the error bounds −2 and +2 would suggest model instability. (Some authors and some computer packages plot $e_r$ instead, in which case the error bounds are $\pm 2\hat\sigma[1 + x_r'(X_{r-1}'X_{r-1})^{-1}x_r]^{1/2}$.)

    ¹⁴ The test is frequently criticized on this basis. The Chow test, however, is based on a rather definite piece of information, namely, when the structural change takes place. If this is not known or must be estimated, then the advantage of the Chow test diminishes considerably.

    ¹⁵ Since we are dealing explicitly with time-series data at this point, it is convenient to use T instead of n for the sample size and t instead of i to index observations.

    The CUSUM test is based on the cumulated sum of the residuals:

    $$W_t = \sum_{r=K+1}^{t} \frac{w_r}{\hat\sigma}, \tag{7-20}$$

    where $\hat\sigma^2 = (T - K - 1)^{-1} \sum_{r=K+1}^{T} (w_r - \bar{w})^2$ and $\bar{w} = (T - K)^{-1} \sum_{r=K+1}^{T} w_r$. Under the null hypothesis, $W_t$ has a mean of zero and a variance approximately equal to the number of residuals being summed (because each term has variance 1 and they are independent). The test is performed by plotting $W_t$ against t. Confidence bounds for the sum are obtained by plotting the two lines that connect the points $[K, \pm a(T - K)^{1/2}]$ and $[T, \pm 3a(T - K)^{1/2}]$. Values of a that correspond to various significance levels can be found in their paper. Those corresponding to 95 percent and 99 percent are 0.948 and 1.143, respectively. The hypothesis is rejected if $W_t$ strays outside the boundaries.
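    The recursive residuals of (7-19) and the CUSUM path of (7-20) can be sketched as follows. The function names are illustrative and the data are simulated with hypothetical coefficients; the loop refits the regression at each step for clarity rather than using an updating formula.

```python
import numpy as np


def recursive_residuals(X, y):
    """Scaled one-step-ahead prediction errors w_r of (7-19), r = K+1, ..., T."""
    T, K = X.shape
    w = np.empty(T - K)
    for t in range(K, T):  # 0-based index of the observation being predicted
        X_prev, y_prev = X[:t], y[:t]
        b_prev = np.linalg.lstsq(X_prev, y_prev, rcond=None)[0]
        x_t = X[t]
        h = x_t @ np.linalg.solve(X_prev.T @ X_prev, x_t)
        w[t - K] = (y[t] - x_t @ b_prev) / np.sqrt(1.0 + h)
    return w


def cusum_path(w, a=0.948):
    """CUSUM W_t of (7-20) and the upper bound (the lower bound is its negative)."""
    n = len(w)                          # n = T - K
    sigma_hat = np.std(w, ddof=1)       # divisor T - K - 1, centered at w-bar
    W = np.cumsum(w) / sigma_hat
    steps = np.arange(1, n + 1)
    # Line through [K, a*sqrt(T-K)] and [T, 3a*sqrt(T-K)].
    upper = a * np.sqrt(n) * (1.0 + 2.0 * steps / n)
    return W, upper


# Simulated stable regression with K = 3 (hypothetical coefficients).
rng = np.random.default_rng(1)
T, K = 40, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=T)

w = recursive_residuals(X, y)
W, upper = cusum_path(w)
print(bool(np.any(np.abs(W) > upper)))  # True would indicate a crossing of the bounds
```

    A useful check on an implementation is that the sum of squared recursive residuals equals the residual sum of squares from least squares on the full sample.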
    Example 7.6 Structural Break in the Gasoline Market

    The previous Figure 7.5 shows a plot of prices and quantities in the U.S. gasoline market from 1960 to 1995. The first 13 points are the layer at the bottom of the figure and suggest an orderly market. The remainder clearly reflect the subsequent turmoil in this market. We will use the Chow tests described to examine this market. The model we will examine is the one suggested in Example 2.3, with the addition of a time trend:

    $$\ln(G/pop)_t = \beta_1 + \beta_2 \ln(I/pop)_t + \beta_3 \ln P_{Gt} + \beta_4 \ln P_{NCt} + \beta_5 \ln P_{UCt} + \beta_6 t + \varepsilon_t.$$

    The three prices in the equation are for G, new cars, and used cars. I/pop is per capita income, and G/pop is per capita gasoline consumption. Regression results for four functional forms are shown in Table 7.6. Using the data for the entire sample, 1960 to 1995, and for the two subperiods, 1960 to 1973 and 1974 to 1995, we obtain the three estimated regressions in the first and last two columns. The F statistic for testing the restriction that the coefficients in the two equations are the same is

    $$F[6, 24] = \frac{(0.02518777 - 0.000652271 - 0.004662163)/6}{(0.000652271 + 0.004662163)/(14 + 22 - 12)} = 14.958.$$

    The tabled critical value is 2.51, so, consistent with our expectations, we would reject the hypothesis that the coefficient vectors are the same in the two periods. Using the full set of 36 observations to fit the model, the sum of squares is $e_*'e_* = 0.02518777$. When the $n_2 = 4$ observations for 1974, 1975, 1980, and 1981 are removed from the sample, the sum of squares falls to $e_1'e_1 = 0.01968599$. The F statistic is 1.817. Since the tabled critical value for F[4, 32 − 6] is 2.72, we would not reject the hypothesis of stability. The conclusion to this point would be that although something has surely changed in the market, the hypothesis of a temporary disequilibrium seems not to be an adequate explanation.

    An alternative way to compute this statistic might be more convenient. Consider the original arrangement, with all 36 observations. We now add to this regression four binary variables, Y1974, Y1975, Y1980, and Y1981. Each of these takes the value one in the single year indicated and zero in all 35 remaining years. We then compute the regression with the original six variables and these four additional dummy variables. The sum of squared residuals in this regression is 0.01968599, so the F statistic for testing the joint hypothesis that the four coefficients are zero is

    $$F[4, 36 - 10] = \frac{(0.02518777 - 0.01968599)/4}{0.01968599/(36 - 10)} = 1.817$$

    once again. (See Section 7.4.2 for discussion of this test.)

    The F statistic for testing the restriction that the coefficients in the two equations are the same apart from the constant term is based on the last three sets of results in the table:

    $$F[5, 24] = \frac{(0.0176034 - 0.000652271 - 0.004662163)/5}{(0.000652271 + 0.004662163)/(14 + 22 - 12)} = 11.099.$$

    TABLE 7.6 Gasoline Consumption Equations

    Coefficients      1960–1995     Pooled        Preshock       Postshock
    Constant          24.6718       21.2630       51.1812
    Constant                        21.3403                      20.4464
    ln I/pop          1.95463       1.83817       0.423995       1.01408
    ln PG             0.115530      0.178004      0.0945467      0.242374
    ln PNC            0.205282      0.209842      0.583896       0.330168
    ln PUC            0.129274      0.128132      0.334619       0.0553742
    Year              0.019118      0.168618      0.0263665      0.0126170
    R²                0.968275      0.978142      0.998033       0.920642
    Standard error    0.02897572    0.02463767    0.00902961     0.017000
    Sum of squares    0.02518777    0.0176034     0.000652271    0.004662163

    The tabled critical value is 2.62, so this hypothesis is rejected as well. The data suggest that the models for the two periods are systematically different, beyond a simple shift in the constant term. The F ratio that results from estimating the model subject to the restriction that the two automobile price elasticities and the coefficient on the time trend are unchanged is

    $$F[3, 24] = \frac{(0.00802099 - 0.000652271 - 0.004662163)/3}{(0.000652271 + 0.004662163)/(14 + 22 - 12)} = 4.086.$$

    (The restricted regression is not shown.) The critical value from the F table is 3.01, so this hypothesis is rejected as well. Note, however, that this value is far smaller than those we obtained previously. The cumulative probability for this value is 0.981 (a P value of 0.019), so, in fact, at the 1 percent significance level, we would not have rejected the hypothesis. This fact suggests that the bulk of the difference in the models across the two periods is, indeed, explained by the changes in the constant and the price and income elasticities.

    The test statistic in (7-17) for the regression results in Table 7.6 gives a value of 128.6673. The 5 percent critical value from the chi-squared table for 6 degrees of freedom is 12.59. So, on the basis of the Wald test, we would reject the hypothesis that the same coefficient vector applies in the two subperiods 1960 to 1973 and 1974 to 1995. We should note that the Wald statistic is valid only in large samples, and our samples of 14 and 22 observations hardly meet that standard.

    We have tested the hypothesis that the regression model for the gasoline market changed in 1973, and on the basis of the F test (Chow test) we strongly rejected the hypothesis of model stability. Hansen's test is not consistent with this result; using the computations outlined earlier, we obtain a value of H = 1.7249. Since the critical value is 1.90, the hypothesis of model stability is now not rejected.

    Figure 7.6 shows the CUSUM test for the gasoline market. The results here are more or less consistent with the preceding results. The figure does suggest a structural break, though at 1984, not at 1974 or 1980 when we might have expected it.
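    7.5.3 PREDICTIVE TEST

    The hypothesis test defined in (7-15) in Section 7.4.2 is equivalent to $H_0: \beta_2 = \beta_1$ in the model

    $$y_t = x_t'\beta_1 + \varepsilon_t, \quad t = 1, \ldots, T_1,$$
    $$y_t = x_t'\beta_2 + \varepsilon_t, \quad t = T_1 + 1, \ldots, T_1 + T_2.$$

    (Note that the disturbance variance is assumed to be the same in both subperiods.) An alternative formulation of the model (the one used in the example) is

    $$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} X_1 & 0 \\ X_2 & I \end{bmatrix} \begin{bmatrix} \beta \\ \gamma \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}.$$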

    FIGURE 7.6 CUSUM Test. [Plot titled "Plot of Cumulative Sum of Residuals": Cusum (vertical axis) against Year, 1959 to 1999 (horizontal axis).]

    This formulation states that yt xt 1 t yt xt 2 t t

    t 1 T1 t T1 1 T1 T2

    Since each t is unrestricted this alternative formulation states that the regression model of the rst T1 periods ceases to operate in the second subperiod and in fact no systematic model operates in the second subperiod A test of the hypothesis 0 in this framework would thus be a test of model stability The least squares coef cients for this regression can be found by using the formula for the partitioned inverse matrix b c X 1 X 1 X2 X 2 X2 X1 X1 1 X2 X1 X1 b1 c2
    1

    X2 I

    1

    X1 y1 X2 y2 y2



    X1 X1 1 X2 I X2 X1 X1 X2
    1

    X1 y1 X2 y2 y2



    where $\mathbf{b}_1$ is the vector of least squares slopes based on the first $T_1$ observations and $\mathbf{c}_2$ is $\mathbf{y}_2 - \mathbf{X}_2\mathbf{b}_1$. The covariance matrix for the full set of estimates is $s^2$ times the bracketed matrix. The two subvectors of residuals in this regression are $\mathbf{e}_1 = \mathbf{y}_1 - \mathbf{X}_1\mathbf{b}_1$ and $\mathbf{e}_2 = \mathbf{y}_2 - (\mathbf{X}_2\mathbf{b}_1 + \mathbf{I}\mathbf{c}_2) = \mathbf{0}$, so the sum of squared residuals in this least squares regression is just $\mathbf{e}_1'\mathbf{e}_1$. This is the same sum of squares as appears in (7-15). The degrees of freedom for the denominator is $[T_1 + T_2 - (K + T_2)] = T_1 - K$ as well, and the degrees of freedom for the numerator is the number of elements in $\boldsymbol{\gamma}$, which is $T_2$. The restricted regression with $\boldsymbol{\gamma} = \mathbf{0}$ is the pooled model, which is likewise the same as appears in (7-15). This implies that the F statistic for testing the null hypothesis in this model is precisely that which appeared earlier in (7-15), which suggests why the test is labeled the "predictive test."
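The equivalence described above can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using simulated data; the function names `sse` and `predictive_f` and all data values are illustrative assumptions, not part of the text. It computes the F statistic once from the pooled and first-period sums of squares and once from the regression augmented with one dummy variable per second-period observation.

```python
import numpy as np

def sse(X, y):
    """Sum of squared least squares residuals from regressing y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    return float(e @ e)

def predictive_f(X1, y1, X2, y2):
    """Predictive F statistic computed two ways (hypothetical helper)."""
    T1, K = X1.shape
    T2 = X2.shape[0]
    X = np.vstack([X1, X2])
    y = np.concatenate([y1, y2])
    pooled = sse(X, y)        # restricted regression: gamma = 0
    first = sse(X1, y1)       # e1'e1 from the first T1 observations
    f_direct = ((pooled - first) / T2) / (first / (T1 - K))
    # Dummy-variable form: one indicator per second-period observation.
    D = np.vstack([np.zeros((T1, T2)), np.eye(T2)])
    unrestricted = sse(np.hstack([X, D]), y)
    f_dummy = ((pooled - unrestricted) / T2) / (unrestricted / (T1 - K))
    return f_direct, f_dummy

# Simulated data (assumed values, for illustration only).
rng = np.random.default_rng(0)
T1, T2, K = 30, 6, 3
X1 = np.column_stack([np.ones(T1), rng.normal(size=(T1, K - 1))])
X2 = np.column_stack([np.ones(T2), rng.normal(size=(T2, K - 1))])
beta = np.array([1.0, 0.5, -0.25])
y1 = X1 @ beta + rng.normal(size=T1)
y2 = X2 @ beta + rng.normal(size=T2)
f_direct, f_dummy = predictive_f(X1, y1, X2, y2)
```

The two values agree up to rounding error because the dummy variables fit the second-period observations exactly, leaving only the first-period residuals in the unrestricted sum of squares.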
    7.5.4 UNKNOWN TIMING OF THE STRUCTURAL BREAK^16

    The testing procedures described in this section all assume that the point of the structural break is known. When this corresponds to a discrete historical event, this is a reasonable assumption. But, in some applications, the timing of the break may be unknown. The Chow and Wald tests become useless at this point. The CUSUMS test is a step in the right direction for this situation, but, as noted by a number of authors [e.g., Andrews (1993)], it has serious power problems. Recent research has provided several strategies for testing for structural change when the change point is unknown.

    In Section 7.4 we considered a test of parameter equality in two populations. The natural approach suggested there was a comparison of two separately estimated parameter vectors based on the Wald criterion,
    \[ W = (\hat{\boldsymbol{\theta}}_1 - \hat{\boldsymbol{\theta}}_2)'(\mathbf{V}_1 + \mathbf{V}_2)^{-1}(\hat{\boldsymbol{\theta}}_1 - \hat{\boldsymbol{\theta}}_2), \]
    where 1 and 2 denote the two populations. An alternative approach to the testing procedure is based on a likelihood ratio-like statistic,
    \[ \lambda = h[(L_1 + L_2), L], \]
    where $L_1 + L_2$ is the log likelihood function (or other estimation criterion) under the alternative hypothesis of model instability (structural break), $L$ is the log likelihood for the pooled estimator based on the null hypothesis of stability, and $h$ is the appropriate function of the values, such as $h(a, b) = -2(b - a)$ for maximum likelihood estimation. A third approach, based on the Lagrange multiplier principle, will be developed below. There is a major problem with this approach: the split between the two subsamples must be known in advance. In the time series application we will examine in this section, the problem to be analyzed is that of determining whether a model can be claimed to be stable through a sample period $t = 1, \ldots, T$ against the alternative hypothesis that the structure changed at some unknown time $t^*$. Knowledge of the sample split is crucial for the tests suggested above, so some new results are called for.

    We suppose that the model $E[\mathbf{m}(y_t, \mathbf{x}_t \mid \boldsymbol{\theta})] = \mathbf{0}$ is to be estimated by GMM using $T$ observations. The model is stated in terms of a moment condition, but we intend for this to include estimation by maximum likelihood, or linear or nonlinear least squares. As noted earlier, all these cases are included. Assuming GMM just provides us a convenient way to analyze all the cases at the same time. The hypothesis to be investigated is as follows: Let $[\pi T] = T_1$ denote the integer part of $\pi T$, where $0 < \pi < 1$. Thus, this is a proportion of the sample observations and defines subperiod 1, $t = 1, \ldots, T_1$. Under the null hypothesis, the model $E[\mathbf{m}(y_t, \mathbf{x}_t \mid \boldsymbol{\theta})] = \mathbf{0}$ is stable for the entire sample period. Under the alternative hypothesis, the model $E[\mathbf{m}(y_t, \mathbf{x}_t \mid \boldsymbol{\theta}_1)] = \mathbf{0}$ applies to
    16 The material in this section is more advanced than that in the discussion thus far. It may be skipped at this point with no loss in continuity. Since this section relies heavily on GMM estimation methods, you may wish to read Chapter 18 before continuing.


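    observations $1, \ldots, [\pi T]$ and model $E[\mathbf{m}(y_t, \mathbf{x}_t \mid \boldsymbol{\theta}_2)] = \mathbf{0}$ applies to the remaining $T - [\pi T]$ observations.^17 This describes a nonstandard sort of hypothesis test since, under the null hypothesis, the parameter of interest, $\pi$, is not even part of the model. Andrews and Ploberger (1994) denote this a "nuisance parameter [that] is present only under the alternative."

    Suppose $\pi$ were known. Then, the optimal GMM estimator for the first subsample would be obtained by minimizing with respect to the parameters $\boldsymbol{\theta}_1$ the criterion function
    \[ q_1(\pi) = \bar{\mathbf{m}}_1(\pi \mid \boldsymbol{\theta}_1)' \left[ \text{Est.Asy.Var}\, \sqrt{[\pi T]}\, \bar{\mathbf{m}}_1(\pi \mid \boldsymbol{\theta}_1) \right]^{-1} \bar{\mathbf{m}}_1(\pi \mid \boldsymbol{\theta}_1) = \bar{\mathbf{m}}_1(\pi \mid \boldsymbol{\theta}_1)' [\mathbf{W}_1(\pi)]^{-1} \bar{\mathbf{m}}_1(\pi \mid \boldsymbol{\theta}_1), \]
    where
    \[ \bar{\mathbf{m}}_1(\pi \mid \boldsymbol{\theta}_1) = \frac{1}{[\pi T]} \sum_{t=1}^{[\pi T]} \mathbf{m}_t(y_t, \mathbf{x}_t \mid \boldsymbol{\theta}_1). \]
    The asymptotic covariance (weighting) matrix will generally be computed using a first round estimator in
    \[ \hat{\mathbf{W}}_1(\pi) = \frac{1}{[\pi T]} \sum_{t=1}^{[\pi T]} \mathbf{m}_t(\pi \mid \hat{\boldsymbol{\theta}}_1^0)\, \mathbf{m}_t'(\pi \mid \hat{\boldsymbol{\theta}}_1^0). \qquad \text{(7-21)} \]
    In this time-series setting, it would be natural to accommodate serial correlation in the estimator. Following Hall and Sen (1999), the counterpart to the Newey-West (1987a) estimator (see Section 11.3) would be
    \[ \hat{\mathbf{W}}_1(\pi) = \hat{\mathbf{W}}_{1,0}(\pi) + \sum_{j=1}^{B(T)} w_{j,T} \left[ \hat{\mathbf{W}}_{1,j}(\pi) + \hat{\mathbf{W}}_{1,j}'(\pi) \right], \]
    where $\hat{\mathbf{W}}_{1,0}(\pi)$ is given in (7-21) and
    \[ \hat{\mathbf{W}}_{1,j}(\pi) = \frac{1}{[\pi T]} \sum_{t=j+1}^{[\pi T]} \mathbf{m}_t(\pi \mid \hat{\boldsymbol{\theta}}_1^0)\, \mathbf{m}_{t-j}'(\pi \mid \hat{\boldsymbol{\theta}}_1^0). \]
    $B(T)$ is the bandwidth, chosen to be $O(T^{1/4})$ (this is the $L$ in (10-16) and (12-17)), and $w_{j,T}$ is the kernel. Newey and West's value for this is the Bartlett kernel, $[1 - j/(1 + B(T))]$. [See also Andrews (1991), Hayashi (2000, pp. 408-409), and the end of Section C.3.] The asymptotic covariance matrix for the GMM estimator would then be computed using
    \[ \text{Est.Asy.Var}[\hat{\boldsymbol{\theta}}_1(\pi)] = \frac{1}{[\pi T]} \left[ \bar{\mathbf{G}}_1'(\pi)\, \hat{\mathbf{W}}_1^{-1}(\pi)\, \bar{\mathbf{G}}_1(\pi) \right]^{-1} = \hat{\mathbf{V}}_1(\pi), \]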

    17 Andrews (1993), on which this discussion draws heavily, allows for some of the parameters to be assumed to be constant throughout the sample period. This adds some complication to the algebra involved in obtaining the estimator, since with this assumption, efficient estimation requires joint estimation of the parameter vectors, whereas our formulation allows GMM estimation to proceed with separate subsamples when needed. The essential results are the same.


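    where
    \[ \bar{\mathbf{G}}_1(\pi) = \frac{1}{[\pi T]} \sum_{t=1}^{[\pi T]} \frac{\partial \mathbf{m}_t(\pi \mid \hat{\boldsymbol{\theta}}_1(\pi))}{\partial \hat{\boldsymbol{\theta}}_1'(\pi)}. \]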
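The Newey-West style weighting matrix described above can be sketched in a few lines. This is an illustrative sketch only, assuming NumPy and an array of first-round moment vectors with one row per observation; the function names `bartlett_weight` and `newey_west_weighting` and the simulated moments are hypothetical, not taken from the text.

```python
import numpy as np

def bartlett_weight(j, bandwidth):
    """Bartlett kernel weight 1 - j / (1 + B(T))."""
    return 1.0 - j / (1.0 + bandwidth)

def newey_west_weighting(moments, bandwidth):
    """W_0 + sum_j w_j (W_j + W_j') from an (n x q) array of moment vectors."""
    m = np.asarray(moments, dtype=float)
    n = m.shape[0]
    W = m.T @ m / n                    # W_{1,0}: outer-product term, as in (7-21)
    for j in range(1, bandwidth + 1):
        Wj = m[j:].T @ m[:-j] / n      # (1/n) * sum over t > j of m_t m_{t-j}'
        W += bartlett_weight(j, bandwidth) * (Wj + Wj.T)
    return W

# Simulated moments (assumed values, for illustration only).
rng = np.random.default_rng(0)
n = 204
moments = rng.normal(size=(n, 2))
bandwidth = round(n ** 0.25)           # bandwidth of order T^(1/4)
W = newey_west_weighting(moments, bandwidth)
```

With a bandwidth of zero the function reduces to the outer-product estimator alone, and the Bartlett weights keep the resulting matrix positive semidefinite.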

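    Estimators for the second sample are found by changing the summations to $[\pi T] + 1, \ldots, T$ and for the full sample by summing from 1 to $T$.

    Still assuming that $\pi$ is known, the three standard test statistics for testing the null hypothesis of model constancy against the alternative of structural break at $[\pi T]$ would be as follows. The Wald statistic is
    \[ W_T(\pi) = [\hat{\boldsymbol{\theta}}_1(\pi) - \hat{\boldsymbol{\theta}}_2(\pi)]' \{ \hat{\mathbf{V}}_1(\pi) + \hat{\mathbf{V}}_2(\pi) \}^{-1} [\hat{\boldsymbol{\theta}}_1(\pi) - \hat{\boldsymbol{\theta}}_2(\pi)]. \]
    [See Andrews and Fair (1988).] There is a small complication with this result in this time-series context. The two subsamples are generally not independent, so the additive result above is not quite appropriate. Asymptotically, the number of observations close to the switch point, if there is one, becomes small, so this is only a finite sample problem. The likelihood ratio-like statistic would be
    \[ LR_T(\pi) = -[q_1(\pi \mid \hat{\boldsymbol{\theta}}_1) + q_2(\pi \mid \hat{\boldsymbol{\theta}}_2)] + [q_1(\pi \mid \hat{\boldsymbol{\theta}}) + q_2(\pi \mid \hat{\boldsymbol{\theta}})], \]
    where $\hat{\boldsymbol{\theta}}$ is based on the full sample. (This result makes use of our assumption that there are no common parameters, so that the criterion for the full sample is the sum of those for the subsamples. With common parameters, it becomes slightly more complicated.) The Lagrange multiplier statistic is the most convenient of the three. All matrices with subscript $T$ are based on the full sample GMM estimator. The weighting and derivative matrices are computed using the full sample. The moment equation is computed at the first subsample [though the sum is divided by $T$, not $[\pi T]$; see Andrews (1993, eqn. (4.4))]:
    \[ LM_T(\pi) = \frac{T}{\pi(1 - \pi)}\, \bar{\mathbf{m}}_1(\pi \mid \hat{\boldsymbol{\theta}}_T)' \hat{\mathbf{V}}_T^{-1} \bar{\mathbf{G}}_T \left[ \bar{\mathbf{G}}_T' \hat{\mathbf{V}}_T^{-1} \bar{\mathbf{G}}_T \right]^{-1} \bar{\mathbf{G}}_T' \hat{\mathbf{V}}_T^{-1} \bar{\mathbf{m}}_1(\pi \mid \hat{\boldsymbol{\theta}}_T). \]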

    The LM statistic is simpler, as it requires the model only to be estimated once, using the full sample. (Of course, this is a minor virtue. The computations for the full sample and the subsamples are the same, so the same amount of setup is required either way.) In each case, the statistic has a limiting chi-squared distribution with $K$ degrees of freedom, where $K$ is the number of parameters in the model.

    Since $\pi$ is unknown, the preceding does not solve the problem posed at the outset. The CUSUMS and Hansen tests discussed in Section 7.5 were proposed for that purpose, but lack power and are generally for linear regression models. Andrews (1993) has derived the behavior of the test statistic obtained by computing the statistics suggested previously at the range of candidate values, that is, the different partitionings of the sample, say $\pi_0 = .15$ to $.85$, then retaining the maximum value obtained. These are the Sup $W_T(\pi)$, Sup $LR_T(\pi)$, and Sup $LM_T(\pi)$, respectively. Although for a given $\pi$ the statistics have limiting chi-squared distributions, obviously, the maximum does not. Tables of critical values obtained by Monte Carlo methods are provided in Andrews (1993). An interesting side calculation in the process is to plot the values of the test statistics. (See the following application.) Two alternatives to the supremum test are suggested by Andrews and Ploberger (1994) and Sowell (1996). The average statistics,


    Avg $W_T(\pi)$, Avg $LR_T(\pi)$, and Avg $LM_T(\pi)$, are computed by taking the sample average of the sequence of values over the $R$ partitions of the sample from $\pi = \pi_0$ to $\pi = 1.0 - \pi_0$. The exponential statistics are computed as
    \[ \text{Exp}\, W_T(\pi) = \ln \left[ \frac{1}{R} \sum_{r=1}^{R} \exp(.5\, W_T(\pi_r)) \right] \]
    and likewise for the LM and LR statistics. Tables of critical values for a range of values of $\pi_0$ and $K$ are provided by the authors.^18

    Not including the Hall and Sen approaches, the preceding provides nine different statistics for testing the hypothesis of parameter constancy, though Andrews and Ploberger (1994) suggest that the Exp LR and Avg LR versions are less than optimal. As the authors note, all are based on statistics which converge to chi-squared statistics. Andrews and Ploberger present some results to suggest that the exponential form may be preferable based on its power characteristics.

    In principle, the preceding suggests a maximum likelihood estimator of $\pi$ (or $T_1$) if ML is used as the estimation method. Properties of the estimator are difficult to obtain, as shown in Bai (1997). Moreover, Bai's (1997) study based on least squares estimation of a linear model includes some surprising results that suggest that in the presence of multiple change points in a sample, the outcome of the Andrews and Ploberger tests may depend crucially on what time interval is examined.^19
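Once the sequence of statistics has been computed over the candidate break points, the three summary forms follow directly. The sketch below uses only the Python standard library; the function name `summarize_break_statistics` and the input values are hypothetical, and the critical values still have to come from the published tables.

```python
import math

def summarize_break_statistics(stats):
    """Return (sup, avg, exp) summaries of statistics at R candidate break points."""
    values = list(stats)
    if not values:
        raise ValueError("need at least one candidate break point")
    R = len(values)
    sup_stat = max(values)
    avg_stat = sum(values) / R
    # Exp form: ln[(1/R) * sum_r exp(0.5 * W_r)], computed stably by
    # factoring the largest exponent out of the sum.
    half_max = 0.5 * sup_stat
    scaled_sum = sum(math.exp(0.5 * v - half_max) for v in values)
    exp_stat = half_max + math.log(scaled_sum / R)
    return sup_stat, avg_stat, exp_stat

# Hypothetical sequence of Wald statistics, one per candidate break point.
wald_sequence = [2.0, 4.0, 6.0]
sup_w, avg_w, exp_w = summarize_break_statistics(wald_sequence)
```

The position of the maximum in the sequence is also of interest, since it indicates the candidate break date that the data favor most strongly.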
    Example 7.7 Instability of the Demand for Money

    We will examine the demand for money in some detail in Chapters 19 and 20. At this point, we will take a cursory look at a simple (and questionable) model
    \[ (m - p)_t = \mu + \beta y_t + \gamma i_t + \varepsilon_t, \]
    where $m$, $p$, and $y$ are the logs of the money supply (M1), the price level (CPI_U), and GDP, respectively, and $i$ is the interest rate (90-day T-bill rate) in our data set. Quarterly data on these and several other macroeconomic variables are given in Appendix F5.1 for the quarters 1950.1 to 2000.4. We will apply the techniques described above to this money demand equation. The data span 204 quarters. We chose a window from 1957.3 (quarter 30) to 1993.3 (quarter 175), which correspond roughly to $\pi = .15$ to $\pi = .85$. The function is estimated by GMM using as instruments $\mathbf{z}_t = [1, i_t, i_{t-1}, y_{t-1}, y_{t-2}]$. We will use a Newey-West estimator for the weighting matrix with $L = 204^{1/4} \approx 4$, so we will lose 4 additional
    18 An extension of the Andrews and Ploberger methods based on the overidentifying restrictions in the GMM estimator is developed in Hall and Sen (1999). Approximations to the critical values are given by Hansen (1997). Further results are given in Hansen (2000).

    19 Bai (1991), Bai, Lumsdaine and Stock (1999), Bai and Perron (1998a,b), and Bai (1997). "Estimation" of $\pi$ or $T_1$ raises a peculiarity of this strand of literature. In many applications, the notion of a change point is tied to an historical event, such as a war or a major policy shift. For example, in Bai (1997, p. 557), a structural change in an estimated model of the relationship between T-bill rates and the Fed's discount rate is associated with a specific date, October 9, 1979, a date which marked the beginning of a change in Fed operating procedures. A second change date in his sample was associated with the end of that Fed policy regime, while a third, between these two, had no obvious identity. In such a case, the idea of a fixed $\pi$ requires some careful thought as to what is meant by $T \to \infty$. If the sampling process is defined to have a true origin in a physical history, wherever it is, then $\pi$ cannot be fixed. As $T$ increases, $\pi$ must decline to zero and "estimation" of $\pi$ makes no sense. Alternatively, if $\pi$ really is meant to denote a specific proportion of the sample, but remains tied to an actual date, then presumably, increasing the sample size means shifting both origin and terminal in opposite directions, at the same rate. Otherwise, insisting that the regime switch occur at time $\pi T$ has an implausible economic implication. Changing the orientation of the search to the change date, $T_1$, itself, does not remove the ambiguities. We leave the philosophical resolution of either interpretation to the reader. Andrews' (1993, p. 845) assessment of the situation is blunt: "[n]o optimality properties are known for the ML estimator of $\pi$."


    TABLE 7.7 Results of Model Stability Tests

    Statistic         Maximum      Average     Average exp
    LM                10.43        4.42        3.31
    Wald              11.85        4.57        3.67
    LR                15.69
    Critical Value    14.15 (a)    4.22 (b)    6.07 (c)

    (a) Andrews (1993), Table I, p = 3, $\pi_0$ = 0.15.
    (b) Andrews and Ploberger (1994), Table II, p = 3, $\pi_0$ = 0.15.
    (c) Andrews and Ploberger (1994), Table I, p = 3, $\pi_0$ = 0.15.

    observations after the two lagged values in the instruments. Thus, the estimation sample is 1951.3 to 2000.4, a total of 197 observations. The GMM estimator is precisely the instrumental variables estimator shown in Chapter 5. The estimated equation (with standard errors shown in parentheses) is
    \[ (m - p)_t = \underset{(0.166)}{-1.824} + \underset{(0.0216)}{0.306}\, y_t - \underset{(0.00252)}{0.0218}\, i_t + e_t. \]
    The Lagrange multiplier form of the test is particularly easy to carry out in this framework. The sample moment equations are
    \[ E[\bar{\mathbf{m}}_T] = E\left[ \frac{1}{T} \sum_{t=1}^{T} \mathbf{z}_t (y_t - \mathbf{x}_t'\boldsymbol{\beta}) \right] = \mathbf{0}. \]
    The derivative matrix is likewise simple: $\bar{\mathbf{G}} = -(1/T)\mathbf{Z}'\mathbf{X}$. The results of the various testing procedures are shown in Table 7.7. The results are mixed; some of the statistics reject the hypothesis while others do not. Figure 7.7 shows the sequence of test statistics. The three are quite consistent. If there is a structural break in these data, it occurs in the late 1970s. These results coincide with Bai's findings discussed in the preceding footnote.
    FIGURE 7.7 Structural Change Test Statistics. [Plot: sequence of test statistics for structural break (series LMSTATS, WALDS, LRSTATS); vertical axis TestStat, 0 to 20; horizontal axis Quarter, 1949 to 2001.]


    7.6 SUMMARY AND CONCLUSIONS

    This chapter has discussed the functional form of the regression model. We examined the use of dummy variables and other transformations to build nonlinearity into the model. We then considered other nonlinear models in which the parameters of the nonlinear model could be recovered from estimates obtained for a linear regression. The final sections of the chapter described hypothesis tests designed to reveal whether the assumed model had changed during the sample period, or was different for different groups of observations. These tests rely on information about when (or how) the sample is to be partitioned for the test. In many time series cases, this is unknown. Tests designed for this more complex case were considered in Section 7.5.4.

    Key Terms and Concepts

    Binary variable; Chow test; CUSUM test; Dummy variable; Dummy variable trap; Exactly identified; Hansen's test; Identification condition; Interaction term; Intrinsically linear; Knots; Loglinear model; Marginal effect; Nonlinear restriction; One step ahead prediction error; Overidentified; Piecewise continuous; Predictive test; Qualification indices; Recursive residual; Response; Semilog model; Spline; Structural change; Threshold effect; Time profile; Treatment; Wald test

    Exercises

    1. In Solow's classic (1957) study of technical change in the U.S. economy, he suggests the following aggregate production function: $q(t) = A(t) f[k(t)]$, where $q(t)$ is aggregate output per work hour, $k(t)$ is the aggregate capital labor ratio, and $A(t)$ is the technology index. Solow considered four static models, $q/A = \alpha + \beta \ln k$, $q/A = \alpha - \beta/k$, $\ln(q/A) = \alpha + \beta \ln k$, and $\ln(q/A) = \alpha + \beta/k$. Solow's data for the years 1909 to 1949 are listed in Appendix Table F7.2. Use these data to estimate the $\alpha$ and $\beta$ of the four functions listed above. [Note: Your results will not quite match Solow's. See the next exercise for resolution of the discrepancy.]

    2. In the aforementioned study, Solow states: "A scatter of q/A against k is shown in Chart 4. Considering the amount of a priori doctoring which the raw figures have undergone, the fit is remarkably tight. Except, that is, for the layer of points which are obviously too high. These maverick observations relate to the seven last years of the period, 1943-1949. From the way they lie almost exactly parallel to the main scatter, one is tempted to conclude that in 1943 the aggregate production function simply shifted."
       a. Compute a scatter diagram of $q/A$ against $k$.
       b. Estimate the four models you estimated in the previous problem including a dummy variable for the years 1943 to 1949. How do your results change? [Note: These results match those reported by Solow, although he did not report the coefficient on the dummy variable.]
       c. Solow went on to surmise that, in fact, the data were fundamentally different in the years before 1943 than during and after. Use a Chow test to examine the difference in the two subperiods using your four functional forms. Note that with the dummy variable, you can do the test by introducing an interaction term between the dummy and whichever function of $k$ appears in the regression. Use an F test to test the hypothesis.

    3. A regression model with K = 16 independent variables is fit using a panel of seven years of data. The sums of squares for the seven separate regressions and the pooled regression are shown below. The model with the pooled data allows a separate constant for each year. Test the hypothesis that the same coefficients apply in every year.

                     1954   1955   1956   1957   1958   1959   1960   All
       Observations    65     55     87     95    103     87     78    570
       e'e            104     88    206    144    199    308    211   1425

    4. Reverse regression. A common method of analyzing statistical data to detect discrimination in the workplace is to fit the regression
       \[ y = \alpha + \mathbf{x}'\boldsymbol{\beta} + \gamma d + \varepsilon, \qquad (1) \]
       where $y$ is the wage rate and $d$ is a dummy variable indicating either membership ($d = 1$) or nonmembership ($d = 0$) in the class toward which it is suggested the discrimination is directed. The regressors $\mathbf{x}$ include factors specific to the particular type of job as well as indicators of the qualifications of the individual. The hypothesis of interest is $H_0\colon \gamma \ge 0$ versus $H_1\colon \gamma < 0$. The regression seeks to answer the question, "In a given job, are individuals in the class ($d = 1$) paid less than equally qualified individuals not in the class ($d = 0$)?" Consider an alternative approach. Do individuals in the class in the same job as others, and receiving the same wage, uniformly have higher qualifications? If so, this might also be viewed as a form of discrimination. To analyze this question, Conway and Roberts (1983) suggested the following procedure:
       1. Fit (1) by ordinary least squares. Denote the estimates $a$, $\mathbf{b}$, and $c$.
       2. Compute the set of qualification indices,
          \[ \mathbf{q} = a\mathbf{i} + \mathbf{X}\mathbf{b}. \qquad (2) \]
          Note the omission of $c\mathbf{d}$ from the fitted value.
       3. Regress $\mathbf{q}$ on a constant, $\mathbf{y}$, and $\mathbf{d}$. The equation is
          \[ \mathbf{q} = \alpha^* + \beta^* \mathbf{y} + \gamma^* \mathbf{d} + \boldsymbol{\varepsilon}^*. \qquad (3) \]
       The analysis suggests that if $\gamma < 0$, $\gamma^* > 0$.
       a. Prove that, the theory notwithstanding, the least squares estimates $c$ and $c^*$ are related by
          \[ c^* = \frac{(\bar{y}_1 - \bar{y})(1 - R^2)}{(1 - P)(1 - r_{yd}^2)} - c, \qquad (4) \]
          where
          $\bar{y}_1$ = mean of $y$ for observations with $d = 1$,
          $\bar{y}$ = mean of $y$ for all observations,
          $P$ = mean of $d$,
          $R^2$ = coefficient of determination for (1),
          $r_{yd}^2$ = squared correlation between $y$ and $d$.
          [Hint: The model contains a constant term. Thus, to simplify the algebra, assume that all variables are measured as deviations from the overall sample means and use a partitioned regression to compute the coefficients in (3). Second, in (2), use the result that based on the least squares results, $\mathbf{y} = a\mathbf{i} + \mathbf{X}\mathbf{b} + c\mathbf{d} + \mathbf{e}$, so $\mathbf{q} = \mathbf{y} - c\mathbf{d} - \mathbf{e}$. From here on, we drop the constant term. Thus, in the regression in (3) you are regressing $[\mathbf{y} - c\mathbf{d} - \mathbf{e}]$ on $\mathbf{y}$ and $\mathbf{d}$.]
       b. Will the sample evidence necessarily be consistent with the theory? [Hint: Suppose that $c = 0$.]
       A symposium on the Conway and Roberts paper appeared in the Journal of Business and Economic Statistics in April 1983.

    5. Reverse regression continued. This and the next exercise continue the analysis of Exercise 4. In Exercise 4, interest centered on a particular dummy variable in which the regressors were accurately measured. Here we consider the case in which the crucial regressor in the model is measured with error. The paper by Kamlich and Polachek (1982) is directed toward this issue. Consider the simple errors in the variables model,
       \[ y = \alpha + \beta x^* + \varepsilon, \qquad x = x^* + u, \]
       where $u$ and $\varepsilon$ are uncorrelated and $x$ is the erroneously measured, observed counterpart to $x^*$.
       a. Assume that $x^*$, $u$, and $\varepsilon$ are all normally distributed with means $\mu^*$, 0, and 0, variances $\sigma_*^2$, $\sigma_u^2$, and $\sigma_\varepsilon^2$, and zero covariances. Obtain the probability limits of the least squares estimators of $\alpha$ and $\beta$.
       b. As an alternative, consider regressing $x$ on a constant and $y$, and then computing the reciprocal of the estimate. Obtain the probability limit of this estimator.
       c. Do the "direct" and "reverse" estimators bound the true coefficient?

    6. Reverse regression continued. Suppose that the model in Exercise 5 is extended to $y = \beta x^* + \gamma d + \varepsilon$, $x = x^* + u$. For convenience, we drop the constant term. Assume that $x^*$, $\varepsilon$, and $u$ are independent normally distributed with zero means. Suppose that $d$ is a random variable that takes the values one and zero with probabilities $\pi$ and $1 - \pi$ in the population and is independent of all other variables in the model. To put this formulation in context, the preceding model (and variants of it) have appeared in the literature on discrimination. We view $y$ as a "wage" variable, $x^*$ as "qualifications," and $x$ as some imperfect measure such as education. The dummy variable $d$ is membership ($d = 1$) or nonmembership ($d = 0$) in some protected class. The hypothesis of discrimination turns on $\gamma < 0$ versus $\gamma \ge 0$.
       a. What is the probability limit of $c$, the least squares estimator of $\gamma$, in the least squares regression of $y$ on $x$ and $d$? [Hints: The independence of $x^*$ and $d$ is important. Also, $\text{plim}\ \mathbf{d}'\mathbf{d}/n = \text{Var}[d] + E^2[d] = \pi(1 - \pi) + \pi^2 = \pi$. This minor modification does not affect the model substantively, but it greatly simplifies the algebra.] Now suppose that $x^*$ and $d$ are not independent. In particular, suppose that $E[x^* \mid d = 1] = \mu_1$ and $E[x^* \mid d = 0] = \mu_0$. Repeat the derivation with this assumption.
       b. Consider, instead, a regression of $x$ on $y$ and $d$. What is the probability limit of the coefficient on $d$ in this regression? Assume that $x^*$ and $d$ are independent.
       c. Suppose that $x^*$ and $d$ are not independent, but $\gamma$ is, in fact, less than zero. Assuming that both preceding equations still hold, what is estimated by $(\bar{y} \mid d = 1) - (\bar{y} \mid d = 0)$? What does this quantity estimate if $\gamma$ does equal zero?

    7. Data on the number of incidents of damage to a sample of ships, with the type of ship and the period when it was constructed, are given in Table 7.8. There are five types of ships and four different periods of construction. Use F tests and dummy variable regressions to test the hypothesis that there is no significant "ship type effect" in the expected number of incidents. Now, use the same procedure to test whether there is a significant "period effect."

    TABLE 7.8 Ship Damage Incidents

                   Period Constructed
    Ship Type   1960-1964   1965-1969   1970-1974   1975-1979
    A                0           4          18          11
    B               29          53          44          18
    C                1           1           2           1
    D                0           0          11           4
    E                0           7          12           1

    Source: Data from McCullagh and Nelder (1983, p. 137).

    Greene 50240

    book

    June 11 2002

    18 49

    8 SPECIFICATION ANALYSIS AND MODEL SELECTION

    8.1 INTRODUCTION

    Chapter 7 presented results which were primarily focused on sharpening the functional form of the model. Functional form and hypothesis testing are directed toward improving the specification of the model or using that model to draw generally narrow inferences about the population. In this chapter we turn to some broader techniques that relate to choosing a specific model when there is more than one competing candidate. Section 8.2 describes some larger issues related to the use of the multiple regression model, specifically the impacts of an incomplete or excessive specification on estimation and inference. Sections 8.3 and 8.4 turn to the broad question of statistical methods for choosing among alternative models.

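    8.2 SPECIFICATION ANALYSIS AND MODEL BUILDING

    Our analysis has been based on the assumption that the correct specification of the regression model is known to be
    \[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}. \qquad \text{(8-1)} \]
    There are numerous types of errors that one might make in the specification of the estimated equation. Perhaps the most common ones are the omission of relevant variables and the inclusion of superfluous variables.

    8.2.1 BIAS CAUSED BY OMISSION OF RELEVANT VARIABLES

    Suppose that a correctly specified regression model would be
    \[ \mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}, \qquad \text{(8-2)} \]
    where the two parts of $\mathbf{X}$ have $K_1$ and $K_2$ columns, respectively. If we regress $\mathbf{y}$ on $\mathbf{X}_1$ without including $\mathbf{X}_2$, then the estimator is
    \[ \mathbf{b}_1 = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y} = \boldsymbol{\beta}_1 + (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2\boldsymbol{\beta}_2 + (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\boldsymbol{\varepsilon}. \qquad \text{(8-3)} \]
    Taking the expectation, we see that unless $\mathbf{X}_1'\mathbf{X}_2 = \mathbf{0}$ or $\boldsymbol{\beta}_2 = \mathbf{0}$, $\mathbf{b}_1$ is biased. The well-known result is the omitted variable formula:
    \[ E[\mathbf{b}_1 \mid \mathbf{X}] = \boldsymbol{\beta}_1 + \mathbf{P}_{1.2}\boldsymbol{\beta}_2, \qquad \text{(8-4)} \]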


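    where
    \[ \mathbf{P}_{1.2} = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2. \qquad \text{(8-5)} \]
    Each column of the $K_1 \times K_2$ matrix $\mathbf{P}_{1.2}$ is the column of slopes in the least squares regression of the corresponding column of $\mathbf{X}_2$ on the columns of $\mathbf{X}_1$.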
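The decomposition behind the omitted variable formula holds exactly in any sample, which makes it easy to verify numerically. The following is a minimal sketch, assuming NumPy and simulated data; all variable names and parameter values are illustrative assumptions rather than values from the text.

```python
import numpy as np

# Simulated data (assumed values, for illustration only).
rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)        # omitted regressor, correlated with x1
X1 = np.column_stack([np.ones(n), x1])    # included regressors (K1 = 2)
X2 = x2.reshape(-1, 1)                    # omitted regressor (K2 = 1)
beta1 = np.array([1.0, -0.5])
beta2 = np.array([2.0])
eps = rng.normal(size=n)
y = X1 @ beta1 + X2 @ beta2 + eps

XtX_inv = np.linalg.inv(X1.T @ X1)
b1 = XtX_inv @ X1.T @ y                   # short regression of y on X1 only
P12 = XtX_inv @ X1.T @ X2                 # slopes from regressing X2 on X1
noise_term = XtX_inv @ X1.T @ eps         # third term of the decomposition
decomposition = beta1 + P12 @ beta2 + noise_term
```

Here `b1` and `decomposition` coincide up to rounding error. Averaging over repeated draws of `eps` would leave only the first two terms, which is the bias result.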
    Example 8.1 Omitted Variables

    If a demand equation is estimated without the relevant income variable, then (8-4) shows how the estimated price elasticity will be biased. Letting $b$ be the estimator, we obtain
    \[ E[b \mid \text{price}, \text{income}] = \beta + \frac{\text{Cov}[\text{price}, \text{income}]}{\text{Var}[\text{price}]}\, \gamma, \]
    where $\gamma$ is the income coefficient. In aggregate data, it is unclear whether the missing covariance would be positive or negative. The sign of the bias in $b$ would be the same as this covariance, however, because Var[price] and $\gamma$ would be positive.

    The gasoline market data we have examined in Examples 2.3 and 7.6 provide a striking example. Figure 7.5 showed a simple plot of per capita gasoline consumption, G/pop, against the price index PG. The plot is considerably at odds with what one might expect. But a look at the data in Appendix Table F2.2 shows clearly what is at work. Holding per capita income, I/pop, and other prices constant, these data might well conform to expectations. In these data, however, income is persistently growing, and the simple correlations between G/pop and I/pop and between PG and I/pop are 0.86 and 0.58, respectively, which are quite large. To see if the expected relationship between price and consumption shows up, we will have to purge our data of the intervening effect of I/pop. To do so, we rely on the Frisch-Waugh result in Theorem 3.3. The regression results appear in Table 7.6. The first column shows the full regression model, with ln PG, log Income, and several other variables. The estimated demand elasticity is -0.11553, which conforms with expectations. If income is omitted from this equation, the estimated price elasticity is +0.074499, which has the wrong sign, but is what we would expect given the theoretical results above.

    In this development, it is straightforward to deduce the directions of bias when there is a single included variable and one omitted variable. It is important to note, however, that if more than one variable is included, then the terms in the omitted variable formula involve multiple regression coefficients, which themselves have the signs of partial, not simple, correlations. For example, in the demand equation of the previous example, if the price of a closely related product had been included as well, then the simple correlation between price and income would be insufficient to determine the direction of the bias in the price elasticity. What would be required is the sign of the correlation between price and income net of the effect of the other price. This requirement might not be obvious, and it would become even less so as more regressors were added to the equation.
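    8.2.2 PRETEST ESTIMATION

    The variance of $\mathbf{b}_1$ is that of the third term in (8-3), which is
    \[ \text{Var}[\mathbf{b}_1 \mid \mathbf{X}] = \sigma^2(\mathbf{X}_1'\mathbf{X}_1)^{-1}. \qquad \text{(8-6)} \]
    If we had computed the correct regression, including $\mathbf{X}_2$, then the slopes on $\mathbf{X}_1$ would have been unbiased and would have had a covariance matrix equal to the upper left block of $\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$. This matrix is
    \[ \text{Var}[\mathbf{b}_{1.2} \mid \mathbf{X}] = \sigma^2(\mathbf{X}_1'\mathbf{M}_2\mathbf{X}_1)^{-1}, \qquad \text{(8-7)} \]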


    where
    \[ \mathbf{M}_2 = \mathbf{I} - \mathbf{X}_2(\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2', \]
    or
    \[ \text{Var}[\mathbf{b}_{1.2} \mid \mathbf{X}] = \sigma^2[\mathbf{X}_1'\mathbf{X}_1 - \mathbf{X}_1'\mathbf{X}_2(\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'\mathbf{X}_1]^{-1}. \]
    We can compare the covariance matrices of $\mathbf{b}_1$ and $\mathbf{b}_{1.2}$ more easily by comparing their inverses [see result (A-120)]:
    \[ \text{Var}[\mathbf{b}_1 \mid \mathbf{X}]^{-1} - \text{Var}[\mathbf{b}_{1.2} \mid \mathbf{X}]^{-1} = (1/\sigma^2)\,\mathbf{X}_1'\mathbf{X}_2(\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'\mathbf{X}_1, \]
    which is nonnegative definite. We conclude that although $\mathbf{b}_1$ is biased, its variance is never larger than that of $\mathbf{b}_{1.2}$ (since the inverse of its variance is at least as large).

    Suppose, for instance, that $\mathbf{X}_1$ and $\mathbf{X}_2$ are each a single column and that the variables are measured as deviations from their respective means. Then
    \[ \text{Var}[b_1 \mid \mathbf{X}] = \frac{\sigma^2}{s_{11}}, \quad \text{where } s_{11} = \sum_{i=1}^{n} (x_{i1} - \bar{x}_1)^2, \]
    whereas
    \[ \text{Var}[b_{1.2} \mid \mathbf{X}] = \sigma^2[\mathbf{x}_1'\mathbf{x}_1 - \mathbf{x}_1'\mathbf{x}_2(\mathbf{x}_2'\mathbf{x}_2)^{-1}\mathbf{x}_2'\mathbf{x}_1]^{-1} = \frac{\sigma^2}{s_{11}(1 - r_{12}^2)}, \qquad \text{(8-8)} \]
    where
    \[ r_{12}^2 = \frac{(\mathbf{x}_1'\mathbf{x}_2)^2}{\mathbf{x}_1'\mathbf{x}_1\, \mathbf{x}_2'\mathbf{x}_2} \]
    is the squared sample correlation between $\mathbf{x}_1$ and $\mathbf{x}_2$. The more highly correlated $\mathbf{x}_1$ and $\mathbf{x}_2$ are, the larger is the variance of $b_{1.2}$ compared with that of $b_1$. Therefore, it is possible that $b_1$ is a more precise estimator based on the mean-squared error criterion.

    The result in the preceding paragraph poses a bit of a dilemma for applied researchers. The situation arises frequently in the search for a model specification. Faced with a variable that a researcher suspects should be in their model, but which is causing a problem of collinearity, the analyst faces a choice of omitting the relevant variable or including it and estimating its (and all the other variables') coefficient imprecisely. This presents a choice between two estimators, $b_1$ and $b_{1.2}$. In fact, what researchers usually do actually creates a third estimator. It is common to include the problem variable provisionally. If its t ratio is sufficiently large, it is retained; otherwise it is discarded. This third estimator is called a pretest estimator. What is known about pretest estimators is not encouraging. Certainly they are biased. How badly depends on the unknown parameters. Analytical results suggest that the pretest estimator is the least precise of the three when the researcher is most likely to use it. [See Judge et al. (1985).]
    8.2.3 INCLUSION OF IRRELEVANT VARIABLES

    If the regression model is correctly given by
    \[ \mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \boldsymbol{\varepsilon} \qquad \text{(8-9)} \]


    and we estimate it as if (8-2) were correct (i.e., we include some extra variables), then it might seem that the same sorts of problems considered earlier would arise. In fact, this case is not true. We can view the omission of a set of relevant variables as equivalent to imposing an incorrect restriction on (8-2). In particular, omitting $\mathbf{X}_2$ is equivalent to incorrectly estimating (8-2) subject to the restriction $\boldsymbol{\beta}_2 = \mathbf{0}$. As we discovered, incorrectly imposing a restriction produces a biased estimator. Another way to view this error is to note that it amounts to incorporating incorrect information in our estimation. Suppose, however, that our error is simply a failure to use some information that is correct.

    The inclusion of the irrelevant variables $\mathbf{X}_2$ in the regression is equivalent to failing to impose $\boldsymbol{\beta}_2 = \mathbf{0}$ on (8-2) in estimation. But (8-2) is not incorrect; it simply fails to incorporate $\boldsymbol{\beta}_2 = \mathbf{0}$. Therefore, we do not need to prove formally that the least squares estimator of $\boldsymbol{\beta}$ in (8-2) is unbiased even given the restriction; we have already proved it. We can assert on the basis of all our earlier results that
    \[ E[\mathbf{b} \mid \mathbf{X}] = \begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix} = \begin{bmatrix} \boldsymbol{\beta}_1 \\ \mathbf{0} \end{bmatrix}. \qquad \text{(8-10)} \]
    By the same reasoning, $s^2$ is also unbiased:
    \[ E\left[ \frac{\mathbf{e}'\mathbf{e}}{n - K_1 - K_2} \,\Big|\, \mathbf{X} \right] = \sigma^2. \qquad \text{(8-11)} \]

    Then where is the problem? It would seem that one would generally want to "overfit" the model. From a theoretical standpoint, the difficulty with this view is that the failure to use correct information is always costly. In this instance, the cost is the reduced precision of the estimates. As we have shown, the covariance matrix in the short regression (omitting $\mathbf{X}_2$) is never larger than the covariance matrix for the estimator obtained in the presence of the superfluous variables.^1 Consider again the single-variable comparison given earlier. If $x_2$ is highly correlated with $x_1$, then incorrectly including it in the regression will greatly inflate the variance of the estimator.
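The cost of including a correlated but superfluous regressor can be illustrated with the single-variable comparison. This is a sketch only, assuming NumPy and simulated regressors; the function name `variance_factors` and the data values are hypothetical. It computes the two variance factors, $1/s_{11}$ and $1/[s_{11}(1 - r_{12}^2)]$, and cross-checks the second against the upper left element of the inverse moment matrix.

```python
import numpy as np

def variance_factors(x1, x2):
    """Return Var[b1]/sigma^2 and Var[b1.2]/sigma^2 for single-column regressors."""
    d1 = x1 - x1.mean()                  # deviations from means
    d2 = x2 - x2.mean()
    s11 = d1 @ d1
    r12_sq = (d1 @ d2) ** 2 / (s11 * (d2 @ d2))
    return 1.0 / s11, 1.0 / (s11 * (1.0 - r12_sq))

# Simulated regressors (assumed values, for illustration only).
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=100)   # strongly correlated with x1
var_short, var_long = variance_factors(x1, x2)
inflation = var_long / var_short             # equals 1 / (1 - r12^2)

# Cross-check against the upper left element of (X'X)^{-1} in deviation form.
D = np.column_stack([x1 - x1.mean(), x2 - x2.mean()])
upper_left = np.linalg.inv(D.T @ D)[0, 0]
```

The ratio `inflation` grows without bound as the squared correlation approaches one, which is the sense in which including a highly correlated irrelevant variable is costly.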
    8.2.4 MODEL BUILDING: A GENERAL TO SIMPLE STRATEGY

    There has been a shift in the general approach to model building in the last 20 years or so, partly based on the results in the previous two sections. With an eye toward maintaining simplicity, model builders would generally begin with a small specification and gradually build up the model ultimately of interest by adding variables. But, based on the preceding results, we can surmise that just about any criterion that would be used to decide whether to add a variable to a current specification would be tainted by the biases caused by the incomplete specification at the early steps. Omitting variables from the equation seems generally to be the worse of the two errors. Thus, the simple-to-general approach to model building has little to recommend it. Building on the work of Hendry [e.g., (1995)] and aided by advances in estimation hardware and software, researchers are now more comfortable beginning their specification searches with large elaborate models

    [1] There is no loss if $X_1'X_2 = 0$, which makes sense in terms of the information about $X_1$ contained in $X_2$ (here, none). This situation is not likely to occur in practice, however.


    involving many variables and perhaps long and complex lag structures. The attractive strategy is then to adopt a general-to-simple, downward reduction of the model to the preferred specification. Of course, this must be tempered by two related considerations. First, in the "kitchen sink" regression, which contains every variable that might conceivably be relevant, the adoption of a fixed probability for the type I error, say 5 percent, assures that in a big enough model, some variables will appear to be significant, even if "by accident." Second, the problems of pretest estimation and stepwise model building also pose some risk of ultimately misspecifying the model. To cite one unfortunately common example, the statistics involved often produce unexplainable lag structures in dynamic models with many lags of the dependent or independent variables.
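The first consideration is easy to quantify under a simplifying assumption that the text does not make: that the significance tests on the irrelevant variables are independent. In that case the probability that at least one of m irrelevant variables appears significant at level alpha is $1 - (1 - \alpha)^m$. The function name below is hypothetical.

```python
# Sketch: chance of at least one spurious "significant" variable in a
# kitchen sink regression, ASSUMING the m individual tests are independent.

def prob_any_false_positive(m, alpha=0.05):
    """Probability that at least one of m irrelevant variables is flagged."""
    return 1.0 - (1.0 - alpha) ** m

for m in (1, 5, 20, 50):
    print(m, round(prob_any_false_positive(m), 3))
```

In practice the test statistics are correlated, so the figures are only indicative, but the qualitative point stands: the probability rises quickly with the number of candidate variables.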

    8.3 CHOOSING BETWEEN NONNESTED MODELS

    The classical testing procedures that we have been using have been shown to be most powerful for the types of hypotheses we have considered.[2] Although use of these procedures is clearly desirable, the requirement that we express the hypotheses in the form of restrictions on the model $y = X\beta + \varepsilon$,

    $$H_0: R\beta = q \quad \text{versus} \quad H_1: R\beta \neq q,$$

    can be limiting. Two common exceptions are the general problem of determining which of two possible sets of regressors is more appropriate and whether a linear or loglinear model is more appropriate for a given analysis. For the present, we are interested in comparing two competing linear models:

    $$H_0: y = X\beta + \varepsilon_0 \tag{8-12a}$$

    and

    $$H_1: y = Z\gamma + \varepsilon_1. \tag{8-12b}$$

    The classical procedures we have considered thus far provide no means of forming a preference for one model or the other. The general problem of testing nonnested hypotheses such as these has attracted an impressive amount of attention in the theoretical literature and has appeared in a wide variety of empirical applications.[3] Before turning to classical (frequentist) based tests in this setting, we should note that the Bayesian approach to this question might be more intellectually appealing. Our procedures will continue to be directed toward an objective of rejecting one model in favor of the other. Yet, in fact, if we have doubts as to which of two models is appropriate, then we might well be convinced to concede that possibly neither one is really "the truth." We have rather painted ourselves into a corner with our "left or right"
    [2] See, for example, Stuart and Ord (1989, Chap. 27).

    [3] Recent surveys on this subject are White (1982a, 1983), Gourieroux and Monfort (1994), McAleer (1995), and Pesaran and Weeks (2001). McAleer's survey tabulates an array of applications, while Gourieroux and Monfort focus on the underlying theory.


    approach. The Bayesian approach to this question treats it as a problem of comparing the two hypotheses rather than testing for the validity of one over the other. We enter our sampling experiment with a set of prior probabilities about the relative merits of the two hypotheses, which is summarized in a "prior odds ratio," $P_{01} = \text{Prob}[H_0]/\text{Prob}[H_1]$. After gathering our data, we construct the Bayes factor, which summarizes the weight of the sample evidence in favor of one model or the other. After the data have been analyzed, we have our "posterior odds ratio," $P_{01} \mid \text{data} = \text{Bayes factor} \times P_{01}$. The upshot is that ex post, neither model is discarded; we have merely revised our assessment of the comparative likelihood of the two in the face of the sample data. Some of the formalities of this approach are discussed in Chapter 16.
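The updating rule is simple enough to state in a few lines. The sketch below assumes that $H_0$ and $H_1$ are the only two candidates, so that odds can be converted back to a probability; the Bayes factor of 3 is a made-up number, and the function names are hypothetical.

```python
# Sketch: posterior odds = Bayes factor x prior odds.

def posterior_odds(prior_odds, bayes_factor):
    """Posterior odds ratio for H0 against H1."""
    return bayes_factor * prior_odds

def odds_to_probability(odds):
    """Prob[H0] implied by the odds, ASSUMING H0 and H1 exhaust the possibilities."""
    return odds / (1.0 + odds)

prior = 0.5 / 0.5                                  # equal prior probabilities
post = posterior_odds(prior, bayes_factor=3.0)     # hypothetical Bayes factor
print(post, odds_to_probability(post))
```

Neither model is rejected here; the data merely shift the relative weight placed on each.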
    8.3.1 TESTING NONNESTED HYPOTHESES

    A useful distinction between hypothesis testing as discussed in the preceding chapters and model selection as considered here will turn on the asymmetry between the null and alternative hypotheses that is a part of the classical testing procedure.[4] Since, by construction, the classical procedures seek evidence in the sample to refute the "null" hypothesis, how one frames the null can be crucial to the outcome. Fortunately, the Neyman-Pearson methodology provides a prescription; the null is usually cast as the narrowest model in the set under consideration. On the other hand, the classical procedures never reach a sharp conclusion. Unless the significance level of the testing procedure is made so high as to exclude all alternatives, there will always remain the possibility of a type I error. As such, the null is never rejected with certainty, but only with a prespecified degree of confidence. Model selection tests, in contrast, give the competing hypotheses equal standing. There is no natural null hypothesis. However, the end of the process is a firm decision; in testing (8-12a, b), one of the models will be rejected and the other will be retained; the analysis will then proceed in the framework of that one model and not the other. Indeed, it cannot proceed until one of the models is discarded. It is common, for example, in this new setting for the analyst first to test with one model cast as the null, then with the other. Unfortunately, given the way the tests are constructed, it can happen that both or neither model is rejected; in either case, further analysis is clearly warranted. As we shall see, the science is a bit inexact.

    The earliest work on nonnested hypothesis testing, notably Cox (1961, 1962), was done in the framework of sample likelihoods and maximum likelihood procedures. Recent developments have been structured around a common pillar labeled the encompassing principle [Mizon and Richard (1986)]. In the large, the principle directs attention to the question of whether a maintained model can explain the features of its competitors, that is, whether the maintained model encompasses the alternative. Yet a third approach is based on forming a comprehensive model which contains both competitors as special cases. When possible, the test between models can be based, essentially, on classical (-like) testing procedures. We will examine tests that exemplify all three approaches.
    [4] See Granger and Pesaran (2000) for discussion.


    8.3.2 AN ENCOMPASSING MODEL

    The encompassing approach is one in which the ability of one model to explain features of another is tested. Model 0 "encompasses" Model 1 if the features of Model 1 can be explained by Model 0 but the reverse is not true.[5] Since $H_0$ cannot be written as a restriction on $H_1$, none of the procedures we have considered thus far is appropriate. One possibility is an artificial nesting of the two models. Let $\bar{X}$ be the set of variables in $X$ that are not in $Z$, define $\bar{Z}$ likewise with respect to $X$, and let $W$ be the variables that the models have in common. Then $H_0$ and $H_1$ could be combined in a "supermodel":

    $$y = \bar{X}\bar{\beta} + \bar{Z}\bar{\gamma} + W\delta + \varepsilon.$$

    In principle, $H_1$ is rejected if it is found that $\bar{\gamma} = 0$ by a conventional F test, whereas $H_0$ is rejected if it is found that $\bar{\beta} = 0$. There are two problems with this approach. First, $\delta$ remains a mixture of parts of $\beta$ and $\gamma$, and it is not established by the F test that either of these parts is zero. Hence, this test does not really distinguish between $H_0$ and $H_1$; it distinguishes between $H_1$ and a hybrid model. Second, this compound model may have an extremely large number of regressors. In a time-series setting, the problem of collinearity may be severe.

    Consider an alternative approach. If $H_0$ is correct, then $y$ will, apart from the random disturbance $\varepsilon$, be fully explained by $X$. Suppose we then attempt to estimate $\gamma$ by regression of $y$ on $Z$. Whatever set of parameters is estimated by this regression, say $c$, if $H_0$ is correct, then we should estimate exactly the same coefficient vector if we were to regress $X\beta$ on $Z$, since $\varepsilon_0$ is random noise under $H_0$. Since $\beta$ must be estimated, suppose that we use $Xb$ instead and compute $c_0$. A test of the proposition that Model 0 "encompasses" Model 1 would be a test of the hypothesis that $E[c - c_0] = 0$. It is straightforward to show [see Davidson and MacKinnon (1993, pp. 384-387)] that the test can be carried out by using a standard F test to test the hypothesis that $\gamma_1 = 0$ in the augmented regression,

    $$y = X\beta + Z_1\gamma_1 + \varepsilon_1,$$

    where $Z_1$ is the variables in $Z$ that are not in $X$.
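The augmented-regression form of the encompassing test amounts to two least squares fits and the usual F ratio. The following is a minimal sketch in plain Python with simulated data; the helper names (`solve`, `ols_ssr`, `encompassing_f`) and the data generating process are assumptions made for illustration, not part of the text.

```python
import random

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols_ssr(y, X):
    """Sum of squared residuals from least squares of y on the columns of X."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    b = solve(XtX, Xty)
    return sum((yi - sum(bi * xi for bi, xi in zip(b, r))) ** 2
               for r, yi in zip(X, y))

def encompassing_f(y, X, Z1):
    """F statistic for gamma_1 = 0 in y = X beta + Z1 gamma_1 + e."""
    n, J = len(y), len(Z1[0])
    XZ = [rx + rz for rx, rz in zip(X, Z1)]    # augmented regressor matrix
    ssr_r, ssr_u = ols_ssr(y, X), ols_ssr(y, XZ)
    return ((ssr_r - ssr_u) / J) / (ssr_u / (n - len(XZ[0])))

# Simulated data (hypothetical): z is the one variable in Z that is not in X.
random.seed(0)
n = 100
x = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]
e = [random.gauss(0, 1) for _ in range(n)]
X = [[1.0, xi] for xi in x]
Z1 = [[zi] for zi in z]
y0 = [1 + 0.5 * xi + ei for xi, ei in zip(x, e)]   # Model 0 is true
y1 = [yi + 0.8 * zi for yi, zi in zip(y0, z)]      # z matters as well

F0 = encompassing_f(y0, X, Z1)
F1 = encompassing_f(y1, X, Z1)
print(F0, F1)
```

`F0` would be referred to the F[1, n - 3] table; a large value, as should occur for `F1`, is evidence that Model 0 does not encompass Model 1.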
    8.3.3 COMPREHENSIVE APPROACH: THE J TEST

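    The underpinnings of the comprehensive approach are tied to the density function as the characterization of the data generating process. Let $f_0(y_i \mid \text{data}, \beta_0)$ be the assumed density under Model 0 and define the alternative likewise as $f_1(y_i \mid \text{data}, \beta_1)$. Then, a comprehensive model which subsumes both of these is

    $$f_c(y_i \mid \text{data}, \beta_0, \beta_1) = \frac{[f_0(y_i \mid \text{data}, \beta_0)]^{1-\lambda}\,[f_1(y_i \mid \text{data}, \beta_1)]^{\lambda}}{\int_{\text{range of } y_i} [f_0(y_i \mid \text{data}, \beta_0)]^{1-\lambda}\,[f_1(y_i \mid \text{data}, \beta_1)]^{\lambda}\, dy_i}.$$

    Estimation of the comprehensive model followed by a test of $\lambda = 0$ or 1 is used to assess the validity of Model 0 or 1, respectively.[6]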
    [5] See Deaton (1982), Dastoor (1983), Gourieroux et al. (1983, 1995), and, especially, Mizon and Richard (1986).

    [6] See Section 21.4.4c for an application to the choice of probit or logit model for binary choice suggested by Silva (2001).


    The J test proposed by Davidson and MacKinnon (1981) can be shown [see Pesaran and Weeks (2001)] to be an application of this principle to the linear regression model. Their suggested alternative to the preceding compound model is

    $$y = (1 - \lambda)X\beta + \lambda(Z\gamma) + \varepsilon.$$

    In this model, a test of $\lambda = 0$ would be a test against $H_1$. The problem is that $\lambda$ cannot be separately estimated in this model; it would amount to a redundant scaling of the regression coefficients. Davidson and MacKinnon's J test consists of estimating $\gamma$ by a least squares regression of $y$ on $Z$ followed by a least squares regression of $y$ on $X$ and $Z\hat{\gamma}$, the fitted values in the first regression. A valid test, at least asymptotically, of $H_1$ is to test $H_0: \lambda = 0$. If $H_0$ is true, then plim $\hat{\lambda} = 0$. Asymptotically, the ratio $\hat{\lambda}/\text{se}(\hat{\lambda})$ (i.e., the usual t ratio) is distributed as standard normal and may be referred to the standard table to carry out the test. Unfortunately, in testing $H_0$ versus $H_1$ and vice versa, all four possibilities (reject both, neither, or either one of the two hypotheses) could occur. This issue, however, is a finite sample problem. Davidson and MacKinnon show that as $n \to \infty$, if $H_1$ is true, then the probability that $\hat{\lambda}$ will differ significantly from zero approaches 1.
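The two regressions that make up the J test can be sketched as follows, in plain Python with simulated data. The helper names and the data generating process are hypothetical. To avoid inverting a matrix, the sketch obtains the t ratio on $\hat{\lambda}$ from the exact identity that, for a single restriction, the squared t ratio equals the F statistic comparing the regressions with and without the fitted values.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(y, X):
    """Least squares of y on the columns of X: (coefficients, fitted, residuals)."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    b = solve(XtX, Xty)
    fitted = [sum(bi * xi for bi, xi in zip(b, r)) for r in X]
    return b, fitted, [yi - fi for yi, fi in zip(y, fitted)]

def j_test(y, X, Z):
    """J test of H0: y = X beta + e against H1: y = Z gamma + e."""
    _, z_fit, _ = ols(y, Z)                        # first regression: y on Z
    XA = [row + [f] for row, f in zip(X, z_fit)]   # X augmented with Z gamma-hat
    b, _, e_u = ols(y, XA)                         # second regression
    _, _, e_r = ols(y, X)                          # same regression with lambda = 0
    ssr_u = sum(e * e for e in e_u)
    ssr_r = sum(e * e for e in e_r)
    # For a single restriction the squared t ratio equals the F statistic.
    f_stat = (ssr_r - ssr_u) / (ssr_u / (len(y) - len(XA[0])))
    lam = b[-1]
    return lam, math.copysign(math.sqrt(max(f_stat, 0.0)), lam)

# Simulated data (hypothetical): the model with regressors Z is the true one.
random.seed(1)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
w = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]
y = [1 + 0.5 * xi + 0.8 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]
X = [[1.0, xi, wi] for xi, wi in zip(x, w)]
Z = [[1.0, xi, zi] for xi, zi in zip(x, z)]

lam_01, t_01 = j_test(y, X, Z)   # null: X is the correct set of regressors
lam_10, t_10 = j_test(y, Z, X)   # roles reversed
print(lam_01, t_01, lam_10, t_10)
```

Each t ratio would be compared with the standard normal critical value. Running the test in both directions, as here, is what allows the "reject both" and "reject neither" outcomes noted above.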
    Example 8.2 J Test for a Consumption Function

    Gaver and Geisel (1974) propose two forms of a consumption function:

    $$H_0: C_t = \beta_1 + \beta_2Y_t + \beta_3Y_{t-1} + \varepsilon_{0t}$$

    and

    $$H_1: C_t = \gamma_1 + \gamma_2Y_t + \gamma_3C_{t-1} + \varepsilon_{1t}.$$

    The first model states that consumption responds to changes in income over two periods, whereas the second states that the effects of changes in income on consumption persist for many periods. Quarterly data on aggregate U.S. real consumption and real disposable income are given in Table F5.1. Here we apply the J test to these data and the two proposed specifications. First, the two models are estimated separately (using observations 1950.2 to 2000.4). The least squares regression of C on a constant, Y, lagged Y, and the fitted values from the second model produces an estimate of $\lambda$ of 1.0145 with a t ratio of 62.861. Thus, $H_0$ should be rejected in favor of $H_1$. But reversing the roles of $H_0$ and $H_1$, we obtain an estimate of $\lambda$ of -10.677 with a t ratio of -7.188. Thus, $H_1$ is rejected as well.[7]
    8.3.4 THE COX TEST[8]

    Likelihood ratio tests rely on three features of the density of the random variable of interest. First, under the null hypothesis, the average log density of the null hypothesis will be less than under the alternative; this is a consequence of the fact that the null model is nested within the alternative. Second, the degrees of freedom for the chi-squared statistic is the reduction in the dimension of the parameter space that is specified by the null hypothesis, compared to the alternative. Third, in order to carry out the test, under the null hypothesis, the test statistic must have a known distribution which is free of the model parameters under the alternative hypothesis. When the models are
    [7] For related discussion of this possibility, see McAleer, Fisher, and Volker (1982).

    [8] The Cox test is based upon the likelihood ratio statistic, which will be developed in Chapter 17. The results for the linear regression model, however, are based on sums of squared residuals, and therefore rely on nothing more than least squares, which is already familiar.


    nonnested, none of these requirements will be met. The first need not hold at all. With regard to the second, the parameter space under the null model may well be larger than (or, at least the same size as) under the alternative. Merely reversing the two models does not solve this problem. The test must be able to work in both directions. Finally, because of the symmetry of the null and alternative hypotheses, the distributions of likelihood based test statistics will generally be functions of the parameters of the alternative model. Cox's (1961, 1962) analysis of this problem produced a reformulated test statistic that is based on the standard normal distribution and is centered at zero.[9]

    Versions of the Cox test appropriate for the linear and nonlinear regression models have been derived by Pesaran (1974) and Pesaran and Deaton (1978). The latter present a test statistic for testing linear versus loglinear models that is extended in Aneuryn-Evans and Deaton (1980). Since in the classical regression model the least squares estimator is also the maximum likelihood estimator, it is perhaps not surprising that Davidson and MacKinnon (1981, p. 789) find that their test statistic is asymptotically equal to the negative of the Cox-Pesaran and Deaton statistic.

    The Cox statistic for testing the hypothesis that $X$ is the correct set of regressors and that $Z$ is not is

    $$c_{01} = \frac{n}{2}\ln\left[\frac{s_Z^2}{s_X^2 + (1/n)\,b'X'M_ZXb}\right] = \frac{n}{2}\ln\left[\frac{s_Z^2}{s_{ZX}^2}\right], \tag{8-13}$$

    where

    $M_Z = I - Z(Z'Z)^{-1}Z'$,
    $M_X = I - X(X'X)^{-1}X'$,
    $b = (X'X)^{-1}X'y$,
    $s_Z^2 = e_Z'e_Z/n$ = mean-squared residual in the regression of $y$ on $Z$,
    $s_X^2 = e_X'e_X/n$ = mean-squared residual in the regression of $y$ on $X$,
    $s_{ZX}^2 = s_X^2 + b'X'M_ZXb/n$.

    The hypothesis is tested by comparing

    $$q = \frac{c_{01}}{\{\text{Est.Var}[c_{01}]\}^{1/2}} = \frac{c_{01}}{\sqrt{\dfrac{s_X^2}{s_{ZX}^4}\,b'X'M_ZM_XM_ZXb}} \tag{8-14}$$

    to the critical value from the standard normal table. A large value of $q$ is evidence against the null hypothesis ($H_0$).

    The Cox test appears to involve an impressive amount of matrix algebra. But the algebraic results are deceptive. One needs only to compute linear regressions and retrieve fitted values and sums of squared residuals. The following does the first test. The roles of $X$ and $Z$ are reversed for the second.
    1. Regress $y$ on $X$ to obtain $b$ and $\hat{y}_X = Xb$, $e_X = y - Xb$, $s_X^2 = e_X'e_X/n$.
    2. Regress $y$ on $Z$ to obtain $d$ and $\hat{y}_Z = Zd$, $e_Z = y - Zd$, $s_Z^2 = e_Z'e_Z/n$.
    3. Regress $\hat{y}_X$ on $Z$ to obtain $d_X$ and $e_{Z.X} = \hat{y}_X - Zd_X = M_ZXb$, $e_{Z.X}'e_{Z.X} = b'X'M_ZXb$.
    4. Regress $e_{Z.X}$ on $X$ and compute residuals $e_{X.ZX}$, $e_{X.ZX}'e_{X.ZX} = b'X'M_ZM_XM_ZXb$.
    5. Compute $s_{ZX}^2 = s_X^2 + e_{Z.X}'e_{Z.X}/n$.
    6. Compute $c_{01} = \dfrac{n}{2}\log\dfrac{s_Z^2}{s_{ZX}^2}$, $v_{01} = \dfrac{s_X^2\,(e_{X.ZX}'e_{X.ZX})}{s_{ZX}^4}$, $q = \dfrac{c_{01}}{\sqrt{v_{01}}}$.

    Therefore, the Cox statistic can be computed simply by computing a series of least squares regressions.

    [9] See Pesaran and Weeks (2001) for some of the formalities of these results.
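The six-step recipe translates almost line for line into code. The sketch below is plain Python with simulated data; the helper names and the data generating process are assumptions for illustration only. Each call to `ols` corresponds to one of the regressions in the list.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(y, X):
    """Least squares of y on the columns of X: (fitted values, residuals)."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    b = solve(XtX, Xty)
    fitted = [sum(bi * xi for bi, xi in zip(b, r)) for r in X]
    return fitted, [yi - fi for yi, fi in zip(y, fitted)]

def ssq(v):
    """Sum of squares of the elements of v."""
    return sum(e * e for e in v)

def cox_test(y, X, Z):
    """Cox statistic for the null that X is the correct set: (c01, v01, q)."""
    n = len(y)
    yx_fit, e_x = ols(y, X)                  # step 1
    _, e_z = ols(y, Z)                       # step 2
    s2_x, s2_z = ssq(e_x) / n, ssq(e_z) / n
    _, e_zx = ols(yx_fit, Z)                 # step 3: M_Z X b
    _, e_xzx = ols(e_zx, X)                  # step 4: M_X M_Z X b
    s2_zx = s2_x + ssq(e_zx) / n             # step 5
    c01 = (n / 2) * math.log(s2_z / s2_zx)   # step 6
    v01 = s2_x * ssq(e_xzx) / s2_zx ** 2
    return c01, v01, c01 / math.sqrt(v01)

# Simulated data (hypothetical): the model with regressors Z is the true one.
random.seed(2)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
w = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]
y = [1 + 0.5 * xi + 0.8 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]
X = [[1.0, xi, wi] for xi, wi in zip(x, w)]
Z = [[1.0, xi, zi] for xi, zi in zip(x, z)]

c01, v01, q01 = cox_test(y, X, Z)   # null: X is correct
c10, v10, q10 = cox_test(y, Z, X)   # roles of X and Z reversed
print(q01, q10)
```

When the alternative fits much better than the null, $s_Z^2$ is far below $s_{ZX}^2$, so $c_{01}$ is negative and $q$ can be very large in absolute value.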
    Example 8.3 Cox Test for a Consumption Function

    We continue the previous example by applying the Cox test to the data of Example 8.2. For purposes of the test, let $X = [\mathbf{i}\ \ \mathbf{y}\ \ \mathbf{y}_{-1}]$ and $Z = [\mathbf{i}\ \ \mathbf{y}\ \ \mathbf{c}_{-1}]$. Using the notation of (8-13) and (8-14), we find that

    $$s_X^2 = 7{,}556.657, \qquad s_Z^2 = 456.3751,$$
    $$b'X'M_ZXb = 167.50707, \qquad b'X'M_ZM_XM_ZXb = 2.61944,$$
    $$s_{ZX}^2 = 7{,}556.657 + 167.50707/203 = 7{,}557.483.$$

    Thus,

    $$c_{01} = \frac{203}{2}\ln\left(\frac{456.3751}{7{,}557.483}\right) = -284.908$$

    and

    $$\text{Est.Var}[c_{01}] = \frac{7{,}556.657\,(2.61944)}{7{,}557.483^2} = 0.00034656.$$

    Thus, $q = -15{,}304.281$. On this basis, we reject the hypothesis that $X$ is the correct set of regressors. Note in the previous example that we reached the same conclusion based on a t ratio of 62.861. As expected, the result has the opposite sign from the corresponding J statistic in the previous example. Now we reverse the roles of $X$ and $Z$ in our calculations. Letting $d$ denote the least squares coefficients in the regression of consumption on $Z$, we find that

    $$d'Z'M_XZd = 1{,}418{,}985.185, \qquad d'Z'M_XM_ZM_XZd = 22{,}189.811,$$
    $$s_{XZ}^2 = 456.3751 + 1{,}418{,}985.185/203 = 7{,}446.4499.$$

    Thus,

    $$c_{10} = \frac{203}{2}\ln\left(\frac{7{,}556.657}{7{,}446.4499}\right) = 1.491$$

    and

    $$\text{Est.Var}[c_{10}] = \frac{456.3751\,(22{,}189.811)}{7{,}446.4499^2} = 0.18263.$$

    This computation produces a value of $q = 3.489$, which is smaller in absolute value than its counterpart in Example 8.2, -7.188. Since 3.489 exceeds the 5 percent critical value of 1.96, we once again reject the hypothesis that $Z$ is the preferred set of regressors, though the results do strongly favor $Z$ in qualitative terms.


    Pesaran and Hall (1988) have extended the Cox test to testing which of two nonnested restricted regressions is preferred. The modeling framework is

    $$H_0: y = X_0\beta_0 + \varepsilon_0, \quad \text{Var}[\varepsilon_0 \mid X_0] = \sigma_0^2 I, \quad \text{subject to } R_0\beta_0 = q_0,$$
    $$H_1: y = X_1\beta_1 + \varepsilon_1, \quad \text{Var}[\varepsilon_1 \mid X_1] = \sigma_1^2 I, \quad \text{subject to } R_1\beta_1 = q_1.$$

    Like its counterpart for unrestricted regressions, this Cox test requires a large amount of matrix algebra. However, once again, it reduces to a sequence of regressions, though this time with some unavoidable matrix manipulation remaining. Let

    $$G_i = (X_i'X_i)^{-1} - (X_i'X_i)^{-1}R_i'[R_i(X_i'X_i)^{-1}R_i']^{-1}R_i(X_i'X_i)^{-1}, \quad i = 0, 1,$$

    and $T_i = X_iG_iX_i'$, $m_i = \text{rank}(R_i)$, $k_i = \text{rank}(X_i)$, $h_i = k_i - m_i$, and $d_i = n - h_i$, where $n$ is the sample size. The following steps produce the needed statistics:

    1. Compute $e_i$ = the residuals from the restricted regression, $i = 0, 1$.
    2. Compute $e_{10}$ by computing the residuals from the restricted regression of $y - e_0$ on $X_1$. Compute $e_{01}$ likewise by reversing the subscripts.
    3. Compute $e_{100}$ as the residuals from the restricted regression of $y - e_{10}$ on $X_0$ and $e_{110}$ likewise by reversing the subscripts.

    Let $v_i$, $v_{ij}$, and $v_{ijk}$ denote the sums of squared residuals in Steps 1, 2, and 3 and let $s_i^2 = e_i'e_i/d_i$.

    4. Compute $\text{trace}(B_0^2) = h_1 - \text{trace}[(T_0T_1)^2] - \{h_1 - \text{trace}[(T_0T_1)^2]\}^2/(n - h_0)$ and $\text{trace}(B_1^2)$ likewise by reversing subscripts.
    5. Compute $s_{10}^2 = (v_{10} + s_0^2\,\text{trace}[I - T_0 - T_1 + T_0T_1])$ and $s_{01}^2$ likewise.

    The authors propose several statistics. A Wald test based on Godfrey and Pesaran (1983) is based on the difference between an estimator of $\sigma_1^2$ and the probability limit of this estimator assuming that $H_0$ is true,

    $$W_0 = \frac{\sqrt{n}\,(v_1 - v_0 - v_{10})}{\sqrt{4v_0v_{100}}}.$$

    Under the null hypothesis of Model 0, the limiting distribution of $W_0$ is standard normal. An alternative statistic based on Cox's likelihood approach is

    $$N_0 = \frac{(n/2)\ln(s_1^2/s_{10}^2)}{\sqrt{4v_{100}s_0^2/(s_{10}^2)^2}}.$$

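    Example 8.4 Cox Test for Restricted Regressions

    The example they suggest is two competing models for expected inflation, $P_t^e$, based on commonly used lag structures involving lags of $P_t^e$ and current and lagged values of actual inflation, $P_t$:

    $$\text{(Regressive):}\quad P_t^e = P_t + \theta_1(P_t - P_{t-1}) + \theta_2(P_{t-1} - P_{t-2}) + \varepsilon_{0t},$$
    $$\text{(Adaptive):}\quad P_t^e = P_{t-1}^e + \lambda_1(P_t - P_{t-1}^e) + \lambda_2(P_{t-1} - P_{t-2}^e) + \varepsilon_{1t}.$$

    By formulating these models as

    $$y_t = \beta_1P_{t-1}^e + \beta_2P_{t-2}^e + \beta_3P_t + \beta_4P_{t-1} + \beta_5P_{t-2} + \varepsilon_t,$$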


    they show that the hypotheses are

    $$H_0: \beta_1 = \beta_2 = 0, \quad \beta_3 + \beta_4 + \beta_5 = 1,$$
    $$H_1: \beta_1 + \beta_3 = 1, \quad \beta_2 + \beta_4 = 0, \quad \beta_5 = 0.$$

    Pesaran and Hall's analysis was based on quarterly data for British manufacturing from 1972 to 1981. The data appear in the Appendix to Pesaran (1987) and are reproduced in Table F8.1. Using their data, the computations listed before produce the following results:

                      W0        N0
    Null is H0      3.887     2.437
    Null is H1      0.134     0.032

    These results fairly strongly support Model 1 and lead to rejection of Model 0.[10]

    [10] Our results differ somewhat from Pesaran and Hall's. For the first row of the table, they reported (2.180, 1.690) and for the second, (2.456, 1.907). They reach the same conclusion, but the numbers do differ substantively. We have been unable to resolve the difference.

    8.4 MODEL SELECTION CRITERIA

    The preceding discussion suggested some approaches to model selection based on nonnested hypothesis tests. Fit measures and testing procedures based on the sum of squared residuals, such as $R^2$ and the Cox test, are useful when interest centers on the within-sample fit or within-sample prediction of the dependent variable. When the model building is directed toward forecasting, within-sample measures are not necessarily optimal. As we have seen, $R^2$ cannot fall when variables are added to a model, so there is a built-in tendency to overfit the model. This criterion may point us away from the best forecasting model, because adding variables to a model may increase the variance of the forecast error (see Section 6.6) despite the improved fit to the data. With this thought in mind, the adjusted $R^2$,

    $$\bar{R}^2 = 1 - \frac{n-1}{n-K}(1 - R^2) = 1 - \frac{n-1}{n-K}\left(\frac{e'e}{\sum_{i=1}^n (y_i - \bar{y})^2}\right), \tag{8-15}$$

    has been suggested as a fit measure that appropriately penalizes the loss of degrees of freedom that result from adding variables to the model. Note that $\bar{R}^2$ may fall when a variable is added to a model if the sum of squares does not fall fast enough. (The applicable result appears in Theorem 3.7; $\bar{R}^2$ does not rise when a variable is added to a model unless the t ratio associated with that variable exceeds one in absolute value.) The adjusted $R^2$ has been found to be a preferable fit measure for assessing the fit of forecasting models. [See Diebold (1998b, p. 87), who argues that the simple $R^2$ has a downward bias as a measure of the out-of-sample, one-step-ahead prediction error variance.]

    The adjusted $R^2$ penalizes the loss of degrees of freedom that occurs when a model is expanded. There is, however, some question about whether the penalty is sufficiently large to ensure that the criterion will necessarily lead the analyst to the correct model (assuming that it is among the ones considered) as the sample size increases. Two alternative fit measures that have been suggested are the Akaike information criterion,

    $$\text{AIC}(K) = s_y^2(1 - R^2)e^{2K/n}, \tag{8-16}$$


    and the Schwarz or Bayesian information criterion,

    $$\text{BIC}(K) = s_y^2(1 - R^2)n^{K/n}. \tag{8-17}$$

    (There is no degrees of freedom correction in $s_y^2$.) Both measures improve (decline) as $R^2$ increases, but, everything else constant, degrade as the model size increases. Like $\bar{R}^2$, these measures place a premium on achieving a given fit with a smaller number of parameters per observation, $K/n$. Logs are usually more convenient; the measures reported by most software are

    $$\text{AIC}(K) = \log\left(\frac{e'e}{n}\right) + \frac{2K}{n}, \tag{8-18}$$

    $$\text{BIC}(K) = \log\left(\frac{e'e}{n}\right) + \frac{K\log n}{n}. \tag{8-19}$$

    Both prediction criteria have their virtues, and neither has an obvious advantage over the other. [See Diebold (1998b, p. 90).] The Schwarz criterion, with its heavier penalty for degrees of freedom lost, will lean toward a simpler model. All else given, simplicity does have some appeal.
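A short sketch shows how the three penalized measures in (8-15), (8-18), and (8-19) behave when a model is enlarged. The sums of squares below are made-up numbers chosen so that three added regressors reduce the residual sum of squares only slightly; the function name `fit_criteria` is hypothetical.

```python
import math

def fit_criteria(ssr, sst, n, k):
    """Return (R2, adjusted R2, log AIC, log BIC) for a model with k coefficients.

    ssr is e'e and sst is the total sum of squares around the mean of y.
    """
    r2 = 1.0 - ssr / sst
    adj_r2 = 1.0 - (n - 1) / (n - k) * (1.0 - r2)    # (8-15)
    aic = math.log(ssr / n) + 2.0 * k / n             # (8-18)
    bic = math.log(ssr / n) + k * math.log(n) / n     # (8-19)
    return r2, adj_r2, aic, bic

n, sst = 100, 500.0                                  # hypothetical sample
small = fit_criteria(ssr=120.0, sst=sst, n=n, k=3)
large = fit_criteria(ssr=118.0, sst=sst, n=n, k=6)   # three extra regressors
print(small)
print(large)
```

With these numbers $R^2$ rises for the larger model, while the adjusted $R^2$ falls and both information criteria increase, so all three penalized measures point to the smaller model. The BIC penalty exceeds the AIC penalty whenever $\log n > 2$.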

    8.5 SUMMARY AND CONCLUSIONS

    This is the last of seven chapters that we have devoted specifically to the most heavily used tool in econometrics, the classical linear regression model. We began in Chapter 2 with a statement of the regression model. Chapter 3 then described computation of the parameters by least squares, a purely algebraic exercise. Chapters 4 and 5 reinterpreted least squares as an estimator of an unknown parameter vector, and described the finite sample and large sample characteristics of the sampling distribution of the estimator. Chapters 6 and 7 were devoted to building and sharpening the regression model, with tools for developing the functional form and statistical results for testing hypotheses about the underlying population. In this chapter, we have examined some broad issues related to model specification and selection of a model among a set of competing alternatives. The concepts considered here are tied very closely to one of the pillars of the paradigm of econometrics, that underlying the model is a theoretical construction, a set of true behavioral relationships that constitute the model. It is only on this notion that the concepts of bias and biased estimation and model selection make any sense; "bias" as a concept can only be described with respect to some underlying "model" against which an estimator can be said to be biased. That is, there must be a yardstick. This concept is a central result in the analysis of specification, where we considered the implications of underfitting (omitting variables) and overfitting (including superfluous variables) the model. We concluded this chapter (and our discussion of the classical linear regression model) with an examination of procedures that are used to choose among competing model specifications.


    Key Terms and Concepts

    • Adjusted R-squared
    • Akaike criterion
    • Biased estimator
    • Comprehensive model
    • Cox test
    • Encompassing principle
    • General-to-simple strategy
    • Inclusion of superfluous variables
    • J test
    • Mean squared error
    • Model selection
    • Nonnested models
    • Omission of relevant variables
    • Omitted variable formula
    • Prediction criterion
    • Pretest estimator
    • Schwarz criterion
    • Simple-to-general
    • Specification analysis
    • Stepwise model building

    Exercises

    1. Suppose the true regression model is given by (8-2). The result in (8-4) shows that if either $P_{1.2}$ is nonzero or $\beta_2$ is nonzero, then regression of $y$ on $X_1$ alone produces a biased and inconsistent estimator of $\beta_1$. Suppose the objective is to forecast $y$, not to estimate the parameters. Consider regression of $y$ on $X_1$ alone to estimate $\beta_1$ with $b_1$ (which is biased). Is the forecast of $y$ computed using $X_1b_1$ also biased? Assume that $E[X_2 \mid X_1]$ is a linear function of $X_1$. Discuss your findings generally. What are the implications for prediction when variables are omitted from a regression?

    2. Compare the mean squared errors of $b_1$ and $b_{1.2}$ in Section 8.2.2. (Hint: The comparison depends on the data and the model parameters, but you can devise a compact expression for the two quantities.)

    3. The J test in Example 8.2 is carried out using over 50 years of data. It is optimistic to hope that the underlying structure of the economy did not change in 50 years. Does the result of the test carried out in Example 8.2 persist if it is based on data only from 1980 to 2000? Repeat the computation with this subset of the data.

    4. The Cox test in Example 8.3 has the same difficulty as the J test in Example 8.2. The sample period might be too long for the test not to have been affected by underlying structural change. Repeat the computations using the 1980 to 2000 data.


    9 NONLINEAR REGRESSION MODELS

    9.1 INTRODUCTION

    Although the linear model is flexible enough to allow great variety in the shape of the regression, it still rules out many useful functional forms. In this chapter, we examine regression models that are intrinsically nonlinear in their parameters. This allows a much wider range of functional forms than the linear model can accommodate.[1]

    9.2 NONLINEAR REGRESSION MODELS

    The general form of the nonlinear regression model is

    $$y_i = h(x_i, \beta) + \varepsilon_i. \tag{9-1}$$

    The linear model is obviously a special case. Moreover, some models which appear to be nonlinear, such as

    $$y = e^{\beta_0} x_1^{\beta_1} x_2^{\beta_2} e^{\varepsilon},$$

    become linear after a transformation, in this case after taking logarithms. In this chapter, we are interested in models for which there is no such transformation, such as the ones in the following examples.
    Example 9.1 CES Production Function

    In Example 7.5, we examined a constant elasticity of substitution production function model:

    $$\ln y = \ln\gamma - \frac{\nu}{\rho}\ln\left[\delta K^{-\rho} + (1-\delta)L^{-\rho}\right] + \varepsilon.$$

    No transformation renders this equation linear in the parameters. We did find, however, that a linear Taylor series approximation to this function around the point $\rho = 0$ produced an intrinsically linear equation that could be fit by least squares. Nonetheless, the true model is nonlinear in the sense that interests us in this chapter.
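The role of the expansion point $\rho = 0$ can be checked numerically. The sketch below evaluates the deterministic part of the CES function and compares it, for a very small $\rho$, with the loglinear (Cobb-Douglas) form $\ln\gamma + \nu[\delta\ln K + (1-\delta)\ln L]$. That limiting form is a standard result rather than something stated in the passage above, and the parameter values and function names are made up.

```python
import math

def ces_log_output(K, L, gamma, nu, delta, rho):
    """Deterministic part of ln y in the CES model (rho must be nonzero)."""
    inner = delta * K ** (-rho) + (1.0 - delta) * L ** (-rho)
    return math.log(gamma) - (nu / rho) * math.log(inner)

def loglinear_limit(K, L, gamma, nu, delta):
    """Limit of the CES function as rho -> 0, which is linear in the logs."""
    return math.log(gamma) + nu * (delta * math.log(K) + (1.0 - delta) * math.log(L))

K, L = 2.0, 3.0                        # hypothetical input levels
gamma, nu, delta = 1.5, 1.0, 0.4       # hypothetical parameter values
near_zero = ces_log_output(K, L, gamma, nu, delta, rho=1e-6)
limit = loglinear_limit(K, L, gamma, nu, delta)
away = ces_log_output(K, L, gamma, nu, delta, rho=2.0)
print(near_zero, limit, away)
```

For $\rho$ away from zero the two expressions differ, which is why no transformation of the data alone makes the CES model linear in its parameters.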
    [1] A complete discussion of this subject can be found in Amemiya (1985). Other important references are Jennrich (1969), Malinvaud (1970), and especially Goldfeld and Quandt (1971, 1972). A very lengthy authoritative treatment is the text by Davidson and MacKinnon (1993).

    Example 9.2 Translog Demand System

    Christensen, Jorgenson, and Lau (1975) proposed the translog indirect utility function for a consumer allocating a budget among K commodities:

    $$\ln V = \beta_0 + \sum_{k=1}^{K}\beta_k\ln(p_k/M) + \sum_{k=1}^{K}\sum_{l=1}^{K}\gamma_{kl}\ln(p_k/M)\ln(p_l/M),$$


    where $V$ is indirect utility, $p_k$ is the price for the kth commodity, and $M$ is income. Roy's identity applied to this logarithmic function produces a budget share equation for the kth commodity that is of the form

    $$S_k = -\frac{\partial\ln V/\partial\ln p_k}{\partial\ln V/\partial\ln M} = \frac{\beta_k + \sum_{j=1}^{K}\gamma_{kj}\ln(p_j/M)}{\beta_M + \sum_{j=1}^{K}\gamma_{Mj}\ln(p_j/M)}, \quad k = 1, \ldots, K,$$

    where $\beta_M = \sum_k\beta_k$ and $\gamma_{Mj} = \sum_k\gamma_{kj}$. No transformation of the budget share equation produces a linear model. This is an intrinsically nonlinear regression model. (It is also one among a system of equations, an aspect we will ignore for the present.)
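Two properties of the share equation can be verified directly from its form: the K shares sum to one, and multiplying every $\beta_k$ and $\gamma_{kj}$ by the same nonzero constant leaves each share unchanged. The sketch below checks both for a two-good case; the parameter values, prices, and the function name `budget_share` are hypothetical.

```python
import math

def budget_share(k, beta, gamma, prices, income):
    """Deterministic translog budget share for commodity k (0-based index)."""
    K = len(beta)
    lp = [math.log(p / income) for p in prices]
    num = beta[k] + sum(gamma[k][j] * lp[j] for j in range(K))
    beta_M = sum(beta)                                                 # sum of beta_k
    gamma_M = [sum(gamma[i][j] for i in range(K)) for j in range(K)]   # column sums
    den = beta_M + sum(gamma_M[j] * lp[j] for j in range(K))
    return num / den

beta = [0.3, 0.7]                              # made-up parameter values
gamma = [[0.05, -0.02], [-0.02, 0.04]]
prices, income = [1.2, 0.8], 10.0

shares = [budget_share(k, beta, gamma, prices, income) for k in range(2)]

c = 2.5                                        # any nonzero scale factor
beta_c = [c * b for b in beta]
gamma_c = [[c * g for g in row] for row in gamma]
shares_c = [budget_share(k, beta_c, gamma_c, prices, income) for k in range(2)]
print(shares, shares_c)
```

Because the scaled and unscaled parameter vectors imply the same shares, the data cannot distinguish between them without a normalization.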
    9.2.1 ASSUMPTIONS OF THE NONLINEAR REGRESSION MODEL

    We shall require a somewhat more formal definition of a nonlinear regression model. Sufficient for our purposes will be the following, which include the linear model as the special case noted earlier. We assume that there is an underlying probability distribution, or data generating process (DGP), for the observable $y_i$ and a true parameter vector, $\beta$, which is a characteristic of that DGP. The following are the assumptions of the nonlinear regression model:

    1. Functional form: The conditional mean function for $y_i$ given $x_i$ is

    $$E[y_i \mid x_i] = h(x_i, \beta), \quad i = 1, \ldots, n,$$

    where $h(x_i, \beta)$ is a twice continuously differentiable function.

    2. Identifiability of the model parameters: The parameter vector in the model is identified (estimable) if there is no nonzero parameter $\beta^0 \neq \beta$ such that $h(x_i, \beta^0) = h(x_i, \beta)$ for all $x_i$. In the linear model, this was the full rank assumption, but the simple absence of "multicollinearity" among the variables in $x$ is not sufficient to produce this condition in the nonlinear regression model. Note that the model given in Example 9.2 is not identified. If the parameters in the model are all multiplied by the same nonzero constant, the same conditional mean function results. This condition persists even if all the variables in the model are linearly independent. The indeterminacy was removed in the study cited by imposing the normalization $\beta_M = 1$.

    3. Zero mean of the disturbance: It follows from Assumption 1 that we may write

    $$y_i = h(x_i, \beta) + \varepsilon_i,$$

    where $E[\varepsilon_i \mid h(x_i, \beta)] = 0$. This states that the disturbance at observation i is uncorrelated with the conditional mean function for all observations in the sample. This is not quite the same as assuming that the disturbances and the exogenous variables are uncorrelated, which is the familiar assumption, however. We will return to this point below.

    4. Homoscedasticity and nonautocorrelation: As in the linear model, we assume conditional homoscedasticity,

    $$E[\varepsilon_i^2 \mid h(x_j, \beta),\ j = 1, \ldots, n] = \sigma^2, \text{ a finite constant,} \tag{9-2}$$

    and nonautocorrelation,

    $$E[\varepsilon_i\varepsilon_j \mid h(x_i, \beta), h(x_j, \beta),\ j = 1, \ldots, n] = 0 \quad \text{for all } j \neq i.$$


    5. Data generating process: The data generating process for $\mathbf{x}_i$ is assumed to be a well behaved population such that first and second moments of the data can be assumed to converge to fixed, finite population counterparts. The crucial assumption is that the process generating $\mathbf{x}_i$ is strictly exogenous to that generating $\varepsilon_i$. The data on $\mathbf{x}_i$ are assumed to be "well behaved."

    6. Underlying probability model: There is a well defined probability distribution generating $\varepsilon_i$. At this point, we assume only that this process produces a sample of uncorrelated, identically (marginally) distributed random variables $\varepsilon_i$ with mean 0 and variance $\sigma^2$ conditioned on $h(\mathbf{x}_i, \boldsymbol{\beta})$. Thus, at this point, our statement of the model is semiparametric. (See Section 16.3.) We will not be assuming any particular distribution for $\varepsilon_i$. The conditional moment assumptions in 3 and 4 will be sufficient for the results in this chapter. In Chapter 17, we will fully parameterize the model by assuming that the disturbances are normally distributed. This will allow us to be more specific about certain test statistics and, in addition, allow some generalizations of the regression model. The assumption is not necessary here.
    9.2.2 THE ORTHOGONALITY CONDITION AND THE SUM OF SQUARES

    Assumptions 1 and 3 imply that $E[\varepsilon_i \mid h(\mathbf{x}_i, \boldsymbol{\beta})] = 0$. In the linear model, it follows, because of the linearity of the conditional mean, that $\varepsilon_i$ and $\mathbf{x}_i$ itself are uncorrelated. However, uncorrelatedness of $\varepsilon_i$ with a particular nonlinear function of $\mathbf{x}_i$ (the regression function) does not necessarily imply uncorrelatedness with $\mathbf{x}_i$ itself, nor, for that matter, with other nonlinear functions of $\mathbf{x}_i$. On the other hand, the results we will obtain below for the behavior of the estimator in this model are couched not in terms of $\mathbf{x}_i$ but in terms of certain functions of $\mathbf{x}_i$ (the derivatives of the regression function), so, in point of fact, $E[\boldsymbol{\varepsilon} \mid \mathbf{X}] = \mathbf{0}$ is not even the assumption we need.

    The foregoing is not a theoretical fine point. Dynamic models, which are very common in the contemporary literature, would greatly complicate this analysis. If it can be assumed that $\varepsilon_i$ is strictly uncorrelated with any prior information in the model, including previous disturbances, then perhaps a treatment analogous to that for the linear model would apply. But the convergence results needed to obtain the asymptotic properties of the estimator still have to be strengthened. The dynamic nonlinear regression model is beyond the reach of our treatment here. Strict independence of $\varepsilon_i$ and $\mathbf{x}_i$ would be sufficient for uncorrelatedness of $\varepsilon_i$ and every function of $\mathbf{x}_i$, but, again, in a dynamic model, this assumption might be questionable. Some commentary on this aspect of the nonlinear regression model may be found in Davidson and MacKinnon (1993).

    If the disturbances in the nonlinear model are normally distributed, then the log of the normal density for the $i$th observation will be

    $$\ln f(y_i \mid \mathbf{x}_i, \boldsymbol{\beta}, \sigma^2) = -\frac{1}{2}\left[\ln 2\pi + \ln \sigma^2 + \frac{\varepsilon_i^2}{\sigma^2}\right]. \tag{9-3}$$

    For this special case, we have from item D.2 in Theorem 17.2 (on maximum likelihood estimation) that the derivatives of the log density with respect to the parameters have mean zero. That is,

    $$E\left[\frac{\partial \ln f(y_i \mid \mathbf{x}_i, \boldsymbol{\beta}, \sigma^2)}{\partial \boldsymbol{\beta}}\right] = E\left[\frac{1}{\sigma^2}\left(\frac{\partial h(\mathbf{x}_i, \boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\right)\varepsilon_i\right] = \mathbf{0}, \tag{9-4}$$


    so, in the normal case, the derivatives and the disturbances are uncorrelated. Whether this can be assumed to hold in other cases is going to be model specific, but under reasonable conditions, we would assume so. [See Ruud (2000, p. 540).]

    In the context of the linear model, the orthogonality condition $E[\mathbf{x}_i \varepsilon_i] = \mathbf{0}$ produces least squares as a GMM estimator for the model. (See Chapter 18.) The orthogonality condition is that the regressors and the disturbance in the model are uncorrelated. In this setting, the same condition applies to the first derivatives of the conditional mean function. The result in (9-4) produces a moment condition which will define the nonlinear least squares estimator as a GMM estimator.
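    Example 9.3 First-Order Conditions for a Nonlinear Model

    The first-order conditions for estimating the parameters of the nonlinear model,

    $$y_i = \beta_1 + \beta_2 e^{\beta_3 x_i} + \varepsilon_i,$$

    by nonlinear least squares [see (9-10)] are

    $$\frac{\partial S(\mathbf{b})}{\partial b_1} = -\sum_{i=1}^{n}\left[y_i - b_1 - b_2 e^{b_3 x_i}\right] = 0,$$

    $$\frac{\partial S(\mathbf{b})}{\partial b_2} = -\sum_{i=1}^{n}\left[y_i - b_1 - b_2 e^{b_3 x_i}\right] e^{b_3 x_i} = 0,$$

    $$\frac{\partial S(\mathbf{b})}{\partial b_3} = -\sum_{i=1}^{n}\left[y_i - b_1 - b_2 e^{b_3 x_i}\right] b_2 x_i e^{b_3 x_i} = 0.$$

    These equations do not have an explicit solution.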
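    The three first-order conditions of Example 9.3 can be checked numerically. The following is a minimal sketch, not part of the original text: the function names and the small data set are hypothetical, and the finite-difference routine is included only to confirm that the analytic derivatives of $S(\mathbf{b}) = \frac{1}{2}\sum_i e_i^2$ have the form stated in the example.

```python
import math

def sum_of_squares(b, x, y):
    """S(b) = (1/2) * sum of squared residuals for y = b1 + b2*exp(b3*x)."""
    b1, b2, b3 = b
    return 0.5 * sum((yi - b1 - b2 * math.exp(b3 * xi)) ** 2 for xi, yi in zip(x, y))

def gradient(b, x, y):
    """Analytic first derivatives of S(b), matching the three conditions in Example 9.3."""
    b1, b2, b3 = b
    g1 = g2 = g3 = 0.0
    for xi, yi in zip(x, y):
        expo = math.exp(b3 * xi)
        resid = yi - b1 - b2 * expo
        g1 -= resid
        g2 -= resid * expo
        g3 -= resid * b2 * xi * expo
    return [g1, g2, g3]

def numerical_gradient(b, x, y, step=1e-6):
    """Central finite differences, used only as a check on the analytic form."""
    grad = []
    for k in range(len(b)):
        up = list(b)
        dn = list(b)
        up[k] += step
        dn[k] -= step
        grad.append((sum_of_squares(up, x, y) - sum_of_squares(dn, x, y)) / (2 * step))
    return grad

# Hypothetical data and trial parameter values, not taken from the text.
x = [0.0, 0.5, 1.0, 1.5, 2.0]
y = [1.9, 2.4, 3.2, 4.1, 5.8]
b = [1.0, 0.8, 0.9]
analytic = gradient(b, x, y)
numeric = numerical_gradient(b, x, y)
```

    At an arbitrary trial value of $\mathbf{b}$ the derivatives are not zero; the nonlinear least squares estimates are the values at which all three vanish together, which must be found iteratively.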

    Conceding the potential for ambiguity, we define a nonlinear regression model at this point as follows.

    DEFINITION 9.1 Nonlinear Regression Model
    A nonlinear regression model is one for which the first-order conditions for least squares estimation of the parameters are nonlinear functions of the parameters.

    Thus, nonlinearity is defined in terms of the techniques needed to estimate the parameters, not the shape of the regression function. Later we shall broaden our definition to include other techniques besides least squares.
    9.2.3 THE LINEARIZED REGRESSION

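    The nonlinear regression model is $y = h(\mathbf{x}, \boldsymbol{\beta}) + \varepsilon$. (To save some notation, we have dropped the observation subscript.) The sampling theory results that have been obtained for nonlinear regression models are based on a linear Taylor series approximation to $h(\mathbf{x}, \boldsymbol{\beta})$ at a particular value for the parameter vector, $\boldsymbol{\beta}^0$:

    $$h(\mathbf{x}, \boldsymbol{\beta}) \approx h(\mathbf{x}, \boldsymbol{\beta}^0) + \sum_{k=1}^{K} \frac{\partial h(\mathbf{x}, \boldsymbol{\beta}^0)}{\partial \beta_k^0}\left(\beta_k - \beta_k^0\right). \tag{9-5}$$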


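    This form of the equation is called the linearized regression model. By collecting terms, we obtain

    $$h(\mathbf{x}, \boldsymbol{\beta}) \approx \left[h(\mathbf{x}, \boldsymbol{\beta}^0) - \sum_{k=1}^{K} \beta_k^0 \left(\frac{\partial h(\mathbf{x}, \boldsymbol{\beta}^0)}{\partial \beta_k^0}\right)\right] + \sum_{k=1}^{K} \beta_k \left(\frac{\partial h(\mathbf{x}, \boldsymbol{\beta}^0)}{\partial \beta_k^0}\right). \tag{9-6}$$

    Let $x_k^0$ equal the $k$th partial derivative,² $\partial h(\mathbf{x}, \boldsymbol{\beta}^0)/\partial \beta_k^0$. For a given value of $\boldsymbol{\beta}^0$, $x_k^0$ is a function only of the data, not of the unknown parameters. We now have

    $$h(\mathbf{x}, \boldsymbol{\beta}) \approx \left[h^0 - \sum_{k=1}^{K} x_k^0 \beta_k^0\right] + \sum_{k=1}^{K} x_k^0 \beta_k,$$

    which may be written

    $$h(\mathbf{x}, \boldsymbol{\beta}) \approx h^0 - \mathbf{x}^{0\prime}\boldsymbol{\beta}^0 + \mathbf{x}^{0\prime}\boldsymbol{\beta},$$

    which implies that

    $$y \approx h^0 - \mathbf{x}^{0\prime}\boldsymbol{\beta}^0 + \mathbf{x}^{0\prime}\boldsymbol{\beta} + \varepsilon.$$

    By placing the known terms on the left-hand side of the equation, we obtain a linear equation:

    $$y^0 = y - h^0 + \mathbf{x}^{0\prime}\boldsymbol{\beta}^0 = \mathbf{x}^{0\prime}\boldsymbol{\beta} + \varepsilon^0. \tag{9-7}$$

    Note that $\varepsilon^0$ contains both the true disturbance, $\varepsilon$, and the error in the first-order Taylor series approximation to the true regression, shown in (9-6). That is,

    $$\varepsilon^0 = \varepsilon + \left[h(\mathbf{x}, \boldsymbol{\beta}) - \left(h^0 - \sum_{k=1}^{K} x_k^0 \beta_k^0 + \sum_{k=1}^{K} x_k^0 \beta_k\right)\right]. \tag{9-8}$$

    Since all the errors are accounted for, (9-7) is an equality, not an approximation. With a value of $\boldsymbol{\beta}^0$ in hand, we could compute $y^0$ and $\mathbf{x}^0$ and then estimate the parameters of (9-7) by linear least squares. (Whether this estimator is consistent or not remains to be seen.)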
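    Example 9.4 Linearized Regression

    For the model in Example 9.3, the regressors in the linearized equation would be

    $$x_1^0 = \frac{\partial h(\cdot)}{\partial \beta_1^0} = 1,$$

    $$x_2^0 = \frac{\partial h(\cdot)}{\partial \beta_2^0} = e^{\beta_3^0 x},$$

    $$x_3^0 = \frac{\partial h(\cdot)}{\partial \beta_3^0} = \beta_2^0 x e^{\beta_3^0 x}.$$

    With a set of values of the parameters $\boldsymbol{\beta}^0$,

    $$y^0 = y - h\left(x, \beta_1^0, \beta_2^0, \beta_3^0\right) + \beta_1^0 x_1^0 + \beta_2^0 x_2^0 + \beta_3^0 x_3^0$$

    could be regressed on the three variables previously defined to estimate $\beta_1$, $\beta_2$, and $\beta_3$.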
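    A single pass of the linearized regression in Example 9.4 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the text's own computation: the data are simulated, the names `h`, `pseudoregressors`, and `linearized_step` are hypothetical, and NumPy's `lstsq` stands in for the linear least squares step applied to (9-7).

```python
import numpy as np

def h(x, beta):
    """Regression function of Example 9.3: beta1 + beta2 * exp(beta3 * x)."""
    return beta[0] + beta[1] * np.exp(beta[2] * x)

def pseudoregressors(x, beta0):
    """Columns x1^0 = 1, x2^0 = exp(beta3^0 x), x3^0 = beta2^0 x exp(beta3^0 x)."""
    expo = np.exp(beta0[2] * x)
    return np.column_stack([np.ones_like(x), expo, beta0[1] * x * expo])

def linearized_step(x, y, beta0):
    """Regress y0 = y - h0 + x0'beta0 on the pseudoregressors, as in (9-7)."""
    X0 = pseudoregressors(x, beta0)
    y0 = y - h(x, beta0) + X0 @ beta0
    coef, *_ = np.linalg.lstsq(X0, y0, rcond=None)
    return coef

# Simulated data; the "true" values below are assumptions made for illustration.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 50)
beta_true = np.array([1.0, 2.0, 0.5])
y = h(x, beta_true) + 0.05 * rng.standard_normal(x.size)

beta0 = np.array([0.9, 2.1, 0.48])      # starting value for the expansion point
beta1 = linearized_step(x, y, beta0)    # estimate from one linearized regression
```

    Feeding `beta1` back in as the new expansion point and repeating the regression gives the iteration described later in the chapter.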
    ² You should verify that for the linear regression model, these derivatives are the independent variables.

    9.2.4 LARGE SAMPLE PROPERTIES OF THE NONLINEAR LEAST SQUARES ESTIMATOR

    Numerous analytical results have been obtained for the nonlinear least squares estimator, such as consistency and asymptotic normality. We cannot be sure that nonlinear least squares is the most efficient estimator, except in the case of normally distributed disturbances. (This conclusion is the same one we drew for the linear model.) But, in the semiparametric setting of this chapter, we can ask whether this estimator is optimal in some sense given the information that we do have; the answer turns out to be yes. Some examples that follow will illustrate the points.

    It is necessary to make some assumptions about the regressors. The precise requirements are discussed in some detail in Judge et al. (1985), Amemiya (1985), and Davidson and MacKinnon (1993). In the linear regression model, to obtain our asymptotic results, we assume that the sample moment matrix $(1/n)\mathbf{X}'\mathbf{X}$ converges to a positive definite matrix $\mathbf{Q}$. By analogy, we impose the same condition on the derivatives of the regression function, which are called the pseudoregressors in the linearized model when they are computed at the true parameter values. Therefore, for the nonlinear regression model, the analog to (5-1) is

    $$\operatorname{plim} \frac{1}{n}\mathbf{X}^{0\prime}\mathbf{X}^0 = \operatorname{plim} \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial h(\mathbf{x}_i, \boldsymbol{\beta})}{\partial \boldsymbol{\beta}}\right)\left(\frac{\partial h(\mathbf{x}_i, \boldsymbol{\beta})}{\partial \boldsymbol{\beta}'}\right) = \mathbf{Q}^0, \tag{9-9}$$

    where $\mathbf{Q}^0$ is a positive definite matrix. To establish consistency of $\mathbf{b}$ in the linear model, we required $\operatorname{plim}(1/n)\mathbf{X}'\boldsymbol{\varepsilon} = \mathbf{0}$. We will use the counterpart to this for the pseudoregressors:

    $$\operatorname{plim} \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i^0 \varepsilon_i = \mathbf{0}.$$

    This is the orthogonality condition noted earlier in (5-4). In particular, note that orthogonality of the disturbances and the data is not the same condition. Finally, asymptotic normality can be established under general conditions if

    $$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} \mathbf{x}_i^0 \varepsilon_i \xrightarrow{d} N\left[\mathbf{0}, \sigma^2 \mathbf{Q}^0\right].$$

    With these in hand, the asymptotic properties of the nonlinear least squares estimator have been derived. They are, in fact, essentially those we have already seen for the linear model, except that in this case we place the derivatives of the linearized function evaluated at $\boldsymbol{\beta}$, $\mathbf{X}^0$, in the role of the regressors. [Amemiya (1985).]

    The nonlinear least squares criterion function is

    $$S(\mathbf{b}) = \frac{1}{2}\sum_{i=1}^{n}\left[y_i - h(\mathbf{x}_i, \mathbf{b})\right]^2 = \frac{1}{2}\sum_{i=1}^{n} e_i^2, \tag{9-10}$$

    where we have inserted what will be the solution value, $\mathbf{b}$. The values of the parameters that minimize (one half of) the sum of squared deviations are the nonlinear least squares estimators. The first-order conditions for a minimum are

    $$\mathbf{g}(\mathbf{b}) = -\sum_{i=1}^{n}\left[y_i - h(\mathbf{x}_i, \mathbf{b})\right]\frac{\partial h(\mathbf{x}_i, \mathbf{b})}{\partial \mathbf{b}} = \mathbf{0}. \tag{9-11}$$

    In the linear model of Chapter 2, this produces a set of linear equations, the normal equations (3-4). But in this more general case, (9-11) is a set of nonlinear equations that do not have an explicit solution. Note that $\sigma^2$ is not relevant to the solution [nor was it in (3-4)]. At the solution,

    $$\mathbf{g}(\mathbf{b}) = -\mathbf{X}^{0\prime}\mathbf{e} = \mathbf{0},$$

    which is the same as (3-12) for the linear model. Given our assumptions, we have the following general results:

    THEOREM 9.1 Consistency of the Nonlinear Least Squares Estimator
    If the following assumptions hold:

    a. The parameter space containing $\boldsymbol{\beta}$ is compact (has no gaps or nonconcave regions),
    b. For any vector $\boldsymbol{\beta}^0$ in that parameter space, $\operatorname{plim}(1/n)S(\boldsymbol{\beta}^0) = q(\boldsymbol{\beta}^0)$, a continuous and differentiable function,
    c. $q(\boldsymbol{\beta}^0)$ has a unique minimum at the true parameter vector, $\boldsymbol{\beta}$,

    then the nonlinear least squares estimator defined by (9-10) and (9-11) is consistent. We will sketch the proof, then consider why the theorem and the proof differ as they do from the apparently simpler counterpart for the linear model. The proof, notwithstanding the underlying subtleties of the assumptions, is straightforward. The estimator, say, $\mathbf{b}^0$, minimizes $(1/n)S(\boldsymbol{\beta}^0)$. If $(1/n)S(\boldsymbol{\beta}^0)$ is minimized for every $n$, then it is minimized by $\mathbf{b}^0$ as $n$ increases without bound. We also assumed that the minimizer of $q(\boldsymbol{\beta}^0)$ is uniquely $\boldsymbol{\beta}$. If the minimum value of $\operatorname{plim}(1/n)S(\boldsymbol{\beta}^0)$ equals the probability limit of the minimized value of the sum of squares, the theorem is proved. This equality is produced by the continuity in assumption b.

    In the linear model, consistency of the least squares estimator could be established based on $\operatorname{plim}(1/n)\mathbf{X}'\mathbf{X} = \mathbf{Q}$ and $\operatorname{plim}(1/n)\mathbf{X}'\boldsymbol{\varepsilon} = \mathbf{0}$. To follow that approach here, we would use the linearized model and take essentially the same result. The loose end in that argument would be that the linearized model is not the true model, and there remains an approximation. In order for this line of reasoning to be valid, it must also be either assumed or shown that $\operatorname{plim}(1/n)\mathbf{X}^{0\prime}\boldsymbol{\delta} = \mathbf{0}$, where $\delta_i = h(\mathbf{x}_i, \boldsymbol{\beta})$ minus the Taylor series approximation. An argument to this effect appears in Mittelhammer et al. (2000, pp. 190-191).


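    THEOREM 9.2 Asymptotic Normality of the Nonlinear Least Squares Estimator
    If the pseudoregressors defined in (9-3) are well behaved, then

    $$\mathbf{b} \overset{a}{\sim} N\left[\boldsymbol{\beta}, \frac{\sigma^2}{n}\left(\mathbf{Q}^0\right)^{-1}\right],$$

    where

    $$\mathbf{Q}^0 = \operatorname{plim} \frac{1}{n}\mathbf{X}^{0\prime}\mathbf{X}^0.$$

    The sample estimate of the asymptotic covariance matrix is

    $$\text{Est.Asy.Var}[\mathbf{b}] = \hat{\sigma}^2\left(\mathbf{X}^{0\prime}\mathbf{X}^0\right)^{-1}. \tag{9-12}$$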

    Asymptotic efficiency of the nonlinear least squares estimator is difficult to establish without a distributional assumption. There is an indirect approach that is one possibility. The assumption of the orthogonality of the pseudoregressors and the true disturbances implies that the nonlinear least squares estimator is a GMM estimator in this context. With the assumptions of homoscedasticity and nonautocorrelation, the optimal weighting matrix is the one that we used, which is to say that in the class of GMM estimators for this model, nonlinear least squares uses the optimal weighting matrix. As such, it is asymptotically efficient.

    The requirement that the matrix in (9-9) converges to a positive definite matrix implies that the columns of the regressor matrix $\mathbf{X}^0$ must be linearly independent. This identification condition is analogous to the requirement that the independent variables in the linear model be linearly independent. Nonlinear regression models usually involve several independent variables, and at first blush, it might seem sufficient to examine the data directly if one is concerned with multicollinearity. However, this situation is not the case. Example 9.5 gives an application.

    9.2.5 COMPUTING THE NONLINEAR LEAST SQUARES ESTIMATOR

    Minimizing the sum of squares is a standard problem in nonlinear optimization that can be solved by a number of methods. (See Section E.6.) The method of Gauss-Newton is often used. In the linearized regression model, if a value of $\boldsymbol{\beta}^0$ is available, then the linear regression model shown in (9-7) can be estimated by linear least squares. Once a parameter vector is obtained, it can play the role of a new $\boldsymbol{\beta}^0$, and the computation can be done again. The iteration can continue until the difference between successive parameter vectors is small enough to assume convergence. One of the main virtues of this method is that at the last iteration the estimate of $(\mathbf{Q}^0)^{-1}$ will, apart from the scale factor $\hat{\sigma}^2/n$, provide the correct estimate of the asymptotic covariance matrix for the parameter estimator.


    This iterative solution to the minimization problem is

    $$\begin{aligned}
    \mathbf{b}_{t+1} &= \left[\sum_{i=1}^{n} \mathbf{x}_i^0 \mathbf{x}_i^{0\prime}\right]^{-1}\left[\sum_{i=1}^{n} \mathbf{x}_i^0\left(y_i - h_i^0 + \mathbf{x}_i^{0\prime}\mathbf{b}_t\right)\right] \\
    &= \mathbf{b}_t + \left[\sum_{i=1}^{n} \mathbf{x}_i^0 \mathbf{x}_i^{0\prime}\right]^{-1}\left[\sum_{i=1}^{n} \mathbf{x}_i^0\left(y_i - h_i^0\right)\right] \\
    &= \mathbf{b}_t + \left(\mathbf{X}^{0\prime}\mathbf{X}^0\right)^{-1}\mathbf{X}^{0\prime}\mathbf{e}^0 \\
    &= \mathbf{b}_t + \boldsymbol{\Delta}_t,
    \end{aligned}$$

    where all terms on the right-hand side are evaluated at $\mathbf{b}_t$ and $\mathbf{e}^0$ is the vector of nonlinear least squares residuals. This algorithm has some intuitive appeal as well. For each iteration, we update the previous parameter estimates by regressing the nonlinear least squares residuals on the derivatives of the regression functions. The process will have converged (i.e., the update will be $\mathbf{0}$) when $\mathbf{X}^{0\prime}\mathbf{e}^0$ is close enough to $\mathbf{0}$. This derivative has a direct counterpart in the normal equations for the linear model, $\mathbf{X}'\mathbf{e} = \mathbf{0}$.

    As usual, when using a digital computer, we will not achieve exact convergence with $\mathbf{X}^{0\prime}\mathbf{e}^0$ exactly equal to zero. A useful, scale-free counterpart to the convergence criterion discussed in Section E.6.5 is $\delta = \mathbf{e}^{0\prime}\mathbf{X}^0\left(\mathbf{X}^{0\prime}\mathbf{X}^0\right)^{-1}\mathbf{X}^{0\prime}\mathbf{e}^0$. We note, finally, that iteration of the linearized regression, although a very effective algorithm for many problems, does not always work. As does Newton's method, this algorithm sometimes "jumps off" to a wildly errant second iterate, after which it may be impossible to compute the residuals for the next iteration. The choice of starting values for the iterations can be crucial. There is art as well as science in the computation of nonlinear least squares estimates. [See McCullough and Vinod (1999).] In the absence of information about starting values, a workable strategy is to try the Gauss-Newton iteration first. If it fails, go back to the initial starting values and try one of the more general algorithms, such as BFGS, treating minimization of the sum of squares as an otherwise ordinary optimization problem.

    A consistent estimator of $\sigma^2$ is based on the residuals:

    $$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\left[y_i - h(\mathbf{x}_i, \mathbf{b})\right]^2. \tag{9-13}$$

    A degrees of freedom correction, $1/(n - K)$, where $K$ is the number of elements in $\boldsymbol{\beta}$, is not strictly necessary here, because all results are asymptotic in any event. Davidson and MacKinnon (1993) argue that, on average, (9-13) will underestimate $\sigma^2$, and one should use the degrees of freedom correction. Most software in current use for this model does, but analysts will want to verify which is the case for the program they are using. With this in hand, the estimator of the asymptotic covariance matrix for the nonlinear least squares estimator is given in (9-12).

    Once the nonlinear least squares estimates are in hand, inference and hypothesis tests can proceed in the same fashion as prescribed in Chapter 7. A minor problem can arise in evaluating the fit of the regression in that the familiar measure,

    $$R^2 = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}, \tag{9-14}$$

    is no longer guaranteed to be in the range of 0 to 1. It does, however, provide a useful descriptive measure.
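    The computations of this section can be collected in a short routine. What follows is a minimal sketch, assuming NumPy and simulated data: it applies the Gauss-Newton update, stops on the scale-free criterion $\delta$, and then evaluates (9-13), (9-12), and (9-14). The function names, the tolerance, and the exponential test model (borrowed from Example 9.3) are illustrative choices, and the routine has none of the safeguards against errant steps that production software would include.

```python
import numpy as np

def gauss_newton(h, jac, x, y, b_start, tol=1e-12, max_iter=100):
    """Iterate the linearized regression until the scale-free criterion is small."""
    b = np.asarray(b_start, dtype=float)
    for _ in range(max_iter):
        e0 = y - h(x, b)                       # residuals at the current iterate
        X0 = jac(x, b)                         # pseudoregressors at the current iterate
        step = np.linalg.solve(X0.T @ X0, X0.T @ e0)
        delta = float(e0 @ X0 @ step)          # e0'X0 (X0'X0)^{-1} X0'e0
        b = b + step
        if delta < tol:
            break
    e = y - h(x, b)
    X0 = jac(x, b)
    n = y.shape[0]
    sigma2 = float(e @ e) / n                  # (9-13), no degrees of freedom correction
    cov = sigma2 * np.linalg.inv(X0.T @ X0)    # (9-12)
    r2 = 1.0 - float(e @ e) / float(((y - y.mean()) ** 2).sum())   # (9-14)
    return b, sigma2, cov, r2

def h_exp(x, b):
    """Regression function b1 + b2 * exp(b3 * x)."""
    return b[0] + b[1] * np.exp(b[2] * x)

def jac_exp(x, b):
    """Derivatives of h_exp with respect to b1, b2, b3."""
    expo = np.exp(b[2] * x)
    return np.column_stack([np.ones_like(x), expo, b[1] * x * expo])

# Simulated data, generated only to exercise the routine.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 2.0, 60)
y = h_exp(x, np.array([1.0, 2.0, 0.5])) + 0.05 * rng.standard_normal(x.size)
b_hat, sigma2_hat, cov_hat, r2 = gauss_newton(h_exp, jac_exp, x, y, [0.9, 2.1, 0.48])
std_errors = np.sqrt(np.diag(cov_hat))
```

    Replacing the divisor `n` with `n - K` in `sigma2` gives the degrees of freedom correction discussed above.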


    9.3 APPLICATIONS

    We will examine two applications. The first is a nonlinear extension of the consumption function examined in Example 2.1. The Box-Cox transformation presented in Section 9.3.2 is a device used to search for functional form in regression.

    9.3.1 A Nonlinear Consumption Function

    The linear consumption function analyzed at the beginning of Chapter 2 is a restricted version of the more general consumption function

    $$C = \alpha + \beta Y^{\gamma} + \varepsilon,$$

    in which $\gamma$ equals 1. With this restriction, the model is linear. If $\gamma$ is free to vary, however, then this version becomes a nonlinear regression. The linearized model is

    $$C - \left(\alpha^0 + \beta^0 Y^{\gamma^0}\right) + \left(\alpha^0 \cdot 1 + \beta^0 Y^{\gamma^0} + \gamma^0 \beta^0 Y^{\gamma^0} \ln Y\right) = \alpha + \beta\left(Y^{\gamma^0}\right) + \gamma\left(\beta^0 Y^{\gamma^0} \ln Y\right) + \varepsilon.$$

    The nonlinear least squares procedure reduces to iterated regression of

    $$C^0 = C + \gamma^0 \beta^0 Y^{\gamma^0} \ln Y \quad \text{on} \quad \mathbf{x}^0 = \left[\frac{\partial h(\cdot)}{\partial \alpha}, \frac{\partial h(\cdot)}{\partial \beta}, \frac{\partial h(\cdot)}{\partial \gamma}\right]' = \left[1,\ Y^{\gamma^0},\ \beta^0 Y^{\gamma^0} \ln Y\right]'.$$

    Quarterly data on consumption, real disposable income, and several other variables for 1950 to 2000 are listed in Appendix Table F5.1. We will use these to fit the nonlinear consumption function. This turns out to be a particularly straightforward estimation problem. Iterations are begun at the linear least squares estimates for $\alpha$ and $\beta$ and 1 for $\gamma$. As shown below, the solution is reached in 8 iterations, after which any further iteration is merely "fine tuning" the hidden digits (i.e., those that the analyst would not be reporting to their reader). ("Gradient" is the scale-free convergence measure noted above.)

    Begin NLSQ iterations. Linearized regression.

    Iteration   Sum of squares              Gradient
    1           1536321.88                  996103.930
    2           $.1847 \times 10^{12}$      $.1847 \times 10^{12}$
    3           20406917.6                  19902415.7
    4           581703.598                  77299.6342
    5           504403.969                  .752189847
    6           504403.216                  .526642396E-04
    7           504403.216                  .511324981E-07
    8           504403.216                  .606793426E-10

    The linear and nonlinear least squares regression results are shown in Table 9.1.

    Finding the starting values for a nonlinear procedure can be difficult. Simply trying a convenient set of values can be unproductive. Unfortunately, there are no good rules for starting values, except that they should be as close to the final values as possible (not particularly helpful). When it is possible, an initial consistent estimator of $\boldsymbol{\beta}$ will be a good starting value. In many cases, however, the only consistent estimator available is the one we are trying to compute by least squares. For better or worse, trial and error is the most frequently used procedure. For the present model, a natural set of values can be obtained because a simple linear model is a special case. Thus, we can start $\alpha$ and $\beta$ at the linear least squares values that would result in the special case of $\gamma = 1$ and use 1 for the starting value for $\gamma$.

    TABLE 9.1 Estimated Consumption Functions

                      Linear Model                      Nonlinear Model
    Parameter         Estimate         Standard Error   Estimate         Standard Error
    $\alpha$          -80.3547         14.3059          458.7990         22.5014
    $\beta$           0.9217           0.003872         0.10085          .01091
    $\gamma$          1.0000           (fixed)          1.24483          .01205
    $\mathbf{e}'\mathbf{e}$   1,536,321.881                     504,403.1725
    $\sigma$          87.20983                          50.0946
    $R^2$             .996448                           .998834
    Var[b]                                              0.000119037
    Var[c]                                              0.00014532
    Cov[b, c]                                           -0.000131491

    The procedures outlined earlier are used at the last iteration to obtain the asymptotic standard errors and an estimate of $\sigma^2$. (To make this comparable to $s^2$ in the linear model, the value includes the degrees of freedom correction.) The estimates for the linear model are shown in Table 9.1 as well. Eight iterations are required for convergence. The value of $\delta$ is shown at the right. Note that the coefficient vector takes a very errant step after the first iteration (the sum of squares becomes huge), but the iterations settle down after that and converge routinely.

    For hypothesis testing and confidence intervals, the usual procedures can be used, with the proviso that all results are only asymptotic. As such, for testing a restriction, the chi-squared statistic rather than the F ratio is likely to be more appropriate. For example, for testing the hypothesis that $\gamma$ is different from 1, an asymptotic t test, based on the standard normal distribution, is carried out, using

    $$z = \frac{1.24483 - 1}{0.01205} = 20.3178.$$

    This result is larger than the critical value of 1.96 for the 5 percent significance level, and we thus reject the linear model in favor of the nonlinear regression. We are also interested in the marginal propensity to consume. In this expanded model, $H_0: \gamma = 1$ is a test that the marginal propensity to consume is constant, not that it is 1. (That would be a joint test of both $\gamma = 1$ and $\beta = 1$.) In this model, the marginal propensity to consume is

    $$\text{MPC} = \frac{dC}{dY} = \beta\gamma Y^{\gamma - 1},$$

    which varies with $Y$. To test the hypothesis that this value is 1, we require a particular value of $Y$. Since it is the most recent value, we choose $\text{DPI}_{2000.4} = 6634.9$. At this value, the MPC is estimated as 1.08264. We estimate its standard error using the delta method,


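    with the square root of

    $$\begin{bmatrix} \partial\text{MPC}/\partial b & \partial\text{MPC}/\partial c \end{bmatrix}
    \begin{bmatrix} \text{Var}[b] & \text{Cov}[b, c] \\ \text{Cov}[b, c] & \text{Var}[c] \end{bmatrix}
    \begin{bmatrix} \partial\text{MPC}/\partial b \\ \partial\text{MPC}/\partial c \end{bmatrix}$$

    $$= \begin{bmatrix} cY^{c-1} & bY^{c-1}(1 + c \ln Y) \end{bmatrix}
    \begin{bmatrix} 0.00011904 & -0.000131491 \\ -0.000131491 & 0.00014532 \end{bmatrix}
    \begin{bmatrix} cY^{c-1} \\ bY^{c-1}(1 + c \ln Y) \end{bmatrix} = 0.00007469,$$

    which gives a standard error of 0.0086425. For testing the hypothesis that the MPC is equal to 1.0 in 2000.4, we would refer

    $$z = \frac{1.08264 - 1}{0.0086425} = 9.562$$

    to a standard normal table. This difference is certainly statistically significant, so we would reject the hypothesis.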
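    The delta method calculation can be reproduced from the reported estimates. The sketch below is an illustration rather than the text's own program: it takes $b$, $c$, and their estimated variances from Table 9.1 and $Y = 6634.9$ from the discussion above. The covariance is entered as negative, which is the sign required to reproduce the reported variance of 0.00007469; the variable and function names are hypothetical.

```python
import math

# Nonlinear least squares estimates and their estimated (co)variances, Table 9.1.
b, c = 0.10085, 1.24483
var_b, var_c, cov_bc = 0.000119037, 0.00014532, -0.000131491
Y = 6634.9   # disposable income in 2000.4

def mpc(b, c, Y):
    """Marginal propensity to consume, b * c * Y**(c - 1)."""
    return b * c * Y ** (c - 1.0)

def mpc_gradient(b, c, Y):
    """Derivatives of the MPC with respect to b and c."""
    scale = Y ** (c - 1.0)
    return c * scale, b * scale * (1.0 + c * math.log(Y))

g_b, g_c = mpc_gradient(b, c, Y)
# Quadratic form g' V g written out for the 2 x 2 case.
variance = g_b ** 2 * var_b + g_c ** 2 * var_c + 2.0 * g_b * g_c * cov_bc
std_error = math.sqrt(variance)
z = (mpc(b, c, Y) - 1.0) / std_error
```

    Because the two positive terms and the covariance term nearly cancel, the result is sensitive to rounding in the inputs; small differences from the printed figures in the last digits should be expected.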
    Example 9.5 Multicollinearity in Nonlinear Regression

    In the preceding example, there is no question of collinearity in the data matrix $\mathbf{X} = [\mathbf{i}, \mathbf{y}]$; the variation in $Y$ is obvious on inspection. But at the final parameter estimates, the $R^2$ in the regression is 0.999312 and the correlation between the two pseudoregressors $x_2^0 = Y^{\gamma}$ and $x_3^0 = \beta Y^{\gamma} \ln Y$ is 0.999752. The condition number for the normalized matrix of sums of squares and cross products is 208.306. (The condition number is computed by computing the square root of the ratio of the largest to smallest characteristic root of $\mathbf{D}^{-1}\mathbf{X}^{0\prime}\mathbf{X}^0\mathbf{D}^{-1}$, where $x_1^0 = 1$ and $\mathbf{D}$ is the diagonal matrix containing the square roots of $\mathbf{x}_k^{0\prime}\mathbf{x}_k^0$ on the diagonal.) Recall that 20 was the benchmark value for a problematic data set. By the standards discussed in Section 4.9.1, the collinearity problem in this data set is severe.

    9.3.2 THE BOX-COX TRANSFORMATION

    The Box-Cox transformation is a device for generalizing the linear model. The transformation is³

    $$x^{(\lambda)} = \frac{x^{\lambda} - 1}{\lambda}.$$

    In a regression model, the analysis can be done conditionally. For a given value of $\lambda$, the model

    $$y = \alpha + \sum_{k=1}^{K} \beta_k x_k^{(\lambda)} + \varepsilon \tag{9-15}$$

    is a linear regression that can be estimated by least squares.⁴ In principle, each regressor could be transformed by a different value of $\lambda$, but, in most applications, this level of generality becomes excessively cumbersome, and $\lambda$ is assumed to be the same for all the variables in the model.⁵ At the same time, it is also possible to transform $y$, say, by $y^{(\theta)}$. Transformation of the dependent variable, however, amounts to a specification of the whole model, not just the functional form. We will examine this case more closely in Section 17.6.2.

    ³ Box and Cox (1964). To be defined for all values of $\lambda$, $x$ must be strictly positive. See also Zarembka (1974).

    ⁴ In most applications, some of the regressors (for example, dummy variables) will not be transformed. For such a variable, say $\nu_k$, $\nu_k^{(\lambda)} = \nu_k$, and the relevant derivatives in (9-16) will be zero.

    ⁵ See, for example, Seaks and Layson (1983).
    Example 9.6 Flexible Cost Function

    Caves, Christensen, and Trethaway (1980) analyzed the costs of production for railroads providing freight and passenger service. Continuing a long line of literature on the costs of production in regulated industries, a translog cost function (see Section 14.3.2) would be a natural choice for modeling this multiple-output technology. Several of the firms in the study, however, produced no passenger service, which would preclude the use of the translog model. (This model would require the log of zero.) An alternative is the Box-Cox transformation, which is computable for zero output levels. A constraint must still be placed on $\lambda$ in their model, as $0^{(\lambda)}$ is defined only if $\lambda$ is strictly positive. A positive value of $\lambda$ is not assured. A question does arise in this context (and other similar ones) as to whether zero outputs should be treated the same as nonzero outputs or whether an output of zero represents a discrete corporate decision distinct from other variations in the output levels. In addition, as can be seen in (9-16), this solution is only partial. The zero values of the regressors preclude computation of appropriate standard errors.

    If $\lambda$ in (9-15) is taken to be an unknown parameter, then the regression becomes nonlinear in the parameters. Although no transformation will reduce it to linearity, nonlinear least squares is straightforward. In most instances, we can expect to find the least squares value of $\lambda$ between $-2$ and 2. Typically, then, $\lambda$ is estimated by scanning this range for the value that minimizes the sum of squares. When $\lambda$ equals zero, the transformation is, by L'Hôpital's rule,

    $$\lim_{\lambda \to 0} \frac{x^{\lambda} - 1}{\lambda} = \lim_{\lambda \to 0} \frac{d(x^{\lambda} - 1)/d\lambda}{1} = \lim_{\lambda \to 0} x^{\lambda} \ln x = \ln x.$$
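    The scanning procedure just described is simple to sketch. The code below is an illustration under stated assumptions, not the text's implementation: it assumes NumPy, simulated data, a common $\lambda$ for all regressors as in (9-15), and a grid of 81 points on $[-2, 2]$. The names `box_cox` and `scan_lambda` are hypothetical, and the $\lambda = 0$ case is handled by the logarithmic limit shown above.

```python
import numpy as np

def box_cox(x, lam, eps=1e-8):
    """Box-Cox transform (x**lam - 1)/lam, with the limit ln(x) at lam = 0."""
    x = np.asarray(x, dtype=float)
    if abs(lam) < eps:
        return np.log(x)
    return (x ** lam - 1.0) / lam

def scan_lambda(y, X, grid):
    """For each lam on the grid, fit (9-15) by least squares; keep the smallest SSR."""
    best = None
    for lam in grid:
        Z = np.column_stack([np.ones(len(y)), box_cox(X, lam)])
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ coef
        ssr = float(resid @ resid)
        if best is None or ssr < best[1]:
            best = (float(lam), ssr, coef)
    return best

# Simulated data with one strictly positive regressor and lambda = 0.5.
rng = np.random.default_rng(3)
x = rng.uniform(0.5, 5.0, size=100)
y = 1.0 + 2.0 * box_cox(x, 0.5) + 0.1 * rng.standard_normal(x.size)

grid = np.linspace(-2.0, 2.0, 81)
lam_hat, ssr_hat, coef_hat = scan_lambda(y, x, grid)
```

    The least squares standard errors from the final regression treat `lam_hat` as known, so they are not the appropriate asymptotic standard errors; that correction requires the derivative with respect to $\lambda$ as well.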

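    Once the optimal value of $\lambda$ is located, the least squares estimates, the mean squared residual, and this value of $\lambda$ constitute the nonlinear least squares (and, with normality of the disturbance, maximum likelihood) estimates of the parameters.

    After determining the optimal value of $\lambda$, it is sometimes treated as if it were a known value in the least squares results. But $\hat{\lambda}$ is an estimate of an unknown parameter. It is not hard to show that the least squares standard errors will always underestimate the correct asymptotic standard errors.⁶ To get the appropriate values, we need the derivatives of the right-hand side of (9-15) with respect to $\alpha$, $\boldsymbol{\beta}$, and $\lambda$. In the notation of Section 9.2.3, these are

    $$\begin{aligned}
    \frac{\partial h(\cdot)}{\partial \alpha} &= 1, \\
    \frac{\partial h(\cdot)}{\partial \beta_k} &= x_k^{(\lambda)}, \\
    \frac{\partial h(\cdot)}{\partial \lambda} &= \sum_{k=1}^{K} \beta_k \frac{\partial x_k^{(\lambda)}}{\partial \lambda} = \sum_{k=1}^{K} \beta_k \left[\frac{1}{\lambda}\left(x_k^{\lambda} \ln x_k - x_k^{(\lambda)}\right)\right].
    \end{aligned} \tag{9-16}$$

    ⁶ See Fomby, Hill, and Johnson (1984, pp. 426-431).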


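    We can now use (9-12) and (9-13) to estimate the asymptotic covariance matrix of the parameter estimates. Note that $\ln x_k$ appears in $\partial h(\cdot)/\partial \lambda$. If $x_k = 0$, then this matrix cannot be computed. This was the point noted at the end of Example 9.6.

    It is important to remember that the coefficients in a nonlinear model are not equal to the slopes (i.e., here the demand elasticities) with respect to the variables. For the Box-Cox model,⁷

    $$\ln Y = \alpha + \beta\left[\frac{X^{\lambda} - 1}{\lambda}\right] + \varepsilon,$$

    $$\frac{dE[\ln Y \mid X]}{d \ln X} = \beta X^{\lambda} = \eta. \tag{9-17}$$

    Standard errors for these estimates can be obtained using the delta method. The derivatives are $\partial \eta/\partial \beta = \eta/\beta$ and $\partial \eta/\partial \lambda = \eta \ln X$. Collecting terms, we obtain

    $$\text{Asy.Var}[\hat{\eta}] = (\eta/\beta)^2\left\{\text{Asy.Var}[\hat{\beta}] + (\beta \ln X)^2\,\text{Asy.Var}[\hat{\lambda}] + (2\beta \ln X)\,\text{Asy.Cov}[\hat{\beta}, \hat{\lambda}]\right\}.$$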
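    The variance expression for the elasticity can be coded directly. This is a small sketch with hypothetical names and made-up input values (they are not estimates from the text), intended only to show how the collected-terms formula is evaluated at a chosen $X$.

```python
import math

def elasticity_and_se(beta, lam, X, var_beta, var_lam, cov_beta_lam):
    """Elasticity eta = beta * X**lam from (9-17) and its delta-method standard error."""
    eta = beta * X ** lam
    log_x = math.log(X)
    variance = (eta / beta) ** 2 * (
        var_beta
        + (beta * log_x) ** 2 * var_lam
        + 2.0 * beta * log_x * cov_beta_lam
    )
    return eta, math.sqrt(variance)

# Made-up inputs for illustration only.
eta_hat, se_eta = elasticity_and_se(
    beta=0.8, lam=0.3, X=2.0, var_beta=0.01, var_lam=0.04, cov_beta_lam=-0.005
)
```

    The same number results from the quadratic form $\mathbf{g}'\mathbf{V}\mathbf{g}$ with $\mathbf{g} = [X^{\lambda},\ \beta X^{\lambda} \ln X]'$, which is a convenient check on the algebra.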

    9.4 HYPOTHESIS TESTING AND PARAMETRIC RESTRICTIONS

    In most cases, the sorts of hypotheses one would test in this context will involve fairly simple linear restrictions. The tests can be carried out using the usual formulas discussed in Chapter 7 and the asymptotic covariance matrix presented earlier. For more involved hypotheses and for nonlinear restrictions, the procedures are a bit less clear-cut. Three principal testing procedures were discussed in Section 6.4 and Appendix C: the Wald, likelihood ratio, and Lagrange multiplier tests. For the linear model, all three statistics are transformations of the standard F statistic (see Section 17.6.1), so the tests are essentially identical. In the nonlinear case, they are equivalent only asymptotically. We will work through the Wald and Lagrange multiplier tests for the general case and then apply them to the example of the previous section. Since we have not assumed normality of the disturbances (yet), we will postpone treatment of the likelihood ratio statistic until we revisit this model in Chapter 17.

    9.4.1 SIGNIFICANCE TESTS FOR RESTRICTIONS: F AND WALD STATISTICS

    The hypothesis to be tested is

    $$H_0: \mathbf{r}(\boldsymbol{\beta}) = \mathbf{q}, \tag{9-18}$$

    where $\mathbf{r}(\boldsymbol{\beta})$ is a column vector of $J$ continuous functions of the elements of $\boldsymbol{\beta}$. These restrictions may be linear or nonlinear. It is necessary, however, that they be overidentifying restrictions. Thus, in formal terms, if the original parameter vector has $K$ free elements, then the hypothesis $\mathbf{r}(\boldsymbol{\beta}) - \mathbf{q}$ must impose at least one functional relationship on the parameters. If there is more than one restriction, then they must be functionally independent. These two conditions imply that the $J \times K$ matrix

    $$\mathbf{R}(\boldsymbol{\beta}) = \frac{\partial \mathbf{r}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}'} \tag{9-19}$$

    ⁷ We have used the result $d \ln Y/d \ln X = X\, d \ln Y/dX$.

    must have full row rank and that $J$, the number of restrictions, must be strictly less than $K$. (This situation is analogous to the linear model, in which $\mathbf{R}(\boldsymbol{\beta})$ would be the matrix of coefficients in the restrictions.)

    Let $\mathbf{b}$ be the unrestricted, nonlinear least squares estimator, and let $\mathbf{b}_*$ be the estimator obtained when the constraints of the hypothesis are imposed.⁸ Which test statistic one uses depends on how difficult the computations are. Unlike the linear model, the various testing procedures vary in complexity. For instance, in our example, the Lagrange multiplier is by far the simplest to compute. Of the four methods we will consider, only this test does not require us to compute a nonlinear regression.

    The nonlinear analog to the familiar F statistic based on the fit of the regression (i.e., the sum of squared residuals) would be

    $$F[J, n - K] = \frac{\left[S(\mathbf{b}_*) - S(\mathbf{b})\right]/J}{S(\mathbf{b})/(n - K)}. \tag{9-20}$$

    This equation has the appearance of our earlier F ratio. In the nonlinear setting, however, neither the numerator nor the denominator has exactly the necessary chi-squared distribution, so the F distribution is only approximate. Note that this F statistic requires that both the restricted and unrestricted models be estimated.

    The Wald test is based on the distance between $r(b)$ and $q$. If the unrestricted estimates fail to satisfy the restrictions, then doubt is cast on the validity of the restrictions. The statistic is

    $$W = [r(b) - q]'\{\text{Est.Asy.Var}[r(b) - q]\}^{-1}[r(b) - q] = [r(b) - q]'\{R(b)\hat{V}R'(b)\}^{-1}[r(b) - q], \tag{9-21}$$

    where $\hat{V} = \text{Est.Asy.Var}[b]$ and $R(b)$ is evaluated at $b$, the estimate of $\beta$. Under the null hypothesis, this statistic has a limiting chi-squared distribution with $J$ degrees of freedom. If the restrictions are correct, the Wald statistic and $J$ times the F statistic are asymptotically equivalent. The Wald statistic can be based on the estimated covariance matrix obtained earlier using the unrestricted estimates, which may provide a large savings in computing effort if the restrictions are nonlinear. It should be noted that the small-sample behavior of $W$ can be erratic, and the more conservative F statistic may be preferable if the sample is not large.

    The caveat about Wald statistics that applied in the linear case applies here as well. Because it is a pure significance test that does not involve the alternative hypothesis, the Wald statistic is not invariant to how the hypothesis is framed. In cases in which there is more than one equivalent way to specify $r(\beta) = q$, $W$ can give different answers depending on which is chosen.

    [8] This computational problem may be extremely difficult in its own right, especially if the constraints are nonlinear. We assume that the estimator has been obtained by whatever means are necessary.
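As a rough illustration of (9-21), the following Python sketch computes a Wald statistic for possibly nonlinear restrictions $r(b) = q$. The function names `numerical_jacobian` and `wald_statistic`, the central-difference approximation to $R(b)$, and all numbers are assumptions made for this example; they do not come from the text.

```python
import numpy as np

def numerical_jacobian(r, b, h=1e-6):
    """Central-difference approximation to R(b) = dr(b)/db' (J x K)."""
    b = np.asarray(b, dtype=float)
    r0 = np.atleast_1d(r(b))
    jac = np.empty((r0.size, b.size))
    for k in range(b.size):
        step = np.zeros_like(b)
        step[k] = h
        jac[:, k] = (np.atleast_1d(r(b + step)) - np.atleast_1d(r(b - step))) / (2 * h)
    return jac

def wald_statistic(r, q, b, V):
    """W = [r(b)-q]' {R(b) V R(b)'}^{-1} [r(b)-q], as in (9-21)."""
    diff = np.atleast_1d(r(b)) - np.atleast_1d(q)
    R = numerical_jacobian(r, b)
    return float(diff @ np.linalg.solve(R @ V @ R.T, diff))

# Hypothetical estimates and estimated covariance matrix (illustration only).
b = np.array([0.5, 1.2])
V = np.array([[0.04, 0.01], [0.01, 0.09]])

# One linear restriction, b[1] = 1: W reduces to the squared t ratio.
w_linear = wald_statistic(lambda b: b[1], 1.0, b, V)

# Two algebraically equivalent ways of writing b[0] * b[1] = 1.
w_product = wald_statistic(lambda b: b[0] * b[1], 1.0, b, V)
w_ratio = wald_statistic(lambda b: b[0] - 1.0 / b[1], 0.0, b, V)
```

In this made-up case `w_product` and `w_ratio` differ even though the two restrictions are equivalent, which is the lack of invariance described above.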
    9.4.2 TESTS BASED ON THE LM STATISTIC

    The Lagrange multiplier test is based on the decrease in the sum of squared residuals that would result if the restrictions in the restricted model were released. The formalities of the test are given in Sections 17.5.3 and 17.6.1. For the nonlinear regression model, the test has a particularly appealing form.[9] Let $e_*$ be the vector of residuals $y_i - h(x_i, b_*)$ computed using the restricted estimates. Recall that we defined $X^0$ as an $n \times K$ matrix of derivatives computed at a particular parameter vector in (9-9). Let $X^0_*$ be this matrix computed at the restricted estimates. Then the Lagrange multiplier statistic for the nonlinear regression model is

    $$LM = \frac{e_*' X^0_* [X^{0\prime}_* X^0_*]^{-1} X^{0\prime}_* e_*}{e_*' e_* / n}. \tag{9-22}$$

    Under $H_0$, this statistic has a limiting chi-squared distribution with $J$ degrees of freedom. What is especially appealing about this approach is that it requires only the restricted estimates. This method may provide some savings in computing effort if, as in our example, the restrictions result in a linear model. Note, also, that the Lagrange multiplier statistic is $n$ times the uncentered $R^2$ in the regression of $e_*$ on $X^0_*$. Many Lagrange multiplier statistics are computed in this fashion.
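The equivalence between (9-22) and $n$ times the uncentered $R^2$ can be checked numerically. The sketch below is a minimal illustration under assumed inputs: the function names are hypothetical, and the residuals and pseudoregressors are simulated stand-ins rather than output from any model in the text.

```python
import numpy as np

def lm_statistic(e_star, X0_star):
    """LM = e*'X0*(X0*'X0*)^{-1}X0*'e* / (e*'e*/n), as in (9-22)."""
    n = e_star.size
    coef, *_ = np.linalg.lstsq(X0_star, e_star, rcond=None)
    fitted = X0_star @ coef                  # projection of e* on X0*
    return float(fitted @ e_star / (e_star @ e_star / n))

def n_times_uncentered_r2(e_star, X0_star):
    """n times the uncentered R^2 from regressing e* on X0*."""
    n = e_star.size
    coef, *_ = np.linalg.lstsq(X0_star, e_star, rcond=None)
    resid = e_star - X0_star @ coef
    return float(n * (1.0 - resid @ resid / (e_star @ e_star)))

# Simulated stand-ins for restricted residuals and pseudoregressors.
rng = np.random.default_rng(0)
X0_star = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
e_star = rng.normal(size=50)
lm = lm_statistic(e_star, X0_star)
```

The two functions return the same number up to rounding, so in practice the statistic can be read off an auxiliary regression of the restricted residuals on the pseudoregressors.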
    Example 9.7 Hypotheses Tests in a Nonlinear Regression Model

    We test the hypothesis $H_0: \gamma = 1$ in the consumption function of Section 9.3.1.

    F statistic. The F statistic is

    $$F[1, 204-3] = \frac{(1{,}536{,}321.881 - 504{,}403.57)/1}{504{,}403.57/(204-3)} = 411.29.$$

    The critical value from the tables is 4.18, so the hypothesis is rejected.

    Wald statistic. For our example, the Wald statistic is based on the distance of $\hat{\gamma}$ from 1 and is simply the square of the asymptotic t ratio we computed at the end of the example:

    $$W = \frac{(1.244827 - 1)^2}{0.012052^2} = 412.805.$$

    The critical value from the chi-squared table is 3.84.

    Lagrange multiplier. For our example, the elements in $x_i^*$ are

    $$x_i^* = [1, \; Y^{\gamma}, \; \beta Y^{\gamma} \ln Y].$$

    To compute this at the restricted estimates, we use the ordinary least squares estimates for $\alpha$ and $\beta$ and 1 for $\gamma$, so that

    $$x_i^* = [1, \; Y, \; \beta Y \ln Y].$$

    The residuals are the least squares residuals computed from the linear regression. Inserting the values given earlier, we have

    $$LM = \frac{996{,}103.9}{1{,}536{,}321.881/204} = 132.267.$$

    As expected, this statistic is also larger than the critical value from the chi-squared table.

    [9] This test is derived in Judge et al. (1985). A lengthy discussion appears in Mittelhammer et al. (2000).
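The arithmetic in Example 9.7 can be reproduced directly from the printed inputs. The short Python sketch below does so; the variable names are assumptions introduced here. Evaluated this way, the F and Wald values come out close to, but not exactly equal to, the reported 411.29 and 412.805, presumably because the printed inputs are rounded.

```python
n, K, J = 204, 3, 1
ss_restricted = 1_536_321.881    # S(b*): sum of squares from the linear (restricted) model
ss_unrestricted = 504_403.57     # S(b): sum of squares from the nonlinear model

# F statistic from (9-20).
F = ((ss_restricted - ss_unrestricted) / J) / (ss_unrestricted / (n - K))

# Wald statistic: squared asymptotic t ratio for gamma = 1.
W = ((1.244827 - 1.0) / 0.012052) ** 2

# LM statistic from (9-22), using the explained sum of squares given in the example.
LM = 996_103.9 / (ss_restricted / n)

print(round(F, 2), round(W, 2), round(LM, 3))
```

All three statistics are far above their critical values, so the conclusion of the example does not depend on the rounding.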
    9.4.3 A SPECIFICATION TEST FOR NONLINEAR REGRESSIONS: THE PE TEST

    MacKinnon, White, and Davidson (1983) have extended the J test discussed in Section 8.3.3 to nonlinear regressions. One result of this analysis is a simple test for linearity versus loglinearity. The specific hypothesis to be tested is

    $$H_0: y = h^0(x, \beta) + \varepsilon_0$$

    versus

    $$H_1: g(y) = h^1(z, \gamma) + \varepsilon_1,$$

    where $x$ and $z$ are regressor vectors and $\beta$ and $\gamma$ are the parameters. As the authors note, using $y$ instead of, say, $j(y)$ in the first function is nothing more than an implicit definition of the units of measurement of the dependent variable.

    An intermediate case is useful. If we assume that $g(y)$ is equal to $y$ but we allow $h^0(\cdot)$ and $h^1(\cdot)$ to be nonlinear, then the necessary modification of the J test is straightforward, albeit perhaps a bit more difficult to carry out. For this case, we form the compound model

    $$y = (1 - \alpha) h^0(x, \beta) + \alpha h^1(z, \gamma) + \varepsilon = h^0(x, \beta) + \alpha[h^1(z, \gamma) - h^0(x, \beta)] + \varepsilon. \tag{9-23}$$

    Presumably, both $\beta$ and $\gamma$ could be estimated in isolation by nonlinear least squares. Suppose that a nonlinear least squares estimate of $\gamma$ has been obtained. One approach is to insert this estimate in (9-23) and then estimate $\beta$ and $\alpha$ by nonlinear least squares. The J test amounts to testing the hypothesis that $\alpha$ equals zero. Of course, the model is symmetric in $h^0(\cdot)$ and $h^1(\cdot)$, so their roles could be reversed. The same conclusions drawn earlier would apply here.

    Davidson and MacKinnon (1981) propose what may be a simpler alternative. Given an estimate of $\beta$, say $\hat{\beta}$, approximate $h^0(x, \beta)$ with a linear Taylor series at this point. The result is

    $$h^0(x, \beta) \approx h^0(x, \hat{\beta}) + \left[\frac{\partial h^0(\cdot)}{\partial \hat{\beta}'}\right](\beta - \hat{\beta}) = \hat{h}^0 + \hat{H}^0\beta - \hat{H}^0\hat{\beta}. \tag{9-24}$$

    Using this device, they replace (9-23) with

    $$y - \hat{h}^0 = \hat{H}^0 b + \alpha[h^1(z, \hat{\gamma}) - h^0(x, \hat{\beta})] + e,$$

    in which $b$ and $\alpha$ can be estimated by linear least squares. As before, the J test amounts to testing the significance of $\hat{\alpha}$. If it is found that $\hat{\alpha}$ is significantly different from zero, then $H_0$ is rejected. For the authors' asymptotic results to hold, any consistent estimator of $\beta$ will suffice for $\hat{\beta}$; the nonlinear least squares estimator that they suggest seems a natural choice.[10]

    Now we can generalize the test to allow a nonlinear function, $g(y)$, in $H_1$. Davidson and MacKinnon require $g(y)$ to be monotonic, continuous, and continuously differentiable and not to introduce any new parameters. (This requirement excludes the Box-Cox model, which is considered in Section 9.3.2.) The compound model that forms the basis of the test is

    $$(1 - \alpha)[y - h^0(x, \beta)] + \alpha[g(y) - h^1(z, \gamma)] = \varepsilon. \tag{9-25}$$

    Again, there are two approaches. As before, if $\hat{\gamma}$ is an estimate of $\gamma$, then $\beta$ and $\alpha$ can be estimated by maximum likelihood conditional on this estimate.[11] This method promises to be extremely messy, and an alternative is proposed. Rewrite (9-25) as

    $$y - h^0(x, \beta) = \alpha[h^1(z, \gamma) - g(y)] + \alpha[y - h^0(x, \beta)] + \varepsilon.$$

    Now use the same linear Taylor series expansion for $h^0(x, \beta)$ on the left-hand side and replace both $y$ and $h^0(x, \beta)$ with $\hat{h}^0$ on the right. The resulting model is

    $$y - \hat{h}^0 = \hat{H}^0 b + \alpha[\hat{h}^1 - g(\hat{h}^0)] + e. \tag{9-26}$$

    As before, with an estimate of $\beta$, this model can be estimated by least squares. This modified form of the J test is labeled the PE test. As the authors discuss, it is probably not as powerful as any of the Wald or Lagrange multiplier tests that we have considered. In their experience, however, it has sufficient power for applied research and is clearly simple to carry out.

    The PE test can be used to test a linear specification against a loglinear model. For this test, both $h^0(\cdot)$ and $h^1(\cdot)$ are linear, whereas $g(y) = \ln y$. Let the two competing models be denoted

    $$H_0: y = x'\beta + \varepsilon$$

    and

    $$H_1: \ln y = \ln(x)'\gamma + \varepsilon.$$

    [We stretch the usual notational conventions by using $\ln(x)$ for $(\ln x_1, \ldots, \ln x_k)$.] Now let $b$ and $c$ be the two linear least squares estimates of the parameter vectors. The PE test for $H_1$ as an alternative to $H_0$ is carried out by testing the significance of the coefficient $\hat{\alpha}$ in the model

    $$y = x'\beta + \alpha[\widehat{\ln y} - \ln(x'b)] + \varepsilon. \tag{9-27}$$

    The second term is the difference between predictions of $\ln y$ obtained directly from the loglinear model and obtained as the log of the prediction from the linear model. We can also reverse the roles of the two formulas and test $H_0$ as the alternative. The compound regression is

    $$\ln y = \ln(x)'\gamma + \alpha\left(\hat{y} - e^{\ln(x)'c}\right) + \varepsilon. \tag{9-28}$$

    The test of linearity vs. loglinearity has been the subject of a number of studies. Godfrey and Wickens (1982) discuss several approaches.

    [10] This procedure assumes that $H_0$ is correct, of course.

    [11] Least squares will be inappropriate because of the transformation of $y$, which will translate to a Jacobian term in the log-likelihood. See the later discussion of the Box-Cox model.

    TABLE 9.2 Estimated Money Demand Equations (standard errors in parentheses)

                   a           b_r         c_Y          R^2        s
    Linear         228.714     23.849      0.1770       0.95548    76.277
                   (13.891)    (2.044)     (0.00278)
      PE test for the linear model:     alpha-hat = 121.496 (46.353), t = 2.621
    Loglinear      8.9473      0.2590      1.8205       0.96647    0.14825
                   (0.2181)    (0.0236)    (0.0289)
      PE test for the loglinear model:  alpha-hat = 0.0003786 (0.0001969), t = 1.925
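The two compound regressions (9-27) and (9-28) are straightforward to set up once both models have been fit by least squares. The following Python sketch is one way to do it under stated assumptions: the helper names `ols` and `pe_test` are hypothetical, there is a single regressor plus a constant, and the data are simulated from a loglinear model purely for illustration.

```python
import numpy as np

def ols(X, y):
    """Return coefficients, standard errors, and fitted values from OLS."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    s2 = resid @ resid / (X.shape[0] - X.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    return coef, se, X @ coef

def pe_test(y, x):
    """t ratios on alpha in (9-27) and (9-28); y and x must be positive."""
    n = y.size
    X = np.column_stack([np.ones(n), x])             # linear model regressors
    LX = np.column_stack([np.ones(n), np.log(x)])    # loglinear model regressors
    _, _, y_hat = ols(X, y)                          # x'b
    _, _, lny_hat = ols(LX, np.log(y))               # ln(x)'c
    if np.any(y_hat <= 0):
        raise ValueError("linear fitted values must be positive to take logs")
    # (9-27): add [lny_hat - ln(x'b)] to the linear regression.
    coef, se, _ = ols(np.column_stack([X, lny_hat - np.log(y_hat)]), y)
    t_linear = coef[-1] / se[-1]
    # (9-28): add [y_hat - exp(ln(x)'c)] to the loglinear regression.
    coef, se, _ = ols(np.column_stack([LX, y_hat - np.exp(lny_hat)]), np.log(y))
    t_loglinear = coef[-1] / se[-1]
    return t_linear, t_loglinear

# Simulated data from a loglinear model (an assumption for illustration).
rng = np.random.default_rng(1)
x = rng.uniform(1.0, 10.0, size=200)
y = np.exp(1.0 + 0.5 * np.log(x) + 0.1 * rng.normal(size=200))
t_linear, t_loglinear = pe_test(y, x)
```

A large `t_linear` in absolute value is evidence against the linear model; a large `t_loglinear` is evidence against the loglinear model. Each is referred to the standard normal table.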
    Example 9.8 Money Demand

    A large number of studies have estimated money demand equations, some linear and some log-linear.[12] Quarterly data from 1950 to 2000 for estimation of a money demand equation are given in Appendix Table F5.1. The interest rate is the quarterly average of the monthly average 90-day T-bill rate. The money stock is M1. Real GDP is seasonally adjusted and stated in 1996 constant dollars. Results of the PE test of the linear versus the loglinear model are shown in Table 9.2.

    Regressions of $M$ on a constant, $r$, and $Y$, and of $\ln M$ on a constant, $\ln r$, and $\ln Y$, produce the results given in Table 9.2 (standard errors are given in parentheses). Both models appear to fit quite well,[13] and the pattern of significance of the coefficients is the same in both equations. After computing fitted values from the two equations, the estimates of $\alpha$ from the two models are as shown in Table 9.2. Referring these to a standard normal table, we reject the linear model in favor of the loglinear model.

    [12] A comprehensive survey appears in Goldfeld (1973).

    [13] The interest elasticity is in line with the received results. The income elasticity is quite a bit larger.

    9.5 ALTERNATIVE ESTIMATORS FOR NONLINEAR REGRESSION MODELS

    Section 9.2 discusses the "standard" case in which the only complication to the classical regression model of Chapter 2 is that the conditional mean function in $y_i = h(x_i, \beta) + \varepsilon_i$ is a nonlinear function of $\beta$. This fact mandates an alternative estimator, nonlinear least squares, and some new interpretation of the "regressors" in the model. In this section, we will consider two extensions of these results. First, as in the linear case, there can be situations in which the assumption that $\text{Cov}[x_i, \varepsilon_i] = 0$ is not reasonable. These situations will, as before, require an instrumental variables treatment, which we consider in Section 9.5.1. Second, there will be models in which it is convenient to estimate the parameters in two steps, estimating one subset at the first step and then using these estimates in a second step at which the remaining parameters are estimated. We will have to modify our asymptotic results somewhat to accommodate this estimation strategy. The two-step estimator is discussed in Section 9.5.2.
    9.5.1 NONLINEAR INSTRUMENTAL VARIABLES ESTIMATION

    In Section 5.4, we extended the linear regression model to allow for the possibility that the regressors might be correlated with the disturbances. The same problem can arise in nonlinear models. The consumption function estimated in Section 9.3.1 is almost surely a case in point, and we reestimated it using the instrumental variables technique for linear models in Example 5.3. In this section, we will extend the method of instrumental variables to nonlinear regression models.

    In the nonlinear model,

    $$y_i = h(x_i, \beta) + \varepsilon_i,$$

    the covariates $x_i$ may be correlated with the disturbances. We would expect this effect to be transmitted to the pseudoregressors, $x_i^0 = \partial h(x_i, \beta)/\partial \beta$. If so, then the results that we derived for the linearized regression would no longer hold. Suppose that there is a set of variables $[z_1, \ldots, z_L]$ such that

    $$\text{plim}(1/n) Z'\varepsilon = 0 \tag{9-29}$$

    and

    $$\text{plim}(1/n) Z'X^0 = Q^0_{zx} \neq 0,$$

    where $X^0$ is the matrix of pseudoregressors in the linearized regression, evaluated at the true parameter values. If the analysis that we did for the linear model in Section 5.4 can be applied to this set of variables, then we will be able to construct a consistent estimator for $\beta$ using the instrumental variables. As a first step, we will attempt to replicate the approach that we used for the linear model. The linearized regression model is given in (9-7),

    $$y = h(X, \beta) + \varepsilon \approx h^0 + X^0(\beta - \beta^0) + \varepsilon$$

    or

    $$y^0 \approx X^0\beta + \varepsilon,$$

    where

    $$y^0 = y - h^0 + X^0\beta^0.$$

    For the moment, we neglect the approximation error in linearizing the model. In (9-29), we have assumed that

    $$\text{plim}(1/n) Z'y^0 = \text{plim}(1/n) Z'X^0\beta. \tag{9-30}$$

    Suppose, as we did before, that there are the same number of instrumental variables as there are parameters, that is, columns in $X^0$. (Note: This number need not be the number of variables. See our preceding example.) Then the estimator used before is suggested:

    $$b_{IV} = (Z'X^0)^{-1} Z'y^0. \tag{9-31}$$


    The logic is sound, but there is a problem with this estimator. The unknown parameter vector $\beta$ appears on both sides of (9-30). We might consider the approach we used for our first solution to the nonlinear regression model. That is, with some initial estimator in hand, iterate back and forth between the instrumental variables regression and recomputing the pseudoregressors until the process converges to the fixed point that we seek. Once again, the logic is sound, and in principle, this method does produce the estimator we seek.

    If we add to our preceding assumptions

    $$\frac{1}{\sqrt{n}} Z'\varepsilon \xrightarrow{d} N[0, \sigma^2 Q_{zz}],$$

    then we will be able to use the same form of the asymptotic distribution for this estimator that we did for the linear case. Before doing so, we must fill in some gaps in the preceding. First, despite its intuitive appeal, the suggested procedure for finding the estimator is very unlikely to be a good algorithm for locating the estimates. Second, we do not wish to limit ourselves to the case in which we have the same number of instrumental variables as parameters. So, we will consider the problem in general terms. The estimation criterion for nonlinear instrumental variables is a quadratic form,

    $$\min_{\beta} S(\beta) = \tfrac{1}{2}\{[y - h(X, \beta)]'Z\}(Z'Z)^{-1}\{Z'[y - h(X, \beta)]\} = \tfrac{1}{2}\varepsilon(\beta)'Z(Z'Z)^{-1}Z'\varepsilon(\beta).$$

    The first-order conditions for minimization of this weighted sum of squares are

    $$\frac{\partial S(\beta)}{\partial \beta} = -X^{0\prime}Z(Z'Z)^{-1}Z'\varepsilon(\beta) = 0.$$

    This result is the same one we had for the linear model with $X^0$ in the role of $X$. You should check that when $\varepsilon(\beta) = y - X\beta$, our results for the linear model in Section 9.5.1 are replicated exactly. This problem, however, is highly nonlinear in most cases, and the repeated least squares approach is unlikely to be effective. But it is a straightforward minimization problem in the frameworks of Appendix E, and instead, we can just treat estimation here as a problem in nonlinear optimization.

    We have approached the formulation of this instrumental variables estimator more or less strategically. However, there is a more structured approach. The orthogonality condition

    $$\text{plim}(1/n) Z'\varepsilon = 0$$

    defines a GMM estimator. With the homoscedasticity and nonautocorrelation assumption, the resultant minimum distance estimator produces precisely the criterion function suggested above. We will revisit this estimator in this context in Chapter 18.

    With well-behaved pseudoregressors and instrumental variables, we have the general result for the nonlinear instrumental variables estimator; this result is discussed at length in Davidson and MacKinnon (1993).
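Treating the criterion $S(\beta)$ as an ordinary minimization problem can be sketched in a few lines of Python. Everything specific below is an assumption made for illustration: the exponential mean function, the function names, the simulated data, the starting values, and the choice of Gauss-Newton steps with step halving as the optimizer. It is not the algorithm used in the text.

```python
import numpy as np

def h(x, beta):
    """Illustrative conditional mean h(x, beta) = exp(b0 + b1*x)."""
    return np.exp(beta[0] + beta[1] * x)

def pseudoregressors(x, beta):
    """X0: derivatives of h with respect to beta (n x 2)."""
    lam = h(x, beta)
    return np.column_stack([lam, lam * x])

def criterion(y, x, Z, beta):
    """S(beta) = 0.5 * e'Z (Z'Z)^{-1} Z'e with e = y - h(x, beta)."""
    Ze = Z.T @ (y - h(x, beta))
    return 0.5 * Ze @ np.linalg.solve(Z.T @ Z, Ze)

def nonlinear_iv(y, x, Z, beta, max_iter=200, tol=1e-10):
    """Gauss-Newton iterations on S(beta) with step halving."""
    ZZ = Z.T @ Z
    for _ in range(max_iter):
        e = y - h(x, beta)
        ZX = Z.T @ pseudoregressors(x, beta)
        grad = ZX.T @ np.linalg.solve(ZZ, Z.T @ e)      # X0'Z(Z'Z)^{-1}Z'e
        step = np.linalg.solve(ZX.T @ np.linalg.solve(ZZ, ZX), grad)
        s0, scale = criterion(y, x, Z, beta), 1.0
        while criterion(y, x, Z, beta + scale * step) > s0 and scale > 1e-8:
            scale /= 2.0                                # shorten a step that raises S
        beta = beta + scale * step
        if np.max(np.abs(scale * step)) < tol:
            break
    return beta

# Simulated data in which x is correlated with the disturbance.
rng = np.random.default_rng(2)
n = 500
z1, z2, u = rng.uniform(size=n), rng.uniform(size=n), rng.normal(size=n)
x = z1 + 0.5 * z2 + 0.3 * u
eps = 0.1 * (u + rng.normal(size=n))
y = h(x, np.array([0.5, 0.8])) + eps
Z = np.column_stack([np.ones(n), z1, z2])   # three instruments, two parameters
b_iv = nonlinear_iv(y, x, Z, np.array([0.3, 0.6]))
```

At the returned estimate the first-order condition $X^{0\prime}Z(Z'Z)^{-1}Z'\varepsilon(\beta) = 0$ holds to numerical accuracy, and the number of instruments is allowed to exceed the number of parameters.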


    THEOREM 9.3 Asymptotic Distribution of the Nonlinear Instrumental Variables Estimator

    With well-behaved instrumental variables and pseudoregressors,

    $$b_{IV} \overset{a}{\sim} N\left[\beta, \; \sigma^2\left(Q^0_{xz}(Q_{zz})^{-1}Q^0_{zx}\right)^{-1}\right].$$

    We estimate the asymptotic covariance matrix with

    $$\text{Est.Asy.Var}[b_{IV}] = \hat{\sigma}^2[\hat{X}^{0\prime}Z(Z'Z)^{-1}Z'\hat{X}^0]^{-1},$$

    where $\hat{X}^0$ is $X^0$ computed using $b_{IV}$.

    As a final observation, note that the "two-stage least squares" interpretation of the instrumental variables estimator for the linear model still applies here, with respect to the IV estimator. That is, at the final estimates, the first-order conditions (normal equations) imply that

    $$X^{0\prime}Z(Z'Z)^{-1}Z'y = X^{0\prime}Z(Z'Z)^{-1}Z'X^0\beta,$$

    which says that the estimates satisfy the normal equations for a linear regression of $y$ (not $y^0$) on the predictions obtained by regressing the columns of $X^0$ on $Z$. The interpretation is not quite the same here, because to compute the predictions of $X^0$, we must have the estimate of $\beta$ in hand. Thus, this two-stage least squares approach does not show how to compute $b_{IV}$; it shows a characteristic of $b_{IV}$.
    Example 9.9 Instrumental Variables Estimates of the Consumption Function

    The consumption function in Section 9.3.1 was estimated by nonlinear least squares without accounting for the nature of the data that would certainly induce correlation between $X^0$ and $\varepsilon$. As we did earlier, we will reestimate this model using the technique of instrumental variables. For this application, we will use the one-period lagged value of consumption and one- and two-period lagged values of income as instrumental variables. Table 9.3 reports the nonlinear least squares and instrumental variables estimates. Since we are using two periods of lagged values, two observations are lost. Thus, the least squares estimates are not the same as those reported earlier.

    The instrumental variable estimates differ considerably from the least squares estimates. The differences can be deceiving, however. Recall that the MPC in the model is $\beta\gamma Y^{\gamma - 1}$. The 2000.4 value for DPI that we examined earlier was 6634.9. At this value, the instrumental variables and least squares estimates of the MPC are 0.8567 with an estimated standard error of 0.01234 and 1.08479 with an estimated standard error of 0.008694, respectively. These values do differ a bit but less than the quite large differences in the parameters might have led one to expect. We do note that both of these are considerably greater than the estimate in the linear model, 0.9222 (and greater than one, which seems a bit implausible).

    9.5.2 TWO-STEP NONLINEAR LEAST SQUARES ESTIMATION

    In this section, we consider a special case of this general class of models in which the nonlinear regression model depends on a second set of parameters that is estimated separately.


    TABLE 9.3 Nonlinear Least Squares and Instrumental Variable Estimates

                 Instrumental Variables          Least Squares
    Parameter    Estimate       Standard Error   Estimate       Standard Error
    alpha        627.031        26.6063          468.215        22.788
    beta         0.040291       0.006050         0.0971598      0.01064
    gamma        1.34738        0.016816         1.24892        0.1220
    sigma        57.1681                         49.87998
    e'e          650,369.805                     495,114.490

    The model is

    $$y = h(x, \beta, w, \gamma) + \varepsilon.$$

    We consider cases in which the auxiliary parameter $\gamma$ is estimated separately in a model that depends on an additional set of variables $w$. This first step might be a least squares regression, a nonlinear regression, or a maximum likelihood estimation. The parameters $\gamma$ will usually enter $h(\cdot)$ through some function of $\gamma$ and $w$, such as an expectation. The second step then consists of a nonlinear regression of $y$ on $h(x, \beta, w, c)$ in which $c$ is the first-round estimate of $\gamma$. To put this in context, we will develop an example.

    The estimation procedure is as follows.

    1. Estimate $\gamma$ by least squares, nonlinear least squares, or maximum likelihood. We assume that this estimator, however obtained, denoted $c$, is consistent and asymptotically normally distributed with asymptotic covariance matrix $V_c$. Let $\hat{V}_c$ be any appropriate estimator of $V_c$.

    2. Estimate $\beta$ by nonlinear least squares regression of $y$ on $h(x, \beta, w, c)$. Let $\sigma^2 V_b$ be the asymptotic covariance matrix of this estimator of $\beta$, assuming $\gamma$ is known, and let $s^2\hat{V}_b$ be any appropriate estimator of $\sigma^2 V_b = \sigma^2(X^{0\prime}X^0)^{-1}$, where $X^0$ is the matrix of pseudoregressors evaluated at the true parameter values, $x_i^0 = \partial h(x_i, \beta, w_i, \gamma)/\partial \beta$.

    The argument for consistency of $b$ is based on the Slutsky Theorem, D.12, as we treat $b$ as a function of $c$ and the data. We require, as usual, well-behaved pseudoregressors. As long as $c$ is consistent for $\gamma$, the large-sample behavior of the estimator of $\beta$ conditioned on $c$ is the same as that conditioned on $\gamma$, that is, as if $\gamma$ were known. Asymptotic normality is obtained along similar lines (albeit with greater difficulty). The asymptotic covariance matrix for the two-step estimator is provided by the following theorem.

    THEOREM 9.4 Asymptotic Distribution of the Two-Step Nonlinear Least Squares Estimator [Murphy and Topel (1985)]

    Under the standard conditions assumed for the nonlinear least squares estimator, the second-step estimator of $\beta$ is consistent and asymptotically normally distributed with asymptotic covariance matrix

    $$V_b^* = \sigma^2 V_b + V_b[C V_c C' - C V_c R' - R V_c C']V_b,$$

    where

    $$C = n \, \text{plim} \frac{1}{n}\sum_{i=1}^{n} x_i^0 \hat{\varepsilon}_i^2 \left(\frac{\partial h(x_i, \beta, w_i, \gamma)}{\partial \gamma'}\right)$$

    and

    $$R = n \, \text{plim} \frac{1}{n}\sum_{i=1}^{n} x_i^0 \hat{\varepsilon}_i \left(\frac{\partial g(w_i, \gamma)}{\partial \gamma'}\right).$$

    The function $\partial g(\cdot)/\partial \gamma$ in the definition of $R$ is the gradient of the $i$th term in the log-likelihood function if $\gamma$ is estimated by maximum likelihood. (The precise form is shown below.) If $\gamma$ appears as the parameter vector in a regression model,

    $$z_i = f(w_i, \gamma) + u_i, \tag{9-32}$$

    then $\partial g(\cdot)/\partial \gamma$ will be a derivative of the sum of squared deviations function,

    $$\frac{\partial g(\cdot)}{\partial \gamma} = u_i \frac{\partial f(w_i, \gamma)}{\partial \gamma}.$$

    If this is a linear regression, then the derivative vector is just $w_i$.
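Once estimates of the pieces in Theorem 9.4 are in hand, assembling the corrected covariance matrix is a matter of matrix arithmetic. The sketch below shows only that final step; the function name `murphy_topel_cov` and the small matrices are hypothetical, and how $C$, $R$, $V_b$, and $V_c$ are estimated depends on the model at hand.

```python
import numpy as np

def murphy_topel_cov(s2, V_b, V_c, C, R):
    """V*_b = s2*V_b + V_b [C V_c C' - C V_c R' - R V_c C'] V_b.

    V_b is K x K, V_c is L x L, and C and R are K x L.
    """
    middle = C @ V_c @ C.T - C @ V_c @ R.T - R @ V_c @ C.T
    return s2 * V_b + V_b @ middle @ V_b

# Hypothetical inputs with K = 2 second-step and L = 2 first-step parameters.
s2 = 1.5
V_b = np.array([[0.20, 0.02], [0.02, 0.10]])
V_c = np.array([[0.05, 0.00], [0.00, 0.08]])
C = np.array([[1.0, 0.5], [0.2, 0.3]])
R = np.zeros((2, 2))      # uncorrelated disturbances, so R vanishes

V_star = murphy_topel_cov(s2, V_b, V_c, C, R)
```

With $R = 0$ the correction term $V_b C V_c C' V_b$ is positive semidefinite, so the corrected variances cannot be smaller than the uncorrected ones.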

    Implementation of the theorem requires that the asymptotic covariance matrix computed as usual for the second-step estimator based on $c$ instead of the true $\gamma$ must be corrected for the presence of the estimator $c$ in $b$.

    Before developing the application, we note how some important special cases are handled. If $\gamma$ enters $h(\cdot)$ as the coefficient vector in a prediction of another variable in a regression model, then we have the following useful results.

    Case 1 Linear regression models. If $h(\cdot) = x_i'\beta + \delta E[z_i \mid w_i] + \varepsilon_i$, where $E[z_i \mid w_i] = w_i'\gamma$, then the two models are just fit by linear least squares as usual. The regression for $y$ includes an additional variable, $w_i'c$. Let $d$ be the coefficient on this new variable. Then

    $$\hat{C} = d \sum_{i=1}^{n} e_i^2 x_i w_i'$$

    and

    $$\hat{R} = \sum_{i=1}^{n} (e_i u_i) x_i w_i'.$$

    Case 2 Uncorrelated linear regression models. In Case 1, if the two regression disturbances are uncorrelated, then $R = 0$.

    Case 2 is general. The terms in $R$ vanish asymptotically if the regressions have uncorrelated disturbances, whether either or both of them are linear. This situation will be quite common.


    Case 3 Prediction from a nonlinear model. In Cases 1 and 2, if $E[z_i \mid w_i]$ is a nonlinear function rather than a linear function, then it is only necessary to change $w_i$ to $w_i^0 = \partial E[z_i \mid w_i]/\partial \gamma$, a vector of pseudoregressors, in the definitions of $C$ and $R$.

    Case 4 Subset of regressors. In Case 2 (but not in Case 1), if $w$ contains all the variables that are in $x$, then the appropriate estimator is simply

    $$V_b^* = s_e^2\left[1 + c^2\left(\frac{s_u^2}{s_e^2}\right)\right](X^{*\prime}X^*)^{-1},$$

    where $X^*$ includes all the variables in $x$ as well as the prediction for $z$.

    All these cases carry over to the case of a nonlinear regression function for $y$. It is only necessary to replace $x_i$, the actual regressors in the linear model, with $x_i^0$, the pseudoregressors.
    9.5.3 TWO-STEP ESTIMATION OF A CREDIT SCORING MODEL

    Greene (1995c) estimates a model of consumer behavior in which the dependent variable of interest is the number of major derogatory reports recorded in the credit history of a sample of applicants for a type of credit card. In fact, this particular variable is one of the most significant determinants of whether an application for a loan or a credit card will be accepted. This dependent variable $y$ is a discrete variable that at any time, for most consumers, will equal zero, but for a significant fraction who have missed several revolving credit payments, it will take a positive value. The typical values are zero, one, or two, but values up to, say, 10 are not unusual. This count variable is modeled using a Poisson regression model. This model appears in Sections B.4.8, 22.2.1, 22.3.7, and 21.9. The probability density function for this discrete random variable is

    $$\text{Prob}[y_i = j] = \frac{e^{-\lambda_i}\lambda_i^j}{j!}.$$

    The expected value of $y_i$ is $\lambda_i$, so depending on how $\lambda_i$ is specified and despite the unusual nature of the dependent variable, this model is a linear or nonlinear regression model. We will consider both cases, the linear model $E[y_i \mid x_i] = x_i'\beta$ and the more common loglinear model $E[y_i \mid x_i] = e^{x_i'\beta}$, where $x_i$ might include such covariates as age, income, and typical monthly credit account expenditure. This model is usually estimated by maximum likelihood. But since it is a bona fide regression model, least squares, either linear or nonlinear, is a consistent, if inefficient, estimator.

    In Greene's study, a secondary model is fit for the outcome of the credit card application. Let $z_i$ denote this outcome, coded 1 if the application is accepted, 0 if not. For purposes of this example, we will model this outcome using a logit model (see the extensive development in Chapter 21, esp. Section 21.3). Thus

    $$\text{Prob}[z_i = 1] = P(w_i, \gamma) = \frac{e^{w_i'\gamma}}{1 + e^{w_i'\gamma}},$$

    where $w_i$ might include age, income, whether the applicants own their own homes, and whether they are self-employed; these are the sorts of variables that "credit scoring" agencies examine.
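The three functions that define this model are simple to write down. The Python sketch below is a minimal illustration using only the standard library; the function names are assumptions introduced here, and vectors are passed as plain sequences.

```python
import math

def poisson_prob(j, lam):
    """Prob[y = j] = exp(-lam) * lam**j / j! for a Poisson variable."""
    return math.exp(-lam) * lam ** j / math.factorial(j)

def loglinear_mean(x, beta):
    """E[y | x] = exp(x'beta), the loglinear conditional mean."""
    return math.exp(sum(xk * bk for xk, bk in zip(x, beta)))

def logit_prob(w, gamma):
    """P(w, gamma) = exp(w'gamma) / (1 + exp(w'gamma))."""
    index = sum(wk * gk for wk, gk in zip(w, gamma))
    return 1.0 / (1.0 + math.exp(-index))

# With a mean of 0.36, most of the probability mass is at zero.
p_zero = poisson_prob(0, 0.36)
```

Writing the logit probability as $1/(1 + e^{-w'\gamma})$ is algebraically the same as the form in the text.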


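    Finally, we suppose that the probability of acceptance enters the regression model as an additional explanatory variable. (We concede that the power of the underlying theory wanes a bit here.) Thus, our nonlinear regression model is

    $$E[y_i \mid x_i] = x_i'\beta + \delta P(w_i, \gamma) \quad \text{(linear)}$$

    or

    $$E[y_i \mid x_i] = e^{x_i'\beta + \delta P(w_i, \gamma)} \quad \text{(loglinear, nonlinear)}.$$

    The two-step estimation procedure consists of estimation of $\gamma$ by maximum likelihood, then computing $\hat{P}_i = P(w_i, c)$, and finally estimating $[\beta, \delta]$ by either linear or nonlinear least squares using $\hat{P}_i$ as a constructed regressor. We will develop the theoretical background for the estimator and then continue with implementation of the estimator.

    For the Poisson regression model, when the conditional mean function is linear, $x_i^0 = x_i$. If it is loglinear, then

    $$x_i^0 = \partial \lambda_i/\partial \beta = \partial \exp(x_i'\beta)/\partial \beta = \lambda_i x_i,$$

    which is simple to compute. When $P(w_i, \gamma)$ is included in the model, the pseudoregressor vector $x_i^0$ includes this variable and the coefficient vector is $[\beta, \delta]$. Then

    $$\hat{V}_b = \frac{1}{n}\sum_{i=1}^{n}[y_i - h(x_i, w_i, b, c)]^2 \times (X^{0\prime}X^0)^{-1},$$

    where $X^0$ is computed at $[b, d, c]$, the final estimates.

    For the logit model, the gradient of the log-likelihood and the estimator of $V_c$ are given in Section 21.3.1. They are

    $$\partial \ln f(z_i \mid w_i, \gamma)/\partial \gamma = [z_i - P(w_i, \gamma)]w_i$$

    and

    $$\hat{V}_c = \left[\sum_{i=1}^{n}[z_i - P(w_i, \hat{\gamma})]^2 w_i w_i'\right]^{-1}.$$

    Note that for this model, we are actually inserting a prediction from a regression model of sorts, since $E[z_i \mid w_i] = P(w_i, \gamma)$. To compute $C$, we will require

    $$\partial h(\cdot)/\partial \gamma = \lambda_i \delta \, \partial P_i/\partial \gamma = \lambda_i \delta P_i(1 - P_i)w_i.$$

    The remaining parts of the corrected covariance matrix are computed using

    $$\hat{C} = \sum_{i=1}^{n}\left(\hat{\lambda}_i \hat{x}_i^0 \hat{\varepsilon}_i^2\right)[\hat{\lambda}_i d \hat{P}_i(1 - \hat{P}_i)]w_i'$$

    and

    $$\hat{R} = \sum_{i=1}^{n}\left(\hat{\lambda}_i \hat{x}_i^0 \hat{\varepsilon}_i\right)(z_i - \hat{P}_i)w_i'.$$

    (If the regression model is linear, then the three occurrences of $\lambda_i$ are omitted.)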


    TABLE 9.4 Two-Step Estimates of a Credit Scoring Model

    Step 1 is the logit model $P(w_i, \gamma)$. The first Step 2 block is the linear model $E[y_i \mid x_i] = x_i'\beta + \delta P_i$; the second is the loglinear model $E[y_i \mid x_i] = e^{x_i'\beta + \delta P_i}$.

                 Step 1                 Step 2 (linear)                    Step 2 (loglinear)
    Variable     Est.      St.Er.       Est.       St.Er.     St.Er.*      Est.        St.Er.     St.Er.*
    Constant     2.7236    1.0970       1.0628     1.1907     1.2681       7.1969      6.2708     49.3854
    Age          0.7328    0.02961      0.021661   0.018756   0.020089     0.079984    0.08135    0.61183
    Income       0.21919   0.14296      0.03473    0.07266    0.082079     0.1328007   0.21380    1.8687
    Self-empl    1.9439    1.01270
    Own Rent     0.18937   0.49817
    Expend                              0.000787   0.000368   0.000413     0.28008     0.96429    0.96969
    P(w_i, γ)                           1.0408     1.0653     1.177299     6.99098     5.7978     49.34414
    ln L         53.925
    e'e                                 95.5506                            80.31265
    s                                   0.977496                           0.89617
    R^2                                 0.05433                            0.20514
    Mean         0.73                   0.36                               0.36

    Data used in the application are listed in Appendix Table F9.1. We use the following model:

    $$\text{Prob}[z_i = 1] = P(\text{age, income, own rent, self-employed}),$$
    $$E[y_i] = h(\text{age, income, expend}).$$

    We have used 100 of the 1,319 observations used in the original study. Table 9.4 reports the results of the various regressions and computations. The column denoted St.Er.* contains the corrected standard error. The column marked St.Er. contains the standard errors that would be computed ignoring the two-step nature of the computations. For the linear model, we used $e'e/n$ to estimate $\sigma^2$.

    As expected, accounting for the variability in $c$ increases the standard errors of the second-step estimator. The linear model appears to give quite different results from the nonlinear model. But this can be deceiving. In the linear model, $\partial E[y_i \mid x_i, P_i]/\partial x_i = \beta$, whereas in the nonlinear model, the counterpart is not $\beta$ but $\lambda_i\beta$. The value of $\lambda_i$ at the mean values of all the variables in the second-step model is roughly 0.36 (the mean of the dependent variable), so the marginal effects in the nonlinear model are [0.0224, 0.0372, 0.07847, 1.9587], respectively, including $P_i$ but not the constant, which are reasonably similar to those for the linear model. To compute an asymptotic covariance matrix for the estimated marginal effects, we would use the delta method from Sections D.2.7 and D.3.1. For convenience, let $b_p = [b', d]'$, and let $v_i = [x_i', \hat{P}_i]'$, which just adds $P_i$ to the regressor vector so we need not treat it separately. Then the vector of marginal effects is

    $$m = \exp(v_i'b_p) \times b_p = \lambda_i b_p.$$

    The matrix of derivatives is

    $$G = \partial m/\partial b_p' = \lambda_i(I + b_p v_i'),$$

    so the estimator of the asymptotic covariance matrix for $m$ is

    $$\text{Est.Asy.Var}[m] = G V_b^* G'.$$
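The delta-method computation for the marginal effects can be sketched as follows. The function names and every number below are hypothetical placeholders, not values from Table 9.4; in an application `V_star` would be the corrected covariance matrix of the second-step estimator.

```python
import numpy as np

def marginal_effects(b_p, v):
    """m = exp(v'b_p) * b_p and G = dm/db_p' = lambda * (I + b_p v')."""
    lam = np.exp(v @ b_p)
    m = lam * b_p
    G = lam * (np.eye(b_p.size) + np.outer(b_p, v))
    return m, G

def marginal_effects_cov(b_p, v, V_star):
    """Delta-method estimate G V* G' of the covariance matrix of m."""
    _, G = marginal_effects(b_p, v)
    return G @ V_star @ G.T

# Hypothetical coefficient vector, evaluation point, and covariance matrix.
b_p = np.array([-1.0, 0.02, 0.1])
v = np.array([1.0, 30.0, 0.7])
V_star = np.diag([0.25, 0.0004, 0.09])

m, G = marginal_effects(b_p, v)
cov_m = marginal_effects_cov(b_p, v, V_star)
```

The second term in $G$, $\lambda_i b_p v_i'$, is what accounts for the fact that $\lambda_i$ is itself a function of the estimated coefficients.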


    TABLE 9.5 Maximum Likelihood Estimates of Second-Step Regression Model

                        Constant   Age        Income     Expend     P
    Estimate            6.3200     0.073106   0.045236   0.00689    4.6324
    Std. Error          3.9308     0.054246   0.17411    0.00202    3.6618
    Corr. Std. Error    9.0321     0.102867   0.402368   0.003985   9.918233

    One might be tempted to treat $\lambda_i$ as a constant, in which case only the first term in the quadratic form would appear and the computation would amount simply to multiplying the asymptotic standard errors for $b_p$ by $\lambda_i$. This approximation would leave the asymptotic t ratios unchanged, whereas making the full correction will change the entire covariance matrix. The approximation will generally lead to an understatement of the correct standard errors.

    Finally, although this treatment is not discussed in detail until Chapter 18, we note at this point that nonlinear least squares is an inefficient estimator in the Poisson regression model; maximum likelihood is the preferred, efficient estimator. Table 9.5 presents the maximum likelihood estimates with both corrected and uncorrected estimates of the asymptotic standard errors of the parameter estimates. (The full discussion of the model is given in Section 21.9.) The corrected standard errors are computed using the methods shown in Section 17.7. A comparison of these estimates with those in the third set of Table 9.4 suggests the clear superiority of the maximum likelihood estimator.

    9.6 SUMMARY AND CONCLUSIONS

    In this chapter, we extended the regression model to a form which allows nonlinearity in the parameters in the regression function. The results for interpretation, estimation, and hypothesis testing are quite similar to those for the linear model. The two crucial differences between the two models are, first, the more involved estimation procedures needed for the nonlinear model and, second, the ambiguity of the interpretation of the coefficients in the nonlinear model (since the derivatives of the regression are often nonconstant, in contrast to those in the linear model). Finally, we added two additional levels of generality to the model. A nonlinear instrumental variables estimator is suggested to accommodate the possibility that the disturbances in the model are correlated with the included variables. In the second application, two-step nonlinear least squares is suggested as a method of allowing a model to be fit while including functions of previously estimated parameters.

    Key Terms and Concepts

    • Box-Cox transformation
    • Consistency
    • Delta method
    • GMM estimator
    • Identification
    • Instrumental variables estimator
    • Iteration
    • Linearized regression model
    • LM test
    • Logit
    • Multicollinearity
    • Nonlinear model
    • Normalization
    • Orthogonality condition
    • Overidentifying restrictions
    • PE test
    • Pseudoregressors
    • Semiparametric
    • Starting values
    • Translog
    • Two-step estimation
    • Wald test


    Exercises

    1. Describe how to obtain nonlinear least squares estimates of the parameters of the model $y = \alpha x^{\beta} + \varepsilon$.

    2. Use MacKinnon, White, and Davidson's PE test to determine whether a linear or loglinear production model is more appropriate for the data in Appendix Table F6.1. (The test is described in Section 9.4.3 and Example 9.8.)

    3. Using the Box-Cox transformation, we may specify an alternative to the Cobb-Douglas model as

    $$\ln Y = \alpha + \beta_k\frac{K^{\lambda} - 1}{\lambda} + \beta_l\frac{L^{\lambda} - 1}{\lambda} + \varepsilon.$$

    Using Zellner and Revankar's data in Appendix Table F9.2, estimate $\alpha$, $\beta_k$, $\beta_l$, and $\lambda$ by using the scanning method suggested in Section 9.3.2. (Do not forget to scale $Y$, $K$, and $L$ by the number of establishments.) Use (9-16), (9-12), and (9-13) to compute the appropriate asymptotic standard errors for your estimates. Compute the two output elasticities, $\partial \ln Y/\partial \ln K$ and $\partial \ln Y/\partial \ln L$, at the sample means of $K$ and $L$. [Hint: $\partial \ln Y/\partial \ln K = K \, \partial \ln Y/\partial K$.]

    4. For the model in Exercise 3, test the hypothesis that $\lambda = 0$ using a Wald test, a likelihood ratio test, and a Lagrange multiplier test. Note that the restricted model is the Cobb-Douglas log-linear model.

    5. To extend Zellner and Revankar's model in a fashion similar to theirs, we can use the Box-Cox transformation for the dependent variable as well. Use the method of Example 17.6 (with $\theta = \lambda$) to repeat the study of the preceding two exercises. How do your results change?

    6. Verify the following differential equation, which applies to the Box-Cox transformation:

    $$\frac{d^i x^{(\lambda)}}{d\lambda^i} = \frac{1}{\lambda}\left[x^{\lambda}(\ln x)^i - \frac{i \, d^{i-1}x^{(\lambda)}}{d\lambda^{i-1}}\right]. \tag{9-33}$$

    Show that the limiting sequence for $\lambda = 0$ is

    $$\lim_{\lambda \to 0}\frac{d^i x^{(\lambda)}}{d\lambda^i} = \frac{(\ln x)^{i+1}}{i + 1}. \tag{9-34}$$

    These results can be used to great advantage in deriving the actual second derivatives of the log-likelihood function for the Box-Cox model.


    10 NONSPHERICAL DISTURBANCES: THE GENERALIZED REGRESSION MODEL
    10.1 INTRODUCTION

    In Chapter 9, we extended the classical linear model to allow the conditional mean to be a nonlinear function.[1] But we retained the important assumptions about the disturbances: that they are uncorrelated with each other and that they have a constant variance, conditioned on the independent variables. In this and the next several chapters, we extend the multiple regression model to disturbances that violate these classical assumptions. The generalized linear regression model is

    $$y = X\beta + \varepsilon, \qquad E[\varepsilon \mid X] = 0, \qquad E[\varepsilon\varepsilon' \mid X] = \sigma^2\Omega = \Sigma, \tag{10-1}$$

    where $\Omega$ is a positive definite matrix. (The covariance matrix is written in the form $\sigma^2\Omega$ at several points so that we can obtain the classical model, $\sigma^2 I$, as a convenient special case.) As we will examine briefly below, the extension of the model to nonlinearity is relatively minor in comparison with the variants considered here. For present purposes, we will retain the linear specification and refer to our model simply as the generalized regression model.

    Two cases we will consider in detail are heteroscedasticity and autocorrelation. Disturbances are heteroscedastic when they have different variances. Heteroscedasticity usually arises in volatile high frequency time-series data such as daily observations in financial markets and in cross-section data where the scale of the dependent variable and the explanatory power of the model tend to vary across observations. Microeconomic data such as expenditure surveys are typical. The disturbances are still assumed to be uncorrelated across observations, so $\sigma^2\Omega$ would be

    $$\sigma^2\Omega = \sigma^2 \begin{bmatrix} \omega_{11} & 0 & \cdots & 0 \\ 0 & \omega_{22} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \omega_{nn} \end{bmatrix} = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix}.$$

    [1] Recall that our definition of nonlinearity pertains to the estimation method required to obtain the parameter estimates, not to the way that they enter the regression function.


    (The first mentioned situation involving financial data is more complex than this, and is examined in detail in Section 11.8.)

    Autocorrelation is usually found in time-series data. Economic time series often display a "memory" in that variation around the regression function is not independent from one period to the next. The seasonally adjusted price and quantity series published by government agencies are examples. Time-series data are usually homoscedastic, so $\sigma^2\Omega$ might be

    $$\sigma^2\Omega = \sigma^2 \begin{bmatrix} 1 & \rho_1 & \cdots & \rho_{n-1} \\ \rho_1 & 1 & \cdots & \rho_{n-2} \\ \vdots & & \ddots & \vdots \\ \rho_{n-1} & \rho_{n-2} & \cdots & 1 \end{bmatrix}.$$

    The values that appear off the diagonal depend on the model used for the disturbance. In most cases, consistent with the notion of a fading memory, the values decline as we move away from the diagonal.

    Panel data sets, consisting of cross sections observed at several points in time, may exhibit both characteristics. We shall consider them in Chapter 14. This chapter presents some general results for this extended model. The next several chapters examine in detail specific types of generalized regression models.

    Our earlier results for the classical model will have to be modified. We will take the same approach in this chapter on general results and in the next two on heteroscedasticity and serial correlation, respectively:

    1. We first consider the consequences for the least squares estimator of the more general form of the regression model. This will include assessing the effect of ignoring the complication of the generalized model and of devising an appropriate estimation strategy, still based on least squares.

    2. In subsequent sections, we will examine alternative estimation approaches that can make better use of the characteristics of the model. We begin with GMM estimation, which is robust and semiparametric. Minimal assumptions about $\Omega$ are made at this point.

    3. We then narrow the assumptions and begin to look for methods of detecting the failure of the classical model; that is, we formulate procedures for testing the specification of the classical model against the generalized regression.

    4. The final step in the analysis is to formulate parametric models that make specific assumptions about $\Omega$. Estimators in this setting are some form of generalized least squares or maximum likelihood.

    The model is examined in general terms in this and the next two chapters. Major applications to panel data and multiple equation systems are considered in Chapters 13 and 14.
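The two covariance structures described in the introduction can be made concrete with a small numerical sketch. The function names below are hypothetical, and the geometric decay `rho**|i - j|` is an assumed AR(1)-style pattern chosen only for illustration; the text leaves the model for the off-diagonal elements open.

```python
import numpy as np


def heteroscedastic_cov(variances):
    """Diagonal sigma^2 * Omega: unequal variances, zero covariances."""
    return np.diag(np.asarray(variances, dtype=float))


def fading_memory_cov(n, rho, sigma2=1.0):
    """Homoscedastic sigma^2 * Omega with correlation rho**|i - j|.

    The geometric decay is an assumption made for this example only.
    """
    idx = np.arange(n)
    lags = np.abs(idx[:, None] - idx[None, :])
    return sigma2 * rho ** lags


sigma_het = heteroscedastic_cov([1.0, 4.0, 9.0])
sigma_auto = fading_memory_cov(4, rho=0.5)
print(sigma_het)
print(sigma_auto)
```

In the first matrix only the diagonal varies; in the second the diagonal is constant and the off-diagonal entries shrink with distance from it, which is the "fading memory" pattern.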

    10.2 LEAST SQUARES AND INSTRUMENTAL VARIABLES ESTIMATION

    The essential results for the classical model with spherical disturbances,

    $$E[\varepsilon \mid X] = 0 \quad \text{and} \quad E[\varepsilon\varepsilon' \mid X] = \sigma^2 I, \tag{10-2}$$

    are presented in Chapters 2 through 8. To reiterate, we found that the ordinary least squares (OLS) estimator

    $$b = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'\varepsilon \tag{10-3}$$

    is best linear unbiased (BLU), consistent and asymptotically normally distributed (CAN), and, if the disturbances are normally distributed, like other maximum likelihood estimators considered in Chapter 17, asymptotically efficient among all CAN estimators. We now consider which of these properties continue to hold in the model of (10-1).

    To summarize, the least squares, nonlinear least squares, and instrumental variables estimators retain only some of their desirable properties in this model. Least squares remains unbiased, consistent, and asymptotically normally distributed. It will, however, no longer be efficient (this claim remains to be verified), and the usual inference procedures are no longer appropriate. Nonlinear least squares and instrumental variables likewise remain consistent, but once again, the extension of the model brings about some changes in our earlier results concerning the asymptotic distributions. We will consider these cases in detail.
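    10.2.1 FINITE SAMPLE PROPERTIES OF ORDINARY LEAST SQUARES

    By taking expectations on both sides of (10-3), we find that if $E[\varepsilon \mid X] = 0$, then

    $$E[b] = E_X\big[E[b \mid X]\big] = \beta. \tag{10-4}$$

    Therefore, we have the following theorem.

    THEOREM 10.1 Finite Sample Properties of b in the Generalized Regression Model. If the regressors and disturbances are uncorrelated, then the unbiasedness of least squares is unaffected by violations of assumption (10-2). The least squares estimator is unbiased in the generalized regression model. With nonstochastic regressors, or conditional on X, the sampling variance of the least squares estimator is

    $$\begin{aligned} \mathrm{Var}[b \mid X] &= E[(b - \beta)(b - \beta)' \mid X] \\ &= E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1} \mid X] \\ &= (X'X)^{-1}X'(\sigma^2\Omega)X(X'X)^{-1} \\ &= \frac{\sigma^2}{n}\left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{n}X'\Omega X\right)\left(\frac{1}{n}X'X\right)^{-1}. \end{aligned} \tag{10-5}$$

    If the regressors are stochastic, then the unconditional variance is $E_X[\mathrm{Var}[b \mid X]]$. In (10-3), b is a linear function of $\varepsilon$. Therefore, if $\varepsilon$ is normally distributed, then

    $$b \mid X \sim N\big[\beta,\ \sigma^2 (X'X)^{-1}(X'\Omega X)(X'X)^{-1}\big].$$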
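A minimal numerical sketch of (10-5) compares the conventional matrix $\sigma^2(X'X)^{-1}$ with the sandwich form $(X'X)^{-1}X'(\sigma^2\Omega)X(X'X)^{-1}$. The function name, the simulated regressor, and the choice of variances proportional to $x_i^2$ are all assumptions made for this illustration; $\Omega$ is scaled so that its trace equals n purely for convenience.

```python
import numpy as np


def ols_variances(X, omega, sigma2=1.0):
    """Return (conventional, sandwich) covariance matrices of b given X."""
    xtx_inv = np.linalg.inv(X.T @ X)
    conventional = sigma2 * xtx_inv
    # (X'X)^{-1} X' (sigma^2 Omega) X (X'X)^{-1}, the third line of (10-5)
    sandwich = xtx_inv @ (X.T @ (sigma2 * omega) @ X) @ xtx_inv
    return conventional, sandwich


rng = np.random.default_rng(0)
n = 200
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])

# Hypothetical heteroscedasticity: variance proportional to x_i^2,
# rescaled so that trace(Omega) = n.
omega = np.diag(x ** 2 * n / np.sum(x ** 2))

conv, sand = ols_variances(X, omega)
print("conventional s.e.:", np.sqrt(np.diag(conv)))
print("sandwich s.e.:    ", np.sqrt(np.diag(sand)))
```

When `omega` is the identity matrix the two results coincide, which is the classical special case; otherwise they generally differ, and the direction of the difference depends on how the variances relate to the regressors.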


    The end result is that b has properties that are similar to those in the classical regression case. Since the variance of the least squares estimator is not $\sigma^2(X'X)^{-1}$, however, statistical inference based on $s^2(X'X)^{-1}$ may be misleading. Not only is this the wrong matrix to be used, but $s^2$ may be a biased estimator of $\sigma^2$. There is usually no way to know whether $\sigma^2(X'X)^{-1}$ is larger or smaller than the true variance of b, so even with a good estimate of $\sigma^2$, the conventional estimator of Var[b] may not be particularly useful. Finally, since we have dispensed with the fundamental underlying assumption, the familiar inference procedures based on the F and t distributions will no longer be appropriate. One issue we will explore at several points below is how badly one is likely to go awry if the result in (10-5) is ignored and if the use of the familiar procedures based on $s^2(X'X)^{-1}$ is continued.
    10.2.2 ASYMPTOTIC PROPERTIES OF LEAST SQUARES

    If Var[b | X] converges to zero, then b is mean square consistent. With well-behaved regressors, $(X'X/n)^{-1}$ will converge to a constant matrix. But $(\sigma^2/n)(X'\Omega X/n)$ need not converge at all. By writing this product as

    $$\frac{\sigma^2}{n}\left(\frac{X'\Omega X}{n}\right) = \frac{\sigma^2}{n}\left(\frac{\sum_{i=1}^{n}\sum_{j=1}^{n}\omega_{ij}\,x_i x_j'}{n}\right) \tag{10-6}$$

    we see that though the leading constant will, by itself, converge to zero, the matrix is a sum of $n^2$ terms, divided by n. Thus, the product is a scalar that is O(1/n) times a matrix that is, at least at this juncture, O(n), which is O(1). So it does appear at first blush that if the product in (10-6) does converge, it might converge to a matrix of nonzero constants. In this case, the covariance matrix of the least squares estimator would not converge to zero, and consistency would be difficult to establish. We will examine in some detail the conditions under which the matrix in (10-6) converges to a constant matrix.[2] If it does, then since $\sigma^2/n$ does vanish, ordinary least squares is consistent as well as unbiased.

    THEOREM 10.2 Consistency of OLS in the Generalized Regression Model. If $Q = \operatorname{plim}(X'X/n)$ and $\operatorname{plim}(X'\Omega X/n)$ are both finite positive definite matrices, then b is consistent for $\beta$. Under the assumed conditions,

    $$\operatorname{plim} b = \beta. \tag{10-7}$$

    The conditions in Theorem 10.2 depend on both X and $\Omega$. An alternative formula[3] that separates the two components is as follows. Ordinary least squares is consistent in the generalized regression model if:

    1. The smallest characteristic root of $X'X$ increases without bound as $n \to \infty$, which implies that $\operatorname{plim}(X'X)^{-1} = 0$. If the regressors satisfy the Grenander conditions G1 through G3 of Section 5.2, then they will meet this requirement.

    2. The largest characteristic root of $\Omega$ is finite for all n. For the heteroscedastic model, the variances are the characteristic roots, which requires them to be finite. For models with autocorrelation, the requirements are that the elements of $\Omega$ be finite and that the off-diagonal elements not be too large relative to the diagonal elements. We will examine this condition at several points below.

    [2] In order for the product in (10-6) to vanish, it would be sufficient for $(X'\Omega X/n)$ to be $O(n^{\delta})$ where $\delta < 1$.

    [3] Amemiya (1985, p. 184).

    The least squares estimator is asymptotically normally distributed if the limiting distribution of

    $$\sqrt{n}\,(b - \beta) = \left(\frac{X'X}{n}\right)^{-1}\frac{1}{\sqrt{n}}X'\varepsilon \tag{10-8}$$

    is normal. If $\operatorname{plim}(X'X/n) = Q$, then the limiting distribution of the right-hand side is the same as that of

    $$v_{n,LS} = Q^{-1}\frac{1}{\sqrt{n}}X'\varepsilon = Q^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^{n} x_i\varepsilon_i, \tag{10-9}$$

    where $x_i'$ is a row of X (assuming, of course, that the limiting distribution exists at all). The question now is whether a central limit theorem can be applied directly to v. If the disturbances are merely heteroscedastic and still uncorrelated, then the answer is generally yes. In fact, we already showed this result in Section 5.5.2 when we invoked the Lindeberg-Feller central limit theorem (D.19) or the Lyapounov Theorem (D.20). The theorems allow unequal variances in the sum. The exact variance of the sum is

    $$E_x\left[\mathrm{Var}\left[\frac{1}{\sqrt{n}}\sum_{i=1}^{n} x_i\varepsilon_i \,\Big|\, x_i\right]\right] = \frac{\sigma^2}{n}\sum_{i=1}^{n}\omega_i Q_i,$$

    which, for our purposes, we would require to converge to a positive definite matrix. In our analysis of the classical model, the heterogeneity of the variances arose because of the regressors, but we still achieved the limiting normal distribution in (5-7) through (5-14). All that has changed here is that the variance of $\varepsilon$ varies across observations as well. Therefore, the proof of asymptotic normality in Section 5.2.2 is general enough to include this model without modification. As long as X is well behaved and the diagonal elements of $\Omega$ are finite and well behaved, the least squares estimator is asymptotically normally distributed, with the covariance matrix given in (10-5). That is:

    In the heteroscedastic case, if the variances of $\varepsilon_i$ are finite and are not dominated by any single term, so that the conditions of the Lindeberg-Feller central limit theorem apply to $v_{n,LS}$ in (10-9), then the least squares estimator is asymptotically normally distributed with covariance matrix

    $$\text{Asy. Var}[b] = \frac{\sigma^2}{n}\,Q^{-1}\operatorname{plim}\left(\frac{1}{n}X'\Omega X\right)Q^{-1}. \tag{10-10}$$

    For the most general case, asymptotic normality is much more difficult to establish because the sums in (10-9) are not necessarily sums of independent or even uncorrelated random variables. Nonetheless, Amemiya (1985, p. 187) and Anderson (1971) have shown the asymptotic normality of b in a model of autocorrelated disturbances general enough to include most of the settings we are likely to meet in practice. We will revisit this issue in Chapters 19 and 20 when we examine time-series modeling. We can conclude that, except in particularly unfavorable cases, we have the following theorem.

    THEOREM 10.3 Asymptotic Distribution of b in the GR Model. If the regressors are sufficiently well behaved and the off-diagonal terms in $\Omega$ diminish sufficiently rapidly, then the least squares estimator is asymptotically normally distributed with mean $\beta$ and covariance matrix given in (10-10).

    There are two cases that remain to be considered, the nonlinear regression model and the instrumental variables estimator.

    10.2.3 ASYMPTOTIC PROPERTIES OF NONLINEAR LEAST SQUARES

    If the regression function is nonlinear, then the analysis of this section must be applied to the pseudoregressors $x_i^0$ rather than the independent variables. Aside from this consideration, no new results are needed. We can just apply this discussion to the linearized regression model. Under most conditions, the results listed above apply to the nonlinear least squares estimator as well as the linear least squares estimator.[4]

    10.2.4 ASYMPTOTIC PROPERTIES OF THE INSTRUMENTAL VARIABLES ESTIMATOR

    The second estimator to be considered is the instrumental variables estimator that we considered in Sections 5.4 for the linear model and 9.5.1 for the nonlinear model. We will confine our attention to the linear model. The nonlinear case can be obtained by applying our results to the linearized regression. To review, we considered cases in which the regressors X are correlated with the disturbances $\varepsilon$. If this is the case, as in the time-series models and the errors in variables models that we examined earlier, then b is neither unbiased nor consistent.[5] In the classical model, we constructed an estimator around a set of variables Z that were uncorrelated with $\varepsilon$,

    $$\begin{aligned} b_{IV} &= [X'Z(Z'Z)^{-1}Z'X]^{-1}X'Z(Z'Z)^{-1}Z'y \\ &= \beta + [X'Z(Z'Z)^{-1}Z'X]^{-1}X'Z(Z'Z)^{-1}Z'\varepsilon. \end{aligned} \tag{10-11}$$

    Suppose that X and Z are well behaved in the sense discussed in Section 5.4. That is,

    $\operatorname{plim}(1/n)Z'Z = Q_{ZZ}$, a positive definite matrix,
    $\operatorname{plim}(1/n)Z'X = Q_{ZX} = Q_{XZ}'$, a nonzero matrix,
    $\operatorname{plim}(1/n)X'X = Q_{XX}$, a positive definite matrix.

    [4] Davidson and MacKinnon (1993) consider this case at length.

    [5] It may be asymptotically normally distributed, but around a mean that differs from $\beta$.
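The formula in (10-11) can be sketched directly in code. Everything about the data below is hypothetical: the regressor is made to depend on a component `u` that also enters the disturbance, and the disturbance is heteroscedastic, so the setting is a generalized regression with an endogenous regressor. The function name `iv_estimator` is likewise an assumption of this example.

```python
import numpy as np


def iv_estimator(X, Z, y):
    """b_IV = [X'Z (Z'Z)^{-1} Z'X]^{-1} X'Z (Z'Z)^{-1} Z'y, as in (10-11)."""
    ztz = Z.T @ Z
    pzx = np.linalg.solve(ztz, Z.T @ X)   # (Z'Z)^{-1} Z'X
    pzy = np.linalg.solve(ztz, Z.T @ y)   # (Z'Z)^{-1} Z'y
    xz = X.T @ Z
    return np.linalg.solve(xz @ pzx, xz @ pzy)


# Simulated data (all values hypothetical): x is correlated with the
# disturbance through u, and the disturbance variance depends on z1.
rng = np.random.default_rng(1)
n = 5000
z1, z2, u = rng.standard_normal((3, n))
x = z1 + z2 + u
eps = 0.8 * u + np.abs(z1) * rng.standard_normal(n)
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z1, z2])
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_iv = iv_estimator(X, Z, y)
print("OLS:", b_ols)
print("IV: ", b_iv)
```

In this design the least squares slope settles away from the true value of 2 while the instrumental variables slope stays near it, and setting `Z = X` reproduces least squares exactly. The nonspherical disturbance does not affect either conclusion, which is the point made in the text about consistency.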


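    To avoid a string of matrix computations that may not fit on a single line, for convenience let

    $$\begin{aligned} Q_{XX.Z} &= \left[Q_{XZ}Q_{ZZ}^{-1}Q_{ZX}\right]^{-1}Q_{XZ}Q_{ZZ}^{-1} \\ &= \operatorname{plim}\left\{\left[\left(\frac{1}{n}X'Z\right)\left(\frac{1}{n}Z'Z\right)^{-1}\left(\frac{1}{n}Z'X\right)\right]^{-1}\left(\frac{1}{n}X'Z\right)\left(\frac{1}{n}Z'Z\right)^{-1}\right\}. \end{aligned}$$

    If Z is a valid set of instrumental variables, that is, if the second term in (10-11) vanishes asymptotically, then

    $$\operatorname{plim} b_{IV} = \beta + Q_{XX.Z}\operatorname{plim}\left(\frac{1}{n}Z'\varepsilon\right) = \beta.$$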

    This result is exactly the same one we had before. We might note that at the several points where we have established unbiasedness or consistency of the least squares or instrumental variables estimator, the covariance matrix of the disturbance vector has played no role; unbiasedness is a property of the means. As such, this result should come as no surprise. The large sample behavior of $b_{IV}$ depends on the behavior of

    $$v_{n,IV} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} z_i\varepsilon_i.$$

    This result is exactly the one we analyzed in Section 5.4. If the sampling distribution of $v_n$ converges to a normal distribution, then we will be able to construct the asymptotic distribution for $b_{IV}$. This set of conditions is the same that was necessary for X when we considered b above, with Z in place of X. We will once again rely on the results of Anderson (1971) or Amemiya (1985) that under very general conditions,

    $$\frac{1}{\sqrt{n}}\sum_{i=1}^{n} z_i\varepsilon_i \xrightarrow{d} N\left[0,\ \sigma^2\operatorname{plim}\left(\frac{1}{n}Z'\Omega Z\right)\right].$$

    With the other results already in hand, we now have the following.

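    THEOREM 10.4 Asymptotic Distribution of the IV Estimator in the Generalized Regression Model. If the regressors and the instrumental variables are well behaved in the fashions discussed above, then

    $$b_{IV} \overset{a}{\sim} N[\beta, V_{IV}], \quad \text{where} \quad V_{IV} = \frac{\sigma^2}{n}\,(Q_{XX.Z})\operatorname{plim}\left(\frac{1}{n}Z'\Omega Z\right)(Q_{XX.Z}'). \tag{10-12}$$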


    10.3 ROBUST ESTIMATION OF ASYMPTOTIC COVARIANCE MATRICES

    There is a remaining question regarding all the preceding. In view of (10-5), is it necessary to discard ordinary least squares as an estimator? Certainly if $\Omega$ is known, then, as shown in Section 10.5, there is a simple and efficient estimator available based on it, and the answer is yes. If $\Omega$ is unknown but its structure is known and we can estimate $\Omega$ using sample information, then the answer is less clear-cut. In many cases, basing estimation of $\beta$ on some alternative procedure that uses an $\hat{\Omega}$ will be preferable to ordinary least squares. This subject is covered in Chapters 11 to 14. The third possibility is that $\Omega$ is completely unknown, both as to its structure and the specific values of its elements. In this situation, least squares or instrumental variables may be the only estimator available, and as such, the only available strategy is to try to devise an estimator for the appropriate asymptotic covariance matrix of b.

    If $\sigma^2\Omega$ were known, then the estimator of the asymptotic covariance matrix of b in (10-10) would be

    $$V_{OLS} = \frac{1}{n}\left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{n}X'[\sigma^2\Omega]X\right)\left(\frac{1}{n}X'X\right)^{-1}.$$

    For the nonlinear least squares estimator, we replace X with $X^0$. For the instrumental variables estimator, the left- and right-side matrices are replaced with the sample estimates of $Q_{XX.Z}$ and its transpose (using $X^0$ again for the nonlinear instrumental variables estimator), and Z replaces X in the center matrix. In all these cases, the matrices of sums of squares and cross products in the left and right matrices are sample data that are readily estimable, and the problem is the center matrix that involves the unknown $\sigma^2\Omega$. For estimation purposes, note that $\sigma^2$ is not a separate unknown parameter. Since $\Omega$ is an unknown matrix, it can be scaled arbitrarily, say by $\kappa$, and with $\sigma^2$ scaled by $1/\kappa$, the same product remains. In our applications, we will remove the indeterminacy by assuming that $\operatorname{tr}(\Omega) = n$, as it is when $\sigma^2\Omega = \sigma^2 I$ in the classical model. For now, just let $\Sigma = \sigma^2\Omega$. It might seem that to estimate $(1/n)X'\Sigma X$, an estimator of $\Sigma$, which contains $n(n+1)/2$ unknown parameters, is required. But fortunately (since with n observations, this method is going to be hopeless), this observation is not quite right. What is required is an estimator of the $K(K+1)/2$ unknown elements in the matrix

    $$\operatorname{plim} Q_* = \operatorname{plim}\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n}\sigma_{ij}\,x_i x_j'.$$

    The point is that $Q_*$ is a matrix of sums of squares and cross products that involves $\sigma_{ij}$ and the rows of X (or Z or $X^0$). The least squares estimator b is a consistent estimator of $\beta$, which implies that the least squares residuals $e_i$ are "pointwise" consistent estimators of their population counterparts $\varepsilon_i$. The general approach, then, will be to use X and e to devise an estimator of $Q_*$.

    Consider the heteroscedasticity case first. We seek an estimator of

    $$Q_* = \frac{1}{n}\sum_{i=1}^{n}\sigma_i^2\,x_i x_i'.$$


    White (1980) has shown that under very general conditions, the estimator

    $$S_0 = \frac{1}{n}\sum_{i=1}^{n} e_i^2\,x_i x_i' \tag{10-13}$$

    has $\operatorname{plim} S_0 = \operatorname{plim} Q_*$.[6] We can sketch a proof of this result using the results we obtained in Section 5.2.[7]

    Note first that $Q_*$ is not a parameter matrix in itself. It is a weighted sum of the outer products of the rows of X (or Z for the instrumental variables case). Thus, we seek not to "estimate" $Q_*$, but to find a function of the sample data that will be arbitrarily close to this function of the population parameters as the sample size grows large. The distinction is important. We are not estimating the middle matrix in (10-10) or (10-12); we are attempting to construct a matrix from the sample data that will behave the same way that this matrix behaves. In essence, if $Q_*$ converges to a finite positive matrix, then we would be looking for a function of the sample data that converges to the same matrix. Suppose that the true disturbances $\varepsilon_i$ could be observed. Then each term in $Q_*$ would equal $E[\varepsilon_i^2 x_i x_i' \mid x_i]$. With some fairly mild assumptions about $x_i$, then, we could invoke a law of large numbers (see Theorems D.2 through D.4) to state that if $Q_*$ has a probability limit, then

    $$\operatorname{plim}\frac{1}{n}\sum_{i=1}^{n}\sigma_i^2\,x_i x_i' = \operatorname{plim}\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2\,x_i x_i'.$$

    The final detail is to justify the replacement of $\varepsilon_i$ with $e_i$ in $S_0$. The consistency of b for $\beta$ is sufficient for the argument. (Actually, residuals based on any consistent estimator of $\beta$ would suffice for this estimator, but as of now, b or $b_{IV}$ is the only one in hand.) The end result is that the White heteroscedasticity consistent estimator

    $$\begin{aligned} \text{Est. Asy. Var}[b] &= \frac{1}{n}\left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} e_i^2\,x_i x_i'\right)\left(\frac{1}{n}X'X\right)^{-1} \\ &= n\,(X'X)^{-1}S_0\,(X'X)^{-1} \end{aligned} \tag{10-14}$$

    can be used to estimate the asymptotic covariance matrix of b.

    This result is extremely important and useful.[8] It implies that without actually specifying the type of heteroscedasticity, we can still make appropriate inferences based on the results of least squares. This implication is especially useful if we are unsure of the precise nature of the heteroscedasticity (which is probably most of the time). We will pursue some examples in Chapter 11.
    [6] See also Eicker (1967), Horn, Horn, and Duncan (1975), and MacKinnon and White (1985).

    [7] We will give only a broad sketch of the proof. Formal results appear in White (1980) and (2001).

    [8] Further discussion and some refinements may be found in Cragg (1982). Cragg shows how White's observation can be extended to devise an estimator that improves on the efficiency of ordinary least squares.
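The estimator in (10-13) and (10-14) takes only a few lines to compute. The sketch below uses simulated data in which the standard deviation of the disturbance is proportional to the regressor; that design, the seed, and the function name `white_cov` are assumptions made for illustration, not part of the text.

```python
import numpy as np


def white_cov(X, y):
    """OLS coefficients, the White estimator n (X'X)^{-1} S0 (X'X)^{-1}, residuals."""
    n = X.shape[0]
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ (X.T @ y)
    e = y - X @ b                               # least squares residuals
    s0 = (X * (e ** 2)[:, None]).T @ X / n      # S0 = (1/n) sum e_i^2 x_i x_i'
    return b, n * xtx_inv @ s0 @ xtx_inv, e


# Simulated heteroscedastic data (hypothetical values): sd of eps_i equals x_i.
rng = np.random.default_rng(2)
n = 1000
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + x * rng.standard_normal(n)

b, v_white, e = white_cov(X, y)
s2 = e @ e / (n - X.shape[1])
v_conventional = s2 * np.linalg.inv(X.T @ X)
print("White s.e.:       ", np.sqrt(np.diag(v_white)))
print("Conventional s.e.:", np.sqrt(np.diag(v_conventional)))
```

No model for the variances is supplied anywhere in the computation; only X and the residuals are used, which is what makes the estimator robust to the unknown form of the heteroscedasticity.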


    The extension of White's result to the more general case of autocorrelation is much more difficult. The natural counterpart for estimating

    $$Q_* = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n}\sigma_{ij}\,x_i x_j'$$

    would be

    $$\hat{Q}_* = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} e_i e_j\,x_i x_j'. \tag{10-15}$$

    But there are two problems with this estimator, one theoretical, which applies to $Q_*$ as well, and one practical, which is specific to the latter. Unlike the heteroscedasticity case, the matrix in (10-15) is 1/n times a sum of $n^2$ terms, so it is difficult to conclude yet that it will converge to anything at all. This application is most likely to arise in a time-series setting. To obtain convergence, it is necessary to assume that the terms involving unequal subscripts in (10-15) diminish in importance as n grows. A sufficient condition is that terms with subscript pairs $|i - j|$ grow smaller as the distance between them grows larger. In practical terms, observation pairs are progressively less correlated as their separation in time grows. Intuitively, if one can think of weights with the diagonal elements getting a weight of 1.0, then in the sum, the weights grow smaller as we move away from the diagonal. If we think of the sum of the weights rather than just the number of terms, then this sum falls off sufficiently rapidly that as n grows large, the sum is of order n rather than $n^2$. Thus, we achieve convergence of $Q_*$ by assuming that the rows of X are well behaved and that the correlations diminish with increasing separation in time. (See Sections 5.3, 12.5, and 20.5 for a more formal statement of this condition.)

    The practical problem is that $\hat{Q}_*$ need not be positive definite. Newey and West (1987a) have devised an estimator that overcomes this difficulty:

    $$\hat{Q}_* = S_0 + \frac{1}{n}\sum_{l=1}^{L}\sum_{t=l+1}^{n} w_l\,e_t e_{t-l}\,(x_t x_{t-l}' + x_{t-l} x_t'), \qquad w_l = 1 - \frac{l}{L+1}. \tag{10-16}$$

    The Newey-West autocorrelation consistent covariance estimator is surprisingly simple and relatively easy to implement.[9] There is a final problem to be solved. It must be determined in advance how large L is to be. We will examine some special cases in Chapter 12, but in general, there is little theoretical guidance. Current practice specifies $L \approx T^{1/4}$, where T is the number of time-series observations (n in the notation used above). Unfortunately, the result is not quite as crisp as that for the heteroscedasticity consistent estimator.

    We have the result that b and $b_{IV}$ are asymptotically normally distributed, and we have an appropriate estimator for the asymptotic covariance matrix. We have not specified the distribution of the disturbances, however. Thus, for inference purposes, the F statistic is approximate at best. Moreover, for more involved hypotheses, the likelihood ratio and Lagrange multiplier tests are unavailable. That leaves the Wald statistic, including asymptotic "t ratios," as the main tool for statistical inference. We will examine a number of applications in the chapters to follow.

    The White and Newey-West estimators are standard in the econometrics literature. We will encounter them at many points in the discussion to follow.

    [9] Both estimators are now standard features in modern econometrics computer programs. Further results on different weighting schemes may be found in Hayashi (2000, pp. 406-410).
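A minimal sketch of (10-16) follows. The weighted cross-product terms are added to $S_0$, and the result is placed in the same sandwich form as (10-14), which is an assumption made here by analogy with the heteroscedasticity case. The simulated AR(1) disturbances, the seed, and the function names are hypothetical, and the lag length simply applies the $T^{1/4}$ rule of thumb mentioned in the text.

```python
import numpy as np


def newey_west_q(X, e, L):
    """Q-hat of (10-16): S0 plus weighted cross products up to lag L."""
    n = X.shape[0]
    q = (X * (e ** 2)[:, None]).T @ X / n                  # S0
    for l in range(1, L + 1):
        w = 1.0 - l / (L + 1.0)                            # w_l = 1 - l/(L+1)
        # (1/n) * sum over t = l+1..n of e_t e_{t-l} x_t x_{t-l}'
        cross = (X[l:] * (e[l:] * e[:-l])[:, None]).T @ X[:-l] / n
        q += w * (cross + cross.T)                         # adds x_{t-l} x_t' as well
    return q


def newey_west_cov(X, y, L):
    """OLS coefficients and n (X'X)^{-1} Q-hat (X'X)^{-1}."""
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ (X.T @ y)
    e = y - X @ b
    return b, X.shape[0] * xtx_inv @ newey_west_q(X, e, L) @ xtx_inv


# Simulated time series (hypothetical): AR(1) disturbances with coefficient 0.7.
rng = np.random.default_rng(3)
T = 400
x = rng.standard_normal(T)
shocks = rng.standard_normal(T)
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = 0.7 * eps[t - 1] + shocks[t]
y = 1.0 + 0.5 * x + eps
X = np.column_stack([np.ones(T), x])

L = int(T ** 0.25)
b, v_nw = newey_west_cov(X, y, L)
print("lag length:", L)
print("Newey-West s.e.:", np.sqrt(np.diag(v_nw)))
```

Setting `L = 0` drops every cross-product term and returns the White estimator, so the heteroscedasticity-only case is nested in this computation.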

    10.4 GENERALIZED METHOD OF MOMENTS ESTIMATION

    We will analyze this estimation technique in some detail in Chapter 18, so we will only sketch the important results here. It is useful to consider the instrumental variables case, as it is fairly general and we can easily specialize it to the simpler regression model if that is appropriate. Thus, we depart from the model specification in (10-1), but at this point, we no longer require that $E[\varepsilon_i \mid x_i] = 0$. Instead, we adopt the instrumental variables formulation in Section 10.2.4. That is, our model is

    $$y_i = x_i'\beta + \varepsilon_i, \qquad E[\varepsilon_i \mid z_i] = 0$$

    for K variables in $x_i$ and for some set of L instrumental variables, $z_i$, where $L \ge K$. The earlier case of the generalized regression model arises if $z_i = x_i$, and the classical regression form results if we add $\Omega = I$ as well, so this is a convenient encompassing model framework.

    In the next section on generalized least squares estimation, we will consider two cases, first with a known $\Omega$, then with an unknown $\Omega$ that must be estimated. In estimation by the generalized method of moments, neither of these approaches is relevant because we begin with much less (assumed) knowledge about the data generating process. In particular, we will consider three cases:

    • Classical regression: $\mathrm{Var}[\varepsilon_i \mid X, Z] = \sigma^2$,
    • Heteroscedasticity: $\mathrm{Var}[\varepsilon_i \mid X, Z] = \sigma_i^2$,
    • Generalized model: $\mathrm{Cov}[\varepsilon_t, \varepsilon_s \mid X, Z] = \sigma^2\omega_{ts}$,

    where Z and X are the $n \times L$ and $n \times K$ observed data matrices. (We assume, as will often be true, that the fully general case will apply in a time-series setting. Hence the change in the subscripts.) No specific distribution is assumed for the disturbances, conditional or unconditional.

    The assumption $E[\varepsilon_i \mid z_i] = 0$ implies the following orthogonality condition:

    $$\mathrm{Cov}[z_i, \varepsilon_i] = 0, \quad \text{or} \quad E[z_i(y_i - x_i'\beta)] = 0.$$

    By summing the terms, we find that this further implies the population moment equation,

    $$E\left[\frac{1}{n}\sum_{i=1}^{n} z_i(y_i - x_i'\beta)\right] = E[\bar{m}(\beta)] = 0. \tag{10-17}$$

    This relationship suggests how we might now proceed to estimate $\beta$. Note, in fact, that if $z_i = x_i$, then this is just the population counterpart to the least squares normal equations.


    So, as a guide to estimation, this would return us to least squares. Suppose we now translate this population expectation into a sample analog and use that as our guide for estimation. That is, if the population relationship holds for the true parameter vector, $\beta$, suppose we attempt to mimic this result with a sample counterpart, or empirical moment equation,

    $$\frac{1}{n}\sum_{i=1}^{n} z_i(y_i - x_i'\hat{\beta}) = \frac{1}{n}\sum_{i=1}^{n} m_i(\hat{\beta}) = \bar{m}(\hat{\beta}) = 0. \tag{10-18}$$

    In the absence of other information about the data generating process, we can use the empirical moment equation as the basis of our estimation strategy.

    The empirical moment condition is L equations (the number of variables in Z) in K unknowns (the number of parameters we seek to estimate). There are three possibilities to consider:

    1. Underidentified: L < K. If there are fewer moment equations than there are parameters, then it will not be possible to find a solution to the equation system in (10-18). With no other information, such as restrictions which would reduce the number of free parameters, there is no need to proceed any further with this case.

    For the identified cases, it is convenient to write (10-18) as

    $$\bar{m}(\hat{\beta}) = \left(\frac{1}{n}Z'y\right) - \left(\frac{1}{n}Z'X\right)\hat{\beta}. \tag{10-19}$$

    2. Exactly identified. If L = K, then you can easily show (we leave it as an exercise) that the single solution to our equation system is the familiar instrumental variables estimator,

    $$\hat{\beta} = (Z'X)^{-1}Z'y. \tag{10-20}$$

    3. Overidentified. If L > K, then there is no unique solution to the equation system $\bar{m}(\hat{\beta}) = 0$. In this instance, we need to formulate some strategy to choose an estimator. One intuitively appealing possibility which has served well thus far is "least squares." In this instance, that would mean choosing the estimator based on the criterion function

    $$\operatorname{Min}_{\beta}\ q = \bar{m}(\hat{\beta})'\,\bar{m}(\hat{\beta}).$$

    We do keep in mind that we will only be able to minimize this at some positive value; there is no exact solution to (10-18) in the overidentified case. Also, you can verify that if we treat the exactly identified case as if it were overidentified, that is, use least squares anyway, we will still obtain the IV estimator shown in (10-20) for the solution to case (2). For the overidentified case, the first order conditions are

    $$\frac{\partial q}{\partial \hat{\beta}} = 2\left(\frac{\partial \bar{m}'(\hat{\beta})}{\partial \hat{\beta}}\right)\bar{m}(\hat{\beta}) = 2\,\bar{G}(\hat{\beta})'\,\bar{m}(\hat{\beta}) = -2\left(\frac{1}{n}X'Z\right)\left(\frac{1}{n}Z'y - \frac{1}{n}Z'X\hat{\beta}\right) = 0. \tag{10-21}$$

    We leave it as an exercise to show that the solution in both cases (2) and (3) is now

    $$\hat{\beta} = [(X'Z)(Z'X)]^{-1}(X'Z)(Z'y). \tag{10-22}$$
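The estimator in (10-22) and the criterion it minimizes can be checked numerically. The sketch below uses simulated data with three instruments (including the constant) and two regressors, so L = 3 > K = 2; the data design, the seed, and the function names are assumptions made for illustration.

```python
import numpy as np


def mm_estimator(X, Z, y):
    """beta-hat = [(X'Z)(Z'X)]^{-1} (X'Z)(Z'y), equation (10-22)."""
    xz = X.T @ Z
    return np.linalg.solve(xz @ xz.T, xz @ (Z.T @ y))


def criterion(beta, X, Z, y):
    """q = m-bar' m-bar with m-bar = (1/n) Z'(y - X beta), from (10-19)."""
    m_bar = Z.T @ (y - X @ beta) / len(y)
    return m_bar @ m_bar


# Simulated data (hypothetical values).
rng = np.random.default_rng(4)
n = 500
z1, z2, u = rng.standard_normal((3, n))
x = z1 + 0.5 * z2 + u
y = 1.0 + 2.0 * x + 0.8 * u + rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z1, z2])

# Overidentified case: L = 3 instruments, K = 2 parameters.
beta_over = mm_estimator(X, Z, y)

# Exactly identified case: keep only the first two instruments.
Z_exact = Z[:, :2]
beta_exact = mm_estimator(X, Z_exact, y)
beta_iv = np.linalg.solve(Z_exact.T @ X, Z_exact.T @ y)   # (10-20)

print("overidentified:    ", beta_over, "q =", criterion(beta_over, X, Z, y))
print("exactly identified:", beta_exact, "q =", criterion(beta_exact, X, Z_exact, y))
print("IV (10-20):        ", beta_iv)
```

In the exactly identified case the criterion is driven to zero up to rounding error and the result matches (10-20); in the overidentified case the minimized criterion stays positive, as the text notes.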


    The estimator in (10-22) is a hybrid that we have not encountered before, though if L = K, then it does reduce to the earlier one in (10-20). (In the overidentified case, (10-22) is not an IV estimator; it is, as we have sought, a method of moments estimator.)

    It remains to establish consistency and to obtain the asymptotic distribution and an asymptotic covariance matrix for the estimator. These are analyzed in detail in Chapter 18. Our purpose here is only to sketch the formal result, so we will merely claim the intermediate results we need:

    ASSUMPTION GMM1. Convergence of the moments. The sample moment, evaluated at the true parameter vector, converges in probability to its population counterpart. That is, $\bar{m}(\beta) \to 0$. Different circumstances will produce different kinds of convergence, but we will require it in some form. For the simplest cases, such as a model of heteroscedasticity, this will be convergence in mean square. Certain time-series models that involve correlated observations will necessitate some other form of convergence. But in any of the cases we consider, we will require the general result $\operatorname{plim} \bar{m}(\beta) = 0$.

    ASSUMPTION GMM2. Identification. The parameters are identified in terms of the moment equations. Identification means, essentially, that a large enough sample will contain sufficient information for us actually to estimate $\beta$ consistently using the sample moments. There are two conditions which must be met: an order condition, which we have already assumed ($L \ge K$), and a rank condition, which states that the moment equations are not redundant. The rank condition implies the order condition, so we need only formalize it:

    Identification condition for GMM Estimation: The $L \times K$ matrix

    $$\Gamma(\beta) = E[\bar{G}(\beta)] = \operatorname{plim}\bar{G}(\beta) = \operatorname{plim}\frac{\partial \bar{m}}{\partial \beta'} = \operatorname{plim}\frac{1}{n}\sum_{i=1}^{n}\frac{\partial m_i}{\partial \beta'}$$

    must have full column rank, equal to K.[10] Since an $L \times K$ matrix can have rank K only if $L \ge K$, this implies the order condition. This assumption means that this derivative matrix converges in probability to its expectation. Note that we have assumed, in addition, that the derivatives, like the moments themselves, obey a law of large numbers; they converge in probability to their expectations.

    ASSUMPTION GMM3. Limiting Normal Distribution for the Sample Moments. The sample moment, evaluated at the true parameter vector, obeys a central limit theorem or some similar variant. Since we are studying a generalized regression model, Lindeberg-Levy (D.19) will be too narrow; the observations will have different variances. Lindeberg-Feller (D.19.A) suffices in the heteroscedasticity case, but in the general case, we will ultimately require something more general. These theorems are discussed in Section 12.4 and invoked in Chapter 18.

    10 Strictly

    speaking we only require that the row rank be at least as large as K so there could be redundant that is functionally dependent moments so long as there are at least K that are functionally independent The case of rank greater than or equal to K but less than L can be ignored


    It will follow from these assumptions (again, at this point we do this without proof) that the GMM estimators that we obtain are, in fact, consistent. By virtue of the Slutsky theorem, we can transfer our limiting results above to the empirical moment equations. A proof of consistency of the GMM estimator (pursued in Chapter 18) will be based on this result. To obtain the asymptotic covariance matrix, we will simply invoke a result we will obtain more formally in Chapter 18 for generalized method of moments estimators. That is,

    $$\text{Asy. Var}[\hat\beta] = \frac{1}{n}\,[\Gamma'\Gamma]^{-1}\,\Gamma'\left\{\text{Asy. Var}[\sqrt{n}\,\bar m(\beta)]\right\}\Gamma\,[\Gamma'\Gamma]^{-1}.$$

    For the particular model we are studying here,

    $$\bar m(\beta) = \frac{1}{n}(Z'y - Z'X\beta), \qquad \bar G(\beta) = \frac{1}{n}Z'X, \qquad \Gamma(\beta) = Q_{ZX} \ \text{(from Section 10.2.4)}.$$

    (You should check in the preceding expression that the dimensions of the particular matrices and the dimensions of the various products produce the correctly configured matrix that we seek.) The remaining detail, which is the crucial one for the model we are examining, is for us to determine

    $$V = \text{Asy. Var}[\sqrt{n}\,\bar m(\beta)].$$

    Given the form of $\bar m(\beta)$,

    $$V = \frac{1}{n}\operatorname{Var}\left[\sum_{i=1}^{n} z_i\varepsilon_i\right] = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n}\sigma^2\omega_{ij}\,z_i z_j' = \sigma^2\,\frac{Z'\Omega Z}{n}$$

    for the most general case. Note that this is precisely the expression that appears in (10-6), so the question that arose there arises here once again. That is, under what conditions will this converge to a constant matrix? We take the discussion there as given. The only remaining detail is how to estimate this matrix. The answer appears in Section 10.3, where we pursued this same question in connection with robust estimation of the asymptotic covariance matrix of the least squares estimator. To review, then, what we have achieved to this point is to provide a theoretical foundation for the instrumental variables estimator. As noted earlier, this specializes to the least squares estimator. The estimators of $V$ for our three cases will be



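    • Classical regression:

    $$\hat V = \frac{(e'e/n)}{n}\sum_{i=1}^{n} z_i z_i' = \frac{(e'e/n)\,Z'Z}{n}$$

    • Heteroscedastic:

    $$\hat V = \frac{1}{n}\sum_{i=1}^{n} e_i^2\, z_i z_i' \qquad (10\text{-}23)$$

    • General:

    $$\hat V = \frac{1}{n}\left[\sum_{t=1}^{n} e_t^2\, z_t z_t' + \sum_{l=1}^{L}\sum_{t=l+1}^{n}\left(1 - \frac{l}{L+1}\right) e_t e_{t-l}\,(z_t z_{t-l}' + z_{t-l} z_t')\right].$$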
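    The three estimators of $V$ listed above translate almost line by line into matrix code. The following is a minimal NumPy sketch, not the book's own code: the function names `v_classical`, `v_white`, and `v_newey_west`, and the simulated `Z` and `e`, are illustrative assumptions. Rows of `Z` are the $z_i'$, and `e` holds residuals from some first-round consistent estimator.

```python
import numpy as np

def v_classical(Z, e):
    """(e'e/n) * (Z'Z/n): homoscedastic, nonautocorrelated disturbances."""
    n = Z.shape[0]
    return (e @ e / n) * (Z.T @ Z) / n

def v_white(Z, e):
    """(1/n) * sum_i e_i^2 z_i z_i': heteroscedasticity of unknown form."""
    n = Z.shape[0]
    return (Z * (e ** 2)[:, None]).T @ Z / n

def v_newey_west(Z, e, L):
    """White term plus Bartlett-weighted cross products up to lag L."""
    n = Z.shape[0]
    m = Z * e[:, None]                 # row t is e_t * z_t'
    V = m.T @ m                        # sum_t e_t^2 z_t z_t'
    for lag in range(1, L + 1):
        w = 1.0 - lag / (L + 1.0)      # Bartlett weight 1 - l/(L+1)
        S = m[lag:].T @ m[:-lag]       # sum_t e_t e_{t-lag} z_t z_{t-lag}'
        V += w * (S + S.T)
    return V / n

# Hypothetical data, used only to exercise the functions.
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 3))
e = rng.normal(size=50)
V_hat = v_newey_west(Z, e, L=4)
```

    With `L=0` the general estimator collapses to the heteroscedastic one, which is a convenient sanity check on any implementation.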

    We should observe that in each of these cases, we have actually used some information about the structure of $\Omega$. If it is known only that the terms in $\bar m(\beta)$ are uncorrelated, then there is a convenient estimator available,

    $$\hat V = \frac{1}{n}\sum_{i=1}^{n} m_i(\hat\beta)\, m_i(\hat\beta)',$$

    that is, the natural, empirical variance estimator. Note that this is what is being used in the heteroscedasticity case directly above. Collecting all the terms so far, then, we have

    $$\text{Est. Asy. Var}[\hat\beta] = \frac{1}{n}\,[\bar G(\hat\beta)'\bar G(\hat\beta)]^{-1}\,\bar G(\hat\beta)'\,\hat V\,\bar G(\hat\beta)\,[\bar G(\hat\beta)'\bar G(\hat\beta)]^{-1} \qquad (10\text{-}24)$$

    $$= n\,[(X'Z)(Z'X)]^{-1}(X'Z)\,\hat V\,(Z'X)\,[(X'Z)(Z'X)]^{-1}.$$

    The preceding would seem to endow the least squares or method of moments estimators with some degree of optimality, but that is not the case. We have only provided them with a different statistical motivation (and established consistency). We now consider the question of whether, since this is the generalized regression model, there is some better (more efficient) means of using the data. As before, we merely sketch the results. The class of minimum distance estimators is defined by the solutions to the criterion function

    $$\min_{\beta}\; q = \bar m(\beta)'\,W\,\bar m(\beta),$$

    where $W$ is any positive definite weighting matrix. Based on the assumptions made above, we will have the following theorem, which we claim without proof at this point:

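    THEOREM 10.5 Minimum Distance Estimators
    If $\operatorname{plim} \bar m(\beta) = 0$ and if $W$ is a positive definite matrix, then $\operatorname{plim} \hat\beta = \operatorname{Argmin}[q = \bar m(\beta)'W\bar m(\beta)] = \beta$. The minimum distance estimator is consistent. It is also asymptotically normally distributed and has asymptotic covariance matrix

    $$\text{Asy. Var}[\hat\beta_{MD}] = \frac{1}{n}\,[\bar G'W\bar G]^{-1}\,\bar G'WVW\bar G\,[\bar G'W\bar G]^{-1}.$$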

    Note that our entire preceding analysis was of the simplest minimum distance estimator, which has $W = I$. The obvious question now arises: if any $W$ produces a consistent estimator, is any $W$ better than any other one, or is it simply arbitrary? There is a firm answer, for which we have to consider two cases separately:

    • Exactly identified case: If $L = K$, that is, if the number of moment conditions is the same as the number of parameters being estimated, then $W$ is irrelevant to the solution, so on the basis of simplicity alone, the optimal $W$ is $I$.

    • Overidentified case: In this case, the "optimal" weighting matrix, that is, the $W$ which produces the most efficient estimator, is $W = V^{-1}$. That is, the best weighting matrix is the inverse of the asymptotic covariance of the moment vector.
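    Both cases can be checked numerically for the linear model, where minimizing $q$ has the closed form $\hat\beta = [(X'Z)W(Z'X)]^{-1}(X'Z)W(Z'y)$, obtained from the first-order condition $(X'Z)W(Z'y - Z'X\beta) = 0$. The sketch below is illustrative only: `min_distance` and `md_asy_var` are hypothetical helper names, the data are simulated, and `V` is simply set to a known matrix rather than estimated.

```python
import numpy as np

def min_distance(X, Z, y, W):
    """Minimize m'Wm with m = (Z'y - Z'X b)/n; closed form for a linear model."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))

def md_asy_var(X, Z, W, V):
    """(1/n) [G'WG]^{-1} G'WVWG [G'WG]^{-1} with G = Z'X/n."""
    n = X.shape[0]
    G = Z.T @ X / n
    A_inv = np.linalg.inv(G.T @ W @ G)
    return A_inv @ G.T @ W @ V @ W @ G @ A_inv / n

# Hypothetical data: K = 2 regressors, L = 3 instruments.
rng = np.random.default_rng(1)
n = 200
Z = rng.normal(size=(n, 3))
X = Z[:, :2] + 0.5 * rng.normal(size=(n, 2))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

# Exactly identified case (keep only the first K instruments): W drops out.
Zk = Z[:, :2]
W_any = np.array([[2.0, 0.3], [0.3, 1.0]])
b_identity = min_distance(X, Zk, y, np.eye(2))
b_other = min_distance(X, Zk, y, W_any)

# Overidentified case: with W = V^{-1} the sandwich collapses to (1/n)[G'V^{-1}G]^{-1}.
V = np.diag([1.0, 2.0, 3.0])
G = Z.T @ X / n
avar_opt = md_asy_var(X, Z, np.linalg.inv(V), V)
avar_identity = md_asy_var(X, Z, np.eye(3), V)
```

    The difference `avar_identity - avar_opt` should be positive semidefinite, which is the sense in which $W = V^{-1}$ is optimal.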

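    THEOREM 10.6 Generalized Method of Moments Estimator
    The Minimum Distance Estimator obtained by using $W = V^{-1}$ is the Generalized Method of Moments, or GMM, estimator. The GMM estimator is consistent, asymptotically normally distributed, and has asymptotic covariance matrix equal to

    $$\text{Asy. Var}[\hat\beta_{GMM}] = \frac{1}{n}\,[\bar G'V^{-1}\bar G]^{-1}.$$

    For the generalized regression model, these are

    $$\hat\beta_{GMM} = [(X'Z)\hat V^{-1}(Z'X)]^{-1}(X'Z)\hat V^{-1}(Z'y)$$

    and, substituting $\bar G = (1/n)Z'X$ into the preceding expression,

    $$\text{Asy. Var}[\hat\beta_{GMM}] = n\,[(X'Z)\hat V^{-1}(Z'X)]^{-1}.$$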

    We conclude this discussion by tying together what should seem to be a loose end. The GMM estimator is computed as the solution to

    $$\min_{\beta}\; q = \bar m(\beta)'\left\{\text{Asy. Var}[\sqrt{n}\,\bar m(\beta)]\right\}^{-1}\bar m(\beta),$$

    which suggests that the weighting matrix is a function of the thing we are trying to estimate. The process of GMM estimation will have to proceed in two steps: Step 1 is to obtain an estimate of $V$; Step 2 will consist of using the inverse of this $V$ as the weighting matrix in computing the GMM estimator. We will return to this in Chapter 18, so we note directly that the following is a common strategy:

    Step 1. Use $W = I$ to obtain a consistent estimator of $\beta$. Then, estimate $V$ with

    $$\hat V = \frac{1}{n}\sum_{i=1}^{n} e_i^2\, z_i z_i'$$

    in the heteroscedasticity case (i.e., the White estimator) or, for the more general case, the Newey-West estimator in (10-23).

    Step 2. Use $W = \hat V^{-1}$ to compute the GMM estimator.
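    The two steps can be sketched in a few lines for the heteroscedasticity case. This is an illustration under stated assumptions, not a reference implementation: `gmm_two_step` is a hypothetical name, the data are simulated, and the covariance line uses $(1/n)[\bar G'\hat V^{-1}\bar G]^{-1}$ with $\bar G = Z'X/n$.

```python
import numpy as np

def gmm_two_step(X, Z, y):
    """Two-step GMM for y = X b + e with instruments Z (White-type V)."""
    n = X.shape[0]
    XZ = X.T @ Z
    Zy = Z.T @ y
    # Step 1: W = I gives a consistent first-round estimator.
    b1 = np.linalg.solve(XZ @ XZ.T, XZ @ Zy)
    e = y - X @ b1
    V_hat = (Z * (e ** 2)[:, None]).T @ Z / n       # (1/n) sum e_i^2 z_i z_i'
    # Step 2: W = V_hat^{-1}.
    XZVi = XZ @ np.linalg.inv(V_hat)
    A = XZVi @ XZ.T                                 # (X'Z) V^{-1} (Z'X)
    b2 = np.linalg.solve(A, XZVi @ Zy)
    # (1/n)[G'V^{-1}G]^{-1} with G = Z'X/n equals n * A^{-1}.
    est_avar = n * np.linalg.inv(A)
    return b1, b2, V_hat, est_avar

# Hypothetical heteroscedastic data with L = 3 instruments and K = 2 regressors.
rng = np.random.default_rng(2)
n = 500
Z = rng.normal(size=(n, 3))
X = Z[:, :2] + 0.5 * rng.normal(size=(n, 2))
eps = (1.0 + np.abs(Z[:, 0])) * rng.normal(size=n)
y = X @ np.array([1.0, 2.0]) + eps
b1, b2, V_hat, est_avar = gmm_two_step(X, Z, y)
std_errors = np.sqrt(np.diag(est_avar))
```

    In the exactly identified case the second step changes nothing, so `b1` and `b2` coincide; only with $L > K$ does the reweighting matter.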

    At this point, the observant reader should have noticed that in all of the preceding, we have never actually encountered the simple instrumental variables estimator that we introduced in Section 5.4. In order to obtain this estimator, we must revert back to the classical case, that is, homoscedastic and nonautocorrelated disturbances. In that instance, the weighting matrix in Theorem 10.5 will be $W = (Z'Z)^{-1}$, and we will obtain the apparently missing result.
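    To see what this weighting matrix delivers, note that with $W = (Z'Z)^{-1}$ the minimum distance solution is $[X'Z(Z'Z)^{-1}Z'X]^{-1}X'Z(Z'Z)^{-1}Z'y$, which is the same as regressing $y$ on the projection of $X$ onto the columns of $Z$. The sketch below checks that equivalence numerically; the function names and the simulated data are assumptions made for illustration, and Section 5.4 itself is not reproduced here.

```python
import numpy as np

def iv_as_min_distance(X, Z, y):
    """Minimum distance estimator with W = (Z'Z)^{-1}."""
    W = np.linalg.inv(Z.T @ Z)
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))

def iv_by_projection(X, Z, y):
    """Same estimator via X_hat = Z (Z'Z)^{-1} Z'X, then (X_hat'X)^{-1} X_hat'y."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

# Hypothetical data with an endogenous regressor (u enters both X and y).
rng = np.random.default_rng(3)
n = 300
Z = rng.normal(size=(n, 3))
u = rng.normal(size=n)
X = np.column_stack([Z[:, 0] + Z[:, 1] + u, Z[:, 2] + rng.normal(size=n)])
y = X @ np.array([0.5, -1.0]) + u + rng.normal(size=n)

b_md = iv_as_min_distance(X, Z, y)
b_proj = iv_by_projection(X, Z, y)
```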

    10.5 EFFICIENT ESTIMATION BY GENERALIZED LEAST SQUARES

    Efficient estimation of $\beta$ in the generalized regression model requires knowledge of $\Omega$. To begin, it is useful to consider cases in which $\Omega$ is a known, symmetric, positive definite matrix. This assumption will occasionally be true, but in most models, $\Omega$ will contain unknown parameters that must also be estimated. We shall examine this case in Section 10.6.

    10.5.1 GENERALIZED LEAST SQUARES (GLS)

    Since $\Omega$ is a positive definite symmetric matrix, it can be factored into

    $$\Omega = C\Lambda C',$$

    where the columns of $C$ are the characteristic vectors of $\Omega$ and the characteristic roots of $\Omega$ are arrayed in the diagonal matrix $\Lambda$. Let $\Lambda^{1/2}$ be the diagonal matrix with $i$th diagonal element $\sqrt{\lambda_i}$, and let $T = C\Lambda^{1/2}$. Then $\Omega = TT'$. Also, let $P' = C\Lambda^{-1/2}$, so $\Omega^{-1} = P'P$. Premultiply the model in (10-1) by $P$ to obtain

    $$Py = PX\beta + P\varepsilon$$

    or

    $$y_* = X_*\beta + \varepsilon_*. \qquad (10\text{-}25)$$

    The variance of $\varepsilon_*$ is

    $$E[\varepsilon_*\varepsilon_*'] = P\sigma^2\Omega P' = \sigma^2 I,$$

    so the classical regression model applies to this transformed model. Since $\Omega$ is known, $y_*$ and $X_*$ are observed data. In the classical model, ordinary least squares is efficient; hence,

    $$\hat\beta = (X_*'X_*)^{-1}X_*'y_* = (X'P'PX)^{-1}X'P'Py = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$$

    is the efficient estimator of $\beta$. This estimator is the generalized least squares (GLS) or Aitken (1935) estimator of $\beta$. This estimator is in contrast to the ordinary least squares (OLS) estimator, which uses a "weighting matrix," $I$, instead of $\Omega^{-1}$. By appealing to the classical regression model in (10-25), we have the following theorem, which includes the generalized regression model analogs to our results of Chapters 4 and 5.
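    The equivalence between OLS on the transformed data and the direct formula $(X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$ is easy to confirm numerically. The following is a minimal sketch, assuming a known $\Omega$; the names `whitening_matrix`, `gls_direct`, and `gls_by_transformation`, and the simulated data, are illustrative and not part of the text.

```python
import numpy as np

def whitening_matrix(Omega):
    """P = Lambda^{-1/2} C', so that P'P = Omega^{-1} and P Omega P' = I."""
    lam, C = np.linalg.eigh(Omega)      # Omega = C diag(lam) C'
    return (C / np.sqrt(lam)).T

def gls_direct(X, y, Omega):
    """(X' Omega^{-1} X)^{-1} X' Omega^{-1} y without forming Omega^{-1}."""
    Oi_X = np.linalg.solve(Omega, X)    # Omega^{-1} X
    return np.linalg.solve(X.T @ Oi_X, Oi_X.T @ y)

def gls_by_transformation(X, y, Omega):
    """OLS of y* = P y on X* = P X."""
    P = whitening_matrix(Omega)
    return np.linalg.lstsq(P @ X, P @ y, rcond=None)[0]

# Hypothetical example with a known, positive definite Omega.
rng = np.random.default_rng(4)
n = 30
idx = np.arange(n)
Omega = 0.6 ** np.abs(idx[:, None] - idx[None, :])
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + np.linalg.cholesky(Omega) @ rng.normal(size=n)

b_direct = gls_direct(X, y, Omega)
b_transformed = gls_by_transformation(X, y, Omega)
P = whitening_matrix(Omega)
```

    Any matrix $P$ with $P'P = \Omega^{-1}$ would serve; the spectral factor is used here only because it mirrors the construction in the text.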


    THEOREM 10.7 Properties of the Generalized Least Squares Estimator
    If $E[\varepsilon_* \mid X_*] = 0$, then

    $$E[\hat\beta \mid X_*] = E[(X_*'X_*)^{-1}X_*'y_* \mid X_*] = \beta + E[(X_*'X_*)^{-1}X_*'\varepsilon_* \mid X_*] = \beta.$$

    The GLS estimator $\hat\beta$ is unbiased. This result is equivalent to $E[P\varepsilon \mid PX] = 0$, but since $P$ is a matrix of known constants, we return to the familiar requirement $E[\varepsilon \mid X] = 0$. The requirement that the regressors and disturbances be uncorrelated is unchanged.

    The GLS estimator is consistent if $\operatorname{plim}(1/n)X_*'X_* = Q_*$, where $Q_*$ is a finite positive definite matrix. Making the substitution, we see that this implies

    $$\operatorname{plim}\,[(1/n)X'\Omega^{-1}X]^{-1} = Q_*^{-1}. \qquad (10\text{-}26)$$

    We require the transformed data $X_* = PX$, not the original data $X$, to be well behaved.¹¹ Under the assumption in (10-1), the following hold:

    The GLS estimator is asymptotically normally distributed, with mean $\beta$ and sampling variance

    $$\operatorname{Var}[\hat\beta \mid X_*] = \sigma^2(X_*'X_*)^{-1} = \sigma^2(X'\Omega^{-1}X)^{-1}. \qquad (10\text{-}27)$$

    The GLS estimator $\hat\beta$ is the minimum variance linear unbiased estimator in the generalized regression model. This statement follows by applying the Gauss-Markov theorem to the model in (10-25). The result in Theorem 10.7 is Aitken's (1935) Theorem, and $\hat\beta$ is sometimes called the Aitken estimator. This broad result includes the Gauss-Markov theorem as a special case when $\Omega = I$.

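    For testing hypotheses, we can apply the full set of results in Chapter 6 to the transformed model in (10-25). For testing the $J$ linear restrictions, $R\beta = q$, the appropriate statistic is

    $$F[J, n-K] = \frac{(R\hat\beta - q)'[R\hat\sigma^2(X_*'X_*)^{-1}R']^{-1}(R\hat\beta - q)}{J} = \frac{(\hat\varepsilon_c'\hat\varepsilon_c - \hat\varepsilon'\hat\varepsilon)/J}{\hat\sigma^2},$$

    where the residual vector is

    $$\hat\varepsilon = y_* - X_*\hat\beta$$

    and

    $$\hat\sigma^2 = \frac{\hat\varepsilon'\hat\varepsilon}{n-K} = \frac{(y - X\hat\beta)'\Omega^{-1}(y - X\hat\beta)}{n-K}. \qquad (10\text{-}28)$$

    The constrained GLS residuals, $\hat\varepsilon_c = y_* - X_*\hat\beta_c$, are based on

    $$\hat\beta_c = \hat\beta - [X'\Omega^{-1}X]^{-1}R'\,[R(X'\Omega^{-1}X)^{-1}R']^{-1}(R\hat\beta - q).^{12}$$

    ¹¹ Once again, to allow a time trend, we could weaken this assumption a bit.

    ¹² Note that this estimator is the constrained OLS estimator using the transformed data.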
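    The two forms of the $F$ statistic, one built from the quadratic form in $R\hat\beta - q$ and one from the change in the weighted sum of squares, can be compared directly. The sketch below assumes a known $\Omega$ and uses hypothetical names (`gls_f_test`) and simulated data; it relies on the identity $\hat\varepsilon'\hat\varepsilon = (y - X\hat\beta)'\Omega^{-1}(y - X\hat\beta)$ for the transformed residuals.

```python
import numpy as np

def gls_f_test(X, y, Omega, R, q):
    """F statistic for R b = q in the GLS model, computed two ways."""
    n, K = X.shape
    J = R.shape[0]
    Oi = np.linalg.inv(Omega)
    A_inv = np.linalg.inv(X.T @ Oi @ X)          # (X' Omega^{-1} X)^{-1}
    b = A_inv @ X.T @ Oi @ y                     # unrestricted GLS
    r = y - X @ b
    s2 = r @ Oi @ r / (n - K)                    # sigma-hat^2 as in (10-28)
    d = R @ b - q
    M = R @ A_inv @ R.T
    F_wald = d @ np.linalg.solve(s2 * M, d) / J
    b_c = b - A_inv @ R.T @ np.linalg.solve(M, d)   # constrained GLS
    r_c = y - X @ b_c
    F_ssr = (r_c @ Oi @ r_c - r @ Oi @ r) / J / s2
    return F_wald, F_ssr, b_c

# Hypothetical data; test that the two slope coefficients are zero.
rng = np.random.default_rng(5)
n = 40
idx = np.arange(n)
Omega = 0.5 ** np.abs(idx[:, None] - idx[None, :])
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.3, -0.2]) + np.linalg.cholesky(Omega) @ rng.normal(size=n)
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
q = np.zeros(2)
F_wald, F_ssr, b_c = gls_f_test(X, y, Omega, R, q)
```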


    To summarize, all the results for the classical model, including the usual inference procedures, apply to the transformed model in (10-25).

    There is no precise counterpart to $R^2$ in the generalized regression model. Alternatives have been proposed, but care must be taken when using them. For example, one choice is the $R^2$ in the transformed regression, (10-25). But this regression need not have a constant term, so the $R^2$ is not bounded by zero and one. Even if there is a constant term, the transformed regression is a computational device, not the model of interest. That a good (or bad) fit is obtained in the "model" in (10-25) may be of no interest; the dependent variable in that model, $y_*$, is different from the one in the model as originally specified. The usual $R^2$ often suggests that the fit of the model is improved by a correction for heteroscedasticity and degraded by a correction for autocorrelation, but both changes can often be attributed to the computation of $y_*$. A more appealing fit measure might be based on the residuals from the original model once the GLS estimator is in hand, such as

    $$R_G^2 = 1 - \frac{(y - X\hat\beta)'(y - X\hat\beta)}{\sum_{i=1}^{n}(y_i - \bar y)^2}.$$

    Like the earlier contender, however, this measure is not bounded in the unit interval. In addition, this measure cannot be reliably used to compare models. The generalized least squares estimator minimizes the generalized sum of squares

    $$\varepsilon_*'\varepsilon_* = (y - X\beta)'\Omega^{-1}(y - X\beta),$$

    not $\varepsilon'\varepsilon$. As such, there is no assurance, for example, that dropping a variable from the model will result in a decrease in $R_G^2$, as it will in $R^2$. Other goodness-of-fit measures, designed primarily to be a function of the sum of squared residuals (raw or weighted by $\Omega^{-1}$) and to be bounded by zero and one, have been proposed.¹³ Unfortunately, they all suffer from at least one of the previously noted shortcomings. The $R^2$-like measures in this setting are purely descriptive.

    10.5.2 FEASIBLE GENERALIZED LEAST SQUARES

    To use the results of Section 10.5.1, $\Omega$ must be known. If $\Omega$ contains unknown parameters that must be estimated, then generalized least squares is not feasible. But with an unrestricted $\Omega$, there are $n(n+1)/2$ additional parameters in $\sigma^2\Omega$. This number is far too many to estimate with $n$ observations. Obviously, some structure must be imposed on the model if we are to proceed.

    The typical problem involves a small set of parameters $\theta$ such that $\Omega = \Omega(\theta)$. A commonly used formula in time series settings is

    $$\Omega(\rho) = \begin{bmatrix} 1 & \rho & \rho^2 & \rho^3 & \cdots & \rho^{n-1} \\ \rho & 1 & \rho & \rho^2 & \cdots & \rho^{n-2} \\ \vdots & & & & & \vdots \\ \rho^{n-1} & \rho^{n-2} & & \cdots & & 1 \end{bmatrix},$$

    which involves only one additional unknown parameter. A model of heteroscedasticity that also has only one new parameter is

    $$\sigma_i^2 = \sigma^2 z_i^{\theta}. \qquad (10\text{-}29)$$

    ¹³ See, for example, Judge et al. (1985, p. 32) and Buse (1973).
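    Both one-parameter structures are simple to construct. The sketch below builds the $n \times n$ matrix with $(i,j)$ element $\rho^{|i-j|}$ and the diagonal matrix with elements $z_i^{\theta}$; the function names `omega_ar1` and `omega_power` are illustrative assumptions.

```python
import numpy as np

def omega_ar1(rho, n):
    """n x n matrix with (i, j) element rho^{|i-j|}."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def omega_power(z, theta):
    """Diagonal Omega for sigma_i^2 = sigma^2 * z_i^theta (z_i > 0)."""
    return np.diag(np.asarray(z, dtype=float) ** theta)

O_ar = omega_ar1(0.5, 4)
O_het = omega_power([1.0, 4.0, 9.0], 0.5)
```

    In both cases a single scalar pins down all $n(n+1)/2$ distinct elements, which is what makes estimation feasible.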

    Suppose, then, that $\hat\theta$ is a consistent estimator of $\theta$. (We consider later how such an estimator might be obtained.) To make GLS estimation feasible, we shall use $\hat\Omega = \Omega(\hat\theta)$ instead of the true $\Omega$. The issue we consider here is whether using $\Omega(\hat\theta)$ requires us to change any of the results of Section 10.5.1.

    It would seem that if $\operatorname{plim} \hat\theta = \theta$, then using $\hat\Omega$ is asymptotically equivalent to using the true $\Omega$.¹⁴ Let the feasible generalized least squares (FGLS) estimator be denoted

    $$\hat{\hat\beta} = (X'\hat\Omega^{-1}X)^{-1}X'\hat\Omega^{-1}y.$$

    Conditions that imply that $\hat{\hat\beta}$ is asymptotically equivalent to $\hat\beta$ are

    $$\operatorname{plim}\left[\left(\frac{1}{n}X'\hat\Omega^{-1}X\right) - \left(\frac{1}{n}X'\Omega^{-1}X\right)\right] = 0 \qquad (10\text{-}30)$$

    and

    $$\operatorname{plim}\left[\left(\frac{1}{\sqrt n}X'\hat\Omega^{-1}\varepsilon\right) - \left(\frac{1}{\sqrt n}X'\Omega^{-1}\varepsilon\right)\right] = 0. \qquad (10\text{-}31)$$

    The first of these equations states that if the weighted sum of squares matrix based on the true $\Omega$ converges to a positive definite matrix, then the one based on $\hat\Omega$ converges to the same matrix. We are assuming that this is true. In the second condition, if the transformed regressors are well behaved, then the right-hand side sum will have a limiting normal distribution. This condition is exactly the one we used in Chapter 5 to obtain the asymptotic distribution of the least squares estimator; here we are using the same results for $X_*$ and $\varepsilon_*$. Therefore, (10-31) requires the same condition to hold when $\Omega$ is replaced with $\hat\Omega$.¹⁵

    These conditions, in principle, must be verified on a case-by-case basis. Fortunately, in most familiar settings, they are met. If we assume that they are, then the FGLS estimator based on $\hat\theta$ has the same asymptotic properties as the GLS estimator. This result is extremely useful. Note, especially, the following theorem.

    THEOREM 10.8 Efficiency of the FGLS Estimator
    An asymptotically efficient FGLS estimator does not require that we have an efficient estimator of $\theta$; only a consistent one is required to achieve full efficiency for the FGLS estimator.
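    As an illustration of the FGLS idea for the model $\sigma_i^2 = \sigma^2 z_i^{\theta}$, the sketch below plugs a first-round estimate of $\theta$ into the GLS formula. The way $\theta$ is estimated here, the slope from regressing $\ln e_i^2$ on a constant and $\ln z_i$ using OLS residuals, is an assumption chosen for simplicity and is not taken from this section, which defers the choice of estimator. The function name and data are hypothetical.

```python
import numpy as np

def fgls_power_variance(X, y, z):
    """Two-step FGLS assuming sigma_i^2 = sigma^2 * z_i^theta with z_i > 0."""
    # First round: OLS residuals.
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b_ols
    # Assumed estimator of theta: slope of ln(e_i^2) on ln(z_i).
    A = np.column_stack([np.ones_like(z), np.log(z)])
    theta_hat = np.linalg.lstsq(A, np.log(e ** 2), rcond=None)[0][1]
    # Second round: weighted least squares with weights 1 / sqrt(omega_i-hat).
    w = z ** (-theta_hat / 2.0)
    b_fgls = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0]
    return b_ols, theta_hat, b_fgls

# Hypothetical data generated with theta = 1.5.
rng = np.random.default_rng(7)
n = 400
z = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), z])
y = X @ np.array([2.0, 1.0]) + np.sqrt(z ** 1.5) * rng.normal(size=n)
b_ols, theta_hat, b_fgls = fgls_power_variance(X, y, z)
```

    The weighted regression is just $(X'\hat\Omega^{-1}X)^{-1}X'\hat\Omega^{-1}y$ with $\hat\Omega = \operatorname{diag}(z_i^{\hat\theta})$; the scale $\sigma^2$ cancels and need not be estimated for the coefficients.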

    ¹⁴ This equation is sometimes denoted $\operatorname{plim} \hat\Omega = \Omega$. Since $\Omega$ is $n \times n$, it cannot have a probability limit. We use this term to indicate convergence element by element.

    ¹⁵ The condition actually requires only that if the right-hand sum has any limiting distribution, then the left-hand one has the same one. Conceivably, this distribution might not be the normal distribution, but that seems unlikely except in a specially constructed, theoretical case.


    Except for the simplest cases, the finite sample properties and exact distributions of FGLS estimators are unknown. The asymptotic efficiency of FGLS estimators may not carry over to small samples because of the variability introduced by the estimated $\Omega$. Some analyses for the case of heteroscedasticity are given by Taylor (1977). A model of autocorrelation is analyzed by Griliches and Rao (1969). In both studies, the authors find that, over a broad range of parameters, FGLS is more efficient than least squares. But if the departure from the classical assumptions is not too severe, then least squares may be more efficient than FGLS in a small sample.

    10.6 MAXIMUM LIKELIHOOD ESTIMATION

    This section considers efficient estimation when the disturbances are normally distributed. As before, we consider two cases: first, to set the stage, the benchmark case of known $\Omega$, and, second, the more common case of unknown $\Omega$.¹⁶

    If the disturbances are multivariate normally distributed, then the log-likelihood function for the sample is

    $$\ln L = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'\Omega^{-1}(y - X\beta) - \frac{1}{2}\ln|\Omega|. \qquad (10\text{-}32)$$

    Since $\Omega$ is a matrix of known constants, the maximum likelihood estimator of $\beta$ is the vector that minimizes the generalized sum of squares,

    $$S_*(\beta) = (y - X\beta)'\Omega^{-1}(y - X\beta)$$

    (hence the name generalized least squares). The necessary conditions for maximizing $L$ are

    $$\frac{\partial \ln L}{\partial \beta} = \frac{1}{\sigma^2}X'\Omega^{-1}(y - X\beta) = \frac{1}{\sigma^2}X_*'(y_* - X_*\beta) = 0,$$

    $$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'\Omega^{-1}(y - X\beta) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y_* - X_*\beta)'(y_* - X_*\beta) = 0. \qquad (10\text{-}33)$$

    The solutions are the OLS estimators using the transformed data:

    $$\hat\beta_{ML} = (X_*'X_*)^{-1}X_*'y_* = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y, \qquad (10\text{-}34)$$

    $$\hat\sigma^2_{ML} = \frac{1}{n}(y_* - X_*\hat\beta)'(y_* - X_*\hat\beta) = \frac{1}{n}(y - X\hat\beta)'\Omega^{-1}(y - X\hat\beta), \qquad (10\text{-}35)$$

    which implies that with normally distributed disturbances, generalized least squares is also maximum likelihood. As in the classical regression model, the maximum likelihood estimator of $\sigma^2$ is biased. An unbiased estimator is the one in (10-28). The conclusion, which would be expected, is that when $\Omega$ is known, the maximum likelihood estimator is generalized least squares.

    When $\Omega$ is unknown and must be estimated, then it is necessary to maximize the log likelihood in (10-32) with respect to the full set of parameters $[\beta, \sigma^2, \Omega]$ simultaneously. Since an unrestricted $\Omega$ alone contains $n(n+1)/2 - 1$ parameters, it is clear that some restriction will have to be placed on the structure of $\Omega$ in order for estimation to proceed. We will examine several applications in which $\Omega = \Omega(\theta)$ for some smaller vector of parameters in the next two chapters, so we will note only a few general results at this point.

    (a) For a given value of $\theta$, the estimator of $\beta$ would be feasible GLS and the estimator of $\sigma^2$ would be the estimator in (10-35).

    (b) The likelihood equations for $\theta$ will generally be complicated functions of $\beta$ and $\sigma^2$, so joint estimation will be necessary. However, in many cases, for given values of $\beta$ and $\sigma^2$, the estimator of $\theta$ is straightforward. For example, in the model of (10-29), the iterated estimator of $\theta$ when $\beta$ and $\sigma^2$ and a prior value of $\theta$ are given is the prior value plus the slope in the regression of $(e_i^2/\hat\sigma_i^2 - 1)$ on $z_i$.

    The second step suggests a sort of back and forth iteration for this model that will work in many situations: starting with, say, OLS, iterating back and forth between (a) and (b) until convergence will produce the joint maximum likelihood estimator. This situation was examined by Oberhofer and Kmenta (1974), who showed that under some fairly weak requirements, most importantly that $\theta$ not involve $\sigma^2$ or any of the parameters in $\beta$, this procedure would produce the maximum likelihood estimator. Another implication of this formulation, which is simple to show (we leave it as an exercise), is that under the Oberhofer and Kmenta assumption, the asymptotic covariance matrix of the estimator is the same as the GLS estimator. This is the same whether $\Omega$ is known or estimated, which means that if $\theta$ and $\beta$ have no parameters in common, then exact knowledge of $\Omega$ brings no gain in asymptotic efficiency in the estimation of $\beta$ over estimation of $\beta$ with a consistent estimator of $\Omega$.

    ¹⁶ The method of maximum likelihood estimation is developed in Chapter 17.
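    For the known-$\Omega$ case, the claim that GLS maximizes the normal log likelihood can be verified numerically by evaluating (10-32) at the solutions (10-34) and (10-35) and at nearby points. The sketch below is illustrative; `log_likelihood` and `ml_known_omega` are hypothetical names and the data are simulated with a diagonal $\Omega$.

```python
import numpy as np

def log_likelihood(beta, sigma2, X, y, Omega):
    """Normal log likelihood of (10-32) for given beta, sigma^2, Omega."""
    n = y.shape[0]
    r = y - X @ beta
    quad = r @ np.linalg.solve(Omega, r)            # (y-Xb)' Omega^{-1} (y-Xb)
    _, logdet = np.linalg.slogdet(Omega)
    return (-0.5 * n * np.log(2.0 * np.pi) - 0.5 * n * np.log(sigma2)
            - quad / (2.0 * sigma2) - 0.5 * logdet)

def ml_known_omega(X, y, Omega):
    """Solutions (10-34) and (10-35) when Omega is known."""
    Oi_X = np.linalg.solve(Omega, X)
    beta = np.linalg.solve(X.T @ Oi_X, Oi_X.T @ y)
    r = y - X @ beta
    sigma2 = r @ np.linalg.solve(Omega, r) / y.shape[0]
    return beta, sigma2

# Hypothetical data with a known diagonal Omega.
rng = np.random.default_rng(6)
n = 40
Omega = np.diag(rng.uniform(0.5, 2.0, size=n))
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -1.0]) + np.sqrt(np.diag(Omega)) * rng.normal(size=n)

b_ml, s2_ml = ml_known_omega(X, y, Omega)
ll_max = log_likelihood(b_ml, s2_ml, X, y, Omega)
```

    Note that `s2_ml` divides by $n$, not $n - K$, which is the source of the bias mentioned in the text.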

    10.7 SUMMARY AND CONCLUSIONS

    This chapter has introduced a major extension of the classical linear model. By allowing for heteroscedasticity and autocorrelation in the disturbances, we expand the range of models to a large array of frameworks. We will explore these in the next several chapters. The formal concepts introduced in this chapter include how this extension affects the properties of the least squares estimator, how an appropriate estimator of the asymptotic covariance matrix of the least squares estimator can be computed in this extended modeling framework, and, finally, how to use the information about the variances and covariances of the disturbances to obtain an estimator that is more efficient than ordinary least squares.


    Key Terms and Concepts

    • Aitken's Theorem
    • Asymptotic properties
    • Autocorrelation
    • Efficient estimator
    • Feasible GLS
    • Finite sample properties
    • Generalized least squares (GLS)
    • Generalized regression model
    • GMM estimator
    • Heteroscedasticity
    • Instrumental variables estimator
    • Method of moments estimator
    • Newey-West estimator
    • Nonlinear least squares estimator
    • Order condition
    • Ordinary least squares (OLS)
    • Orthogonality condition
    • Panel data
    • Parametric
    • Population moment equation
    • Rank condition
    • Robust estimation
    • Semiparametric
    • Weighting matrix
    • White estimator

    Exercises

    1. What is the covariance matrix, $\operatorname{Cov}[\hat\beta, \hat\beta - b]$, of the GLS estimator $\hat\beta = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$ and the difference between it and the OLS estimator, $b = (X'X)^{-1}X'y$? The result plays a pivotal role in the development of specification tests in Hausman (1978).

    2. This and the next two exercises are based on the test statistic usually used to test a set of $J$ linear restrictions in the generalized regression model:

    $$F[J, n-K] = \frac{(R\hat\beta - q)'[R(X'\Omega^{-1}X)^{-1}R']^{-1}(R\hat\beta - q)/J}{(y - X\hat\beta)'\Omega^{-1}(y - X\hat\beta)/(n-K)},$$

    where $\hat\beta$ is the GLS estimator. Show that if $\Omega$ is known, if the disturbances are normally distributed, and if the null hypothesis, $R\beta = q$, is true, then this statistic is exactly distributed as $F$ with $J$ and $n-K$ degrees of freedom. What assumptions about the regressors are needed to reach this conclusion? Need they be nonstochastic?

    3. Now suppose that the disturbances are not normally distributed, although $\Omega$ is still known. Show that the limiting distribution of the previous statistic is $(1/J)$ times a chi-squared variable with $J$ degrees of freedom. (Hint: The denominator converges to $\sigma^2$.) Conclude that in the generalized regression model, the limiting distribution of the Wald statistic

    $$W = (R\hat\beta - q)'\left\{R\,(\text{Est. Var}[\hat\beta])\,R'\right\}^{-1}(R\hat\beta - q)$$

    is chi-squared with $J$ degrees of freedom, regardless of the distribution of the disturbances, as long as the data are otherwise well behaved. Note that in a finite sample, the true distribution may be approximated with an $F[J, n-K]$ distribution. It is a bit ambiguous, however, to interpret this fact as implying that the statistic is asymptotically distributed as $F$ with $J$ and $n-K$ degrees of freedom, because the limiting distribution used to obtain our result is the chi-squared, not the $F$. In this instance, the $F[J, n-K]$ is a random variable that tends asymptotically to the chi-squared variate.

    4. Finally, suppose that $\Omega$ must be estimated, but that assumptions (10-27) and (10-31) are met by the estimator. What changes are required in the development of the previous problem?

    5. In the generalized regression model, if the $K$ columns of $X$ are characteristic vectors of $\Omega$, then ordinary least squares and generalized least squares are identical. (The result is actually a bit broader; $X$ may be any linear combination of exactly $K$ characteristic vectors. This result is Kruskal's Theorem.)
       a. Prove the result directly using matrix algebra.
       b. Prove that if $X$ contains a constant term and if the remaining columns are in deviation form (so that the column sum is zero), then the model of Exercise 8 below is one of these cases. (The seemingly unrelated regressions model with identical regressor matrices, discussed in Chapter 14, is another.)

    6. In the generalized regression model, suppose that $\Omega$ is known.
       a. What is the covariance matrix of the OLS and GLS estimators of $\beta$?
       b. What is the covariance matrix of the OLS residual vector $e = y - Xb$?
       c. What is the covariance matrix of the GLS residual vector $\hat\varepsilon = y - X\hat\beta$?
       d. What is the covariance matrix of the OLS and GLS residual vectors?

    7. Suppose that $y$ has the pdf $f(y \mid x) = [1/(x'\beta)]\,e^{-y/(x'\beta)}$, $y > 0$. Then $E[y \mid x] = x'\beta$ and $\operatorname{Var}[y \mid x] = (x'\beta)^2$. For this model, prove that GLS and MLE are the same, even though this distribution involves the same parameters in the conditional mean function and the disturbance variance.

    8. Suppose that the regression model is $y = \mu + \varepsilon$, where $\varepsilon$ has a zero mean, constant variance, and equal correlation $\rho$ across observations. Then $\operatorname{Cov}[\varepsilon_i, \varepsilon_j] = \sigma^2\rho$ if $i \ne j$. Prove that the least squares estimator of $\mu$ is inconsistent. Find the characteristic roots of $\Omega$ and show that Condition 2 after Theorem 10.2 is violated.

    Greene 50240

    book

    June 17 2002

    16 21

    11 HETEROSCEDASTICITY

    11.1 INTRODUCTION

    Regression disturbances whose variances are not constant across observations are heteroscedastic. Heteroscedasticity arises in numerous applications, in both cross-section and time-series data. For example, even after accounting for firm sizes, we expect to observe greater variation in the profits of large firms than in those of small ones. The variance of profits might also depend on product diversification, research and development expenditure, and industry characteristics and therefore might also vary across firms of similar sizes. When analyzing family spending patterns, we find that there is greater variation in expenditure on certain commodity groups among high-income families than low ones due to the greater discretion allowed by higher incomes.¹

    In the heteroscedastic regression model,

    $$\operatorname{Var}[\varepsilon_i \mid x_i] = \sigma_i^2, \qquad i = 1, \ldots, n.$$

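    We continue to assume that the disturbances are pairwise uncorrelated. Thus,

    $$E[\varepsilon\varepsilon' \mid X] = \sigma^2\Omega = \sigma^2\begin{bmatrix} \omega_1 & 0 & \cdots & 0 \\ 0 & \omega_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \omega_n \end{bmatrix} = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix}.$$

    It will sometimes prove useful to write $\sigma_i^2 = \sigma^2\omega_i$. This form is an arbitrary scaling which allows us to use a normalization,

    $$\operatorname{tr}(\Omega) = \sum_{i=1}^{n}\omega_i = n.$$

    This makes the classical regression with homoscedastic disturbances a simple special case with $\omega_i = 1$, $i = 1, \ldots, n$. Intuitively, one might then think of the $\omega$s as weights that are scaled in such a way as to reflect only the variety in the disturbance variances. The scale factor $\sigma^2$ then provides the overall scaling of the disturbance process.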
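    The normalization is easy to apply to any set of variances: dividing each $\sigma_i^2$ by their average gives weights that sum to $n$, and the average itself plays the role of $\sigma^2$. A small sketch with a hypothetical helper name and made-up variances:

```python
import numpy as np

def normalize_variances(sigma2_i):
    """Split sigma_i^2 into a scale sigma^2 and weights omega_i with sum n."""
    sigma2_i = np.asarray(sigma2_i, dtype=float)
    sigma2 = sigma2_i.mean()          # overall scale: the average variance
    omega = sigma2_i / sigma2         # weights, so that sum(omega) = n
    return sigma2, omega

# Hypothetical disturbance variances for n = 4 observations.
sigma2, omega = normalize_variances([1.0, 2.0, 3.0, 6.0])
```

    With equal variances every weight is exactly 1, which recovers the homoscedastic special case.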
    Example 11.1 Heteroscedastic Regression

    The data in Appendix Table F9.1 give monthly credit card expenditure for 100 individuals, sampled from a larger sample of 13,444 people. Linear regression of monthly expenditure on a constant, age, income and its square, and a dummy variable for home ownership, using the 72 observations for which expenditure was nonzero, produces the residuals plotted in Figure 11.1. The pattern of the residuals is characteristic of a regression with heteroscedasticity.

    ¹ Prais and Houthakker (1955).


    FIGURE 11.1 Plot of Residuals Against Income. [Figure not reproduced: residuals U on the vertical axis, Income on the horizontal axis.]

    This chapter will present the heteroscedastic regression model, first in general terms, then with some specific forms of the disturbance covariance matrix. We begin by examining the consequences of heteroscedasticity for least squares estimation. We then consider robust estimation, in two frameworks. Section 11.2 presents appropriate estimators of the asymptotic covariance matrix of the least squares estimator. Section 11.3 discusses GMM estimation. Sections 11.4 to 11.7 present more specific formulations of the model. Sections 11.4 and 11.5 consider generalized (weighted) least squares, which requires knowledge at least of the form of $\Omega$. Section 11.7 presents maximum likelihood estimators for two specific widely used models of heteroscedasticity. Recent analyses of financial data, such as exchange rates, the volatility of market returns, and inflation, have found abundant evidence of clustering of large and small disturbances,² which suggests a form of heteroscedasticity in which the variance of the disturbance depends on the size of the preceding disturbance. Engle (1982) suggested the AutoRegressive, Conditionally Heteroscedastic, or ARCH, model as an alternative to the standard time-series treatments. We will examine the ARCH model in Section 11.8.

    ² Pioneering studies in the analysis of macroeconomic data include Engle (1982, 1983) and Cragg (1982).

    11.2 ORDINARY LEAST SQUARES ESTIMATION

    We showed in Section 10.2 that in the presence of heteroscedasticity, the least squares estimator $b$ is still unbiased, consistent, and asymptotically normally distributed. The asymptotic covariance matrix is

    $$\text{Asy. Var}[b] = \frac{\sigma^2}{n}\left(\operatorname{plim}\frac{1}{n}X'X\right)^{-1}\left(\operatorname{plim}\frac{1}{n}X'\Omega X\right)\left(\operatorname{plim}\frac{1}{n}X'X\right)^{-1}.$$

    Estimation of the asymptotic covariance matrix would be based on

    $$\operatorname{Var}[b \mid X] = (X'X)^{-1}\left(\sigma^2\sum_{i=1}^{n}\omega_i x_i x_i'\right)(X'X)^{-1}.$$

    [See (10-5).] Assuming, as usual, that the regressors are well behaved, so that $(X'X/n)^{-1}$ converges to a positive definite matrix, we find that the mean square consistency of $b$ depends on the limiting behavior of the matrix

    $$Q_n^* = \frac{X'\Omega X}{n} = \frac{1}{n}\sum_{i=1}^{n}\omega_i x_i x_i'. \qquad (11\text{-}1)$$

    If $Q_n^*$ converges to a positive definite matrix $Q^*$, then as $n \to \infty$, $b$ will converge to $\beta$ in mean square. Under most circumstances, if $\omega_i$ is finite for all $i$, then we would expect this result to be true. Note that $Q_n^*$ is a weighted sum of the squares and cross products of $x$ with weights $\omega_i/n$, which sum to 1. We have already assumed that another weighted sum, $X'X/n$, in which the weights are $1/n$, converges to a positive definite matrix $Q$, so it would be surprising if $Q_n^*$ did not converge as well. In general, then, we would expect that

    $$b \overset{a}{\sim} N\left[\beta,\ \frac{\sigma^2}{n}\,Q^{-1}Q^*Q^{-1}\right], \quad \text{with } Q^* = \operatorname{plim} Q_n^*.$$

    A formal proof is based on Section 5.2 with $Q_i = \omega_i x_i x_i'$.
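    When the weights $\omega_i$ are known (as in a simulation), the covariance matrix $(X'X)^{-1}(\sigma^2\sum_i\omega_i x_i x_i')(X'X)^{-1}$ and the matrix $Q_n^*$ can be computed directly. The following sketch uses hypothetical function names and simulated regressors; in applied work the $\omega_i$ would not be observed.

```python
import numpy as np

def q_star(X, omega):
    """Q*_n = X' Omega X / n = (1/n) sum_i omega_i x_i x_i'."""
    n = X.shape[0]
    return (X * omega[:, None]).T @ X / n

def ols_cov_hetero(X, sigma2, omega):
    """(X'X)^{-1} (sigma^2 sum_i omega_i x_i x_i') (X'X)^{-1}."""
    n = X.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    return XtX_inv @ (sigma2 * n * q_star(X, omega)) @ XtX_inv

# Hypothetical regressors and variance weights (normalized to sum to n).
rng = np.random.default_rng(8)
n = 100
x = rng.uniform(1.0, 4.0, size=n)
X = np.column_stack([np.ones(n), x])
omega = x ** 2 / np.mean(x ** 2)
sigma2 = 2.0
cov_true = ols_cov_hetero(X, sigma2, omega)
cov_conventional = sigma2 * np.linalg.inv(X.T @ X)
```

    With all weights equal to 1 the sandwich reduces to the conventional $\sigma^2(X'X)^{-1}$.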
    11.2.1 INEFFICIENCY OF LEAST SQUARES

    It follows from our earlier results that $b$ is inefficient relative to the GLS estimator. By how much will depend on the setting, but there is some generality to the pattern. As might be expected, the greater is the dispersion in $\omega_i$ across observations, the greater the efficiency of GLS over OLS. The impact of this on the efficiency of estimation will depend crucially on the nature of the disturbance variances. In the usual cases, in which $\omega_i$ depends on variables that appear elsewhere in the model, the greater is the dispersion in these variables, the greater will be the gain to using GLS. It is important to note, however, that both these comparisons are based on knowledge of $\Omega$. In practice, one of two cases is likely to be true. If we do have detailed knowledge of $\Omega$, the performance of the inefficient estimator is a moot point. We will use GLS or feasible GLS anyway. In the more common case, we will not have detailed knowledge of $\Omega$, so the comparison is not possible.

    11.2.2 THE ESTIMATED COVARIANCE MATRIX OF b

    If the type of heteroscedasticity is known with certainty, then the ordinary least squares estimator is undesirable; we should use generalized least squares instead. The precise form of the heteroscedasticity is usually unknown, however. In that case, generalized least squares is not usable, and we may need to salvage what we can from the results of ordinary least squares.


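    The conventionally estimated covariance matrix for the least squares estimator, $\sigma^2(X'X)^{-1}$, is inappropriate; the appropriate matrix is $\sigma^2(X'X)^{-1}(X'\Omega X)(X'X)^{-1}$. It is unlikely that these two would coincide, so the usual estimators of the standard errors are likely to be erroneous. In this section, we consider how erroneous the conventional estimator is likely to be.

    As usual,

    $$s^2 = \frac{e'e}{n-K} = \frac{\varepsilon'M\varepsilon}{n-K}, \qquad (11\text{-}2)$$

    where $M = I - X(X'X)^{-1}X'$. Expanding this equation, we obtain

    $$s^2 = \frac{\varepsilon'\varepsilon}{n-K} - \frac{\varepsilon'X(X'X)^{-1}X'\varepsilon}{n-K}. \qquad (11\text{-}3)$$

    Taking the two parts separately yields

    $$E\left[\frac{\varepsilon'\varepsilon}{n-K}\,\Big|\,X\right] = \frac{\operatorname{tr}E[\varepsilon\varepsilon' \mid X]}{n-K} = \frac{n\sigma^2}{n-K}. \qquad (11\text{-}4)$$

    [We have used the scaling $\operatorname{tr}(\Omega) = n$.] In addition,

    $$E\left[\frac{\varepsilon'X(X'X)^{-1}X'\varepsilon}{n-K}\,\Big|\,X\right] = \frac{\operatorname{tr}\{E[(X'X)^{-1}X'\varepsilon\varepsilon'X \mid X]\}}{n-K} = \frac{\operatorname{tr}\left[\sigma^2\left(\frac{X'X}{n}\right)^{-1}\left(\frac{X'\Omega X}{n}\right)\right]}{n-K} = \frac{\sigma^2}{n-K}\operatorname{tr}\left[\left(\frac{X'X}{n}\right)^{-1}Q_n^*\right], \qquad (11\text{-}5)$$

    where $Q_n^*$ is defined in (11-1). As $n \to \infty$, the term in (11-4) will converge to $\sigma^2$. The term in (11-5) will converge to zero if $b$ is consistent because both matrices in the product are finite. Therefore:

    If $b$ is consistent, then $\lim_{n\to\infty} E[s^2] = \sigma^2$.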

    It can also be shown we leave it as an exercise that if the fourth moment of every disturbance is nite and all our other assumptions are met then
    n

    lim Var

    ee lim Var 0 n n K n K

    This result implies therefore that If plim b then plim s 2 2 Before proceeding it is useful to pursue this result The normalization tr n implies that 2 2 1 n i2
    i

    and

    i

    i2 2

    Therefore our previous convergence result implies that the least squares estimator s 2 converges to plim 2 that is the probability limit of the average variance of the disturbances assuming that this probability limit exists Thus some further assumption


    about these variances is necessary to obtain the result. (For an application, see Exercise 5 in Chapter 13.) The difference between the conventional estimator and the appropriate (true) covariance matrix for b is

    $$\text{Est.Var}[\mathbf{b} \mid \mathbf{X}] - \operatorname{Var}[\mathbf{b} \mid \mathbf{X}] = s^2(\mathbf{X}'\mathbf{X})^{-1} - \sigma^2(\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\boldsymbol{\Omega}\mathbf{X})(\mathbf{X}'\mathbf{X})^{-1}. \tag{11-6}$$

    In a large sample (so that $s^2 \approx \sigma^2$), this difference is approximately equal to

    $$\mathbf{D} = \frac{\sigma^2}{n}\left(\frac{\mathbf{X}'\mathbf{X}}{n}\right)^{-1}\left[\frac{\mathbf{X}'\mathbf{X}}{n} - \frac{\mathbf{X}'\boldsymbol{\Omega}\mathbf{X}}{n}\right]\left(\frac{\mathbf{X}'\mathbf{X}}{n}\right)^{-1}. \tag{11-7}$$

    The difference between the two matrices hinges on

    $$\frac{\mathbf{X}'\mathbf{X}}{n} - \frac{\mathbf{X}'\boldsymbol{\Omega}\mathbf{X}}{n} = \sum_{i=1}^{n}\frac{1}{n}\mathbf{x}_i\mathbf{x}_i' - \sum_{i=1}^{n}\frac{\omega_i}{n}\mathbf{x}_i\mathbf{x}_i' = \frac{1}{n}\sum_{i=1}^{n}(1-\omega_i)\mathbf{x}_i\mathbf{x}_i', \tag{11-8}$$

    where $\mathbf{x}_i'$ is the ith row of X. These are two weighted averages of the matrices $\mathbf{Q}_i = \mathbf{x}_i\mathbf{x}_i'$, using weights 1 for the first term and $\omega_i$ for the second. The scaling $\operatorname{tr}(\boldsymbol{\Omega}) = n$ implies that $\sum_i(\omega_i/n) = 1$. Whether the weighted average based on $\omega_i/n$ differs much from the one using $1/n$ depends on the weights. If the weights are related to the values in $\mathbf{x}_i$, then the difference can be considerable. If the weights are uncorrelated with $\mathbf{x}_i\mathbf{x}_i'$, however, then the weighted average will tend to equal the unweighted average.^3 Therefore, the comparison rests on whether the heteroscedasticity is related to any of $x_k$ or $x_j \times x_k$. The conclusion is that, in general:

    If the heteroscedasticity is not correlated with the variables in the model, then at least in large samples, the ordinary least squares computations, although not the optimal way to use the data, will not be misleading.

    For example, in the groupwise heteroscedasticity model of Section 11.7.2, if the observations are grouped in the subsamples in a way that is unrelated to the variables in X, then the usual OLS estimator of Var[b] will, at least in large samples, provide a reliable estimate of the appropriate covariance matrix. It is worth remembering, however, that the least squares estimator will be inefficient, the more so the larger are the differences among the variances of the groups.^4

    The preceding is a useful result, but one should not be overly optimistic. First, it remains true that ordinary least squares is demonstrably inefficient. Second, if the primary assumption of the analysis (that the heteroscedasticity is unrelated to the variables in the model) is incorrect, then the conventional standard errors may be quite far from the appropriate values.
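    The comparison of the two weighted averages can be illustrated numerically. The following is a minimal sketch on simulated data, not a computation from the text: the function name `moment_averages`, the sample size, and both weight schemes are assumptions chosen only to contrast weights that are unrelated to x with weights that are proportional to x squared.

```python
import numpy as np


def moment_averages(x, omega):
    """Return (x'x/n, x'Omega x/n) for one regressor x and diagonal weights omega."""
    x = np.asarray(x, dtype=float)
    omega = np.asarray(omega, dtype=float)
    n = x.size
    omega = omega * n / omega.sum()          # enforce the scaling tr(Omega) = n
    unweighted = np.sum(x * x) / n           # weights 1/n
    weighted = np.sum(omega * x * x) / n     # weights omega_i/n
    return unweighted, weighted


rng = np.random.default_rng(0)               # illustrative simulated data
n = 200_000
x = rng.normal(size=n)

# Weights drawn independently of x: the two averages should nearly agree.
plain, wtd_indep = moment_averages(x, rng.uniform(0.5, 1.5, size=n))
# Weights proportional to x^2: the weighted average moves far from the plain one.
_, wtd_related = moment_averages(x, x ** 2)

print(plain, wtd_indep, wtd_related)
```

    In the second case the conventional covariance estimator would be based on the wrong moment matrix, which is the situation in which the usual standard errors can mislead.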
    11.2.3 ESTIMATING THE APPROPRIATE COVARIANCE MATRIX FOR ORDINARY LEAST SQUARES

    It is clear from the preceding that heteroscedasticity has some potentially serious implications for inferences based on the results of least squares. The application of more
    3 Suppose, for example, that X contains a single column and that both $x_i$ and $\omega_i$ are independent and identically distributed random variables. Then $\mathbf{x}'\mathbf{x}/n$ converges to $E[x_i^2]$, whereas $\mathbf{x}'\boldsymbol{\Omega}\mathbf{x}/n$ converges to $\operatorname{Cov}[\omega_i, x_i^2] + E[\omega_i]E[x_i^2]$. $E[\omega_i] = 1$, so if $\omega$ and $x^2$ are uncorrelated, then the sums have the same probability limit.
    4 Some general results, including analysis of the properties of the estimator based on estimated variances, are given in Taylor (1977).


    appropriate estimation techniques requires a detailed formulation of $\boldsymbol{\Omega}$, however. It may well be that the form of the heteroscedasticity is unknown. White (1980) has shown that it is still possible to obtain an appropriate estimator for the variance of the least squares estimator, even if the heteroscedasticity is related to the variables in X. The White estimator [see (10-14) in Section 10.3^5]

    $$\text{Est.Asy.Var}[\mathbf{b}] = \frac{1}{n}\left(\frac{\mathbf{X}'\mathbf{X}}{n}\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^{n} e_i^2\,\mathbf{x}_i\mathbf{x}_i'\right)\left(\frac{\mathbf{X}'\mathbf{X}}{n}\right)^{-1}, \tag{11-9}$$

    where $e_i$ is the ith least squares residual, can be used as an estimate of the asymptotic variance of the least squares estimator. A number of studies have sought to improve on the White estimator for OLS.^6 The asymptotic properties of the estimator are unambiguous, but its usefulness in small samples is open to question. The possible problems stem from the general result that the squared OLS residuals tend to underestimate the squares of the true disturbances. [That is why we use $1/(n-K)$ rather than $1/n$ in computing $s^2$.] The end result is that in small samples, at least as suggested by some Monte Carlo studies [e.g., MacKinnon and White (1985)], the White estimator is a bit too optimistic; the matrix is a bit too small, so asymptotic t ratios are a little too large. Davidson and MacKinnon (1993, p. 554) suggest a number of fixes, which include (1) scaling up the end result by a factor $n/(n-K)$ and (2) using the squared residual scaled by its true variance, $e_i^2/m_{ii}$, instead of $e_i^2$, where $m_{ii} = 1 - \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i$.^7 [See (4-20).] On the basis of their study, Davidson and MacKinnon strongly advocate one or the other correction. Their admonition "One should never use [the White estimator] because [(2)] always performs better" seems a bit strong, but the point is well taken. The use of sharp asymptotic results in small samples can be problematic. The last two rows of Table 11.1 show the recomputed standard errors with these two modifications.
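    The White estimator and the two Davidson and MacKinnon fixes can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the computation behind Table 11.1: the function name `white_covariances` and the simulated data are hypothetical, and the code uses the algebraically equivalent form $(\mathbf{X}'\mathbf{X})^{-1}[\sum_i e_i^2\mathbf{x}_i\mathbf{x}_i'](\mathbf{X}'\mathbf{X})^{-1}$, in which the factors of n cancel.

```python
import numpy as np


def white_covariances(X, y):
    """OLS coefficients with the White, D&M (1), and D&M (2) covariance estimates."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b

    def sandwich(weights):
        # (X'X)^{-1} [sum_i weights_i x_i x_i'] (X'X)^{-1}
        meat = (X * weights[:, None]).T @ X
        return XtX_inv @ meat @ XtX_inv

    # m_ii = 1 - x_i'(X'X)^{-1} x_i
    m = 1.0 - np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    white = sandwich(e ** 2)        # White estimator, as in (11-9)
    dm1 = white * n / (n - K)       # fix (1): scale by n/(n-K)
    dm2 = sandwich(e ** 2 / m)      # fix (2): use e_i^2 / m_ii
    return b, white, dm1, dm2


# Hypothetical data: disturbance standard deviation proportional to x.
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n) * x
b, white, dm1, dm2 = white_covariances(X, y)
print(np.sqrt(np.diag(white)), np.sqrt(np.diag(dm1)), np.sqrt(np.diag(dm2)))
```

    Because $m_{ii} \le 1$, both corrections can only enlarge the estimated variances, which is the direction the Monte Carlo evidence cited above calls for.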
    Example 11.2 The White Estimator

    Using White's estimator for the regression in Example 11.1 produces the results in the row labeled "White S.E." in Table 11.1. The two income coefficients are individually and jointly statistically significant based on the individual t ratios and F[2, 67] = [(0.244 - 0.064)/2]/[0.776/(72 - 5)] = 7.771. The 1 percent critical value is 4.94. The differences in the estimated standard errors seem fairly minor given the extreme heteroscedasticity. One surprise is the decline in the standard error of the age coefficient. The F test is no longer available for testing the joint significance of the two income coefficients because it relies on homoscedasticity. A Wald test, however, may be used in any event. The chi-squared test is based on

    $$W = (\mathbf{R}\mathbf{b})'\left[\mathbf{R}\left(\text{Est.Asy.Var}[\mathbf{b}]\right)\mathbf{R}'\right]^{-1}(\mathbf{R}\mathbf{b}) \quad\text{where}\quad \mathbf{R} = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{bmatrix},$$

    and the estimated asymptotic covariance matrix is the White estimator. The F statistic based on least squares is 7.771. The Wald statistic based on the White estimator is 20.604; the 95 percent critical value for the chi-squared distribution with two degrees of freedom is 5.99, so the conclusion is unchanged.
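    The Wald computation in the example is a single quadratic form. The sketch below assumes a hypothetical helper `wald_statistic` and uses made-up numbers for b and its covariance matrix (they are not the estimates from Table 11.1); only the matrix R matches the example.

```python
import numpy as np


def wald_statistic(b, V, R):
    """W = (Rb)' [R V R']^{-1} (Rb) for the hypothesis R beta = 0."""
    b = np.asarray(b, dtype=float)
    V = np.asarray(V, dtype=float)
    R = np.atleast_2d(np.asarray(R, dtype=float))
    d = R @ b
    return float(d @ np.linalg.solve(R @ V @ R.T, d))


# R selects the last two of five coefficients, as in the example.
R = np.array([[0.0, 0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 0.0, 1.0]])

# Made-up inputs for illustration only.
b = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
V = np.diag([1.0, 1.0, 1.0, 4.0, 25.0])
W = wald_statistic(b, V, R)
print(W)  # 4^2/4 + 5^2/25 = 5.0
```

    In practice V would be a heteroscedasticity-robust estimate such as the White estimator, and W would be compared with a chi-squared critical value whose degrees of freedom equal the number of rows of R.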
    5 See also Eicker (1967), Horn, Horn, and Duncan (1975), and MacKinnon and White (1985).
    6 See, e.g., MacKinnon and White (1985) and Messer and White (1984).
    7 They also suggest a third correction, $e_i^2/m_{ii}^2$, as an approximation to an estimator based on the "jackknife" technique, but their advocacy of this estimator is much weaker than that of the other two.


    TABLE 11.1  Least Squares Regression Results

                       Constant       Age    OwnRent    Income   Income^2
    Sample Mean                     32.08       0.36     3.369
    Coefficient         -237.15   -3.0818     27.941    234.35    -14.997
    Standard Error       199.35    5.5147     82.922    80.366     7.4693
    t ratio               -1.10   -0.5590      0.337     2.916     -2.008
    White S.E.           212.99    3.3017     92.188    88.866     6.9446
    D and M (1)          270.79    3.4227     95.566    92.122     7.1991
    D and M (2)          221.09    3.4477     95.632    92.083     7.1995

    R^2 = 0.243578, s = 284.75080. Mean Expenditure = 189.02. Income is in units of 10,000.
    Tests for Heteroscedasticity: White = 14.329, Goldfeld-Quandt = 15.001, Breusch-Pagan = 41.920, Koenker-Bassett = 6.187. (Two degrees of freedom. Chi-squared critical value = 5.99.)

    11.3 GMM ESTIMATION OF THE HETEROSCEDASTIC REGRESSION MODEL

    The GMM estimator in the heteroscedastic regression model is produced by the empirical moment equations

    $$\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\left(y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}_{GMM}\right) = \frac{1}{n}\mathbf{X}'\hat{\boldsymbol{\varepsilon}}\left(\hat{\boldsymbol{\beta}}_{GMM}\right) = \bar{\mathbf{m}}\left(\hat{\boldsymbol{\beta}}_{GMM}\right) = \mathbf{0}. \tag{11-10}$$

    The estimator is obtained by minimizing $q = \bar{\mathbf{m}}(\hat{\boldsymbol{\beta}}_{GMM})'\mathbf{W}\bar{\mathbf{m}}(\hat{\boldsymbol{\beta}}_{GMM})$, where W is a positive definite weighting matrix. The optimal weighting matrix would be $\mathbf{W} = \{\text{Asy.Var}[\sqrt{n}\,\bar{\mathbf{m}}(\boldsymbol{\beta})]\}^{-1}$, which is the inverse of

    $$\text{Asy.Var}[\sqrt{n}\,\bar{\mathbf{m}}(\boldsymbol{\beta})] = \text{Asy.Var}\left[\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\mathbf{x}_i\varepsilon_i\right] = \operatorname*{plim}_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\sigma^2\omega_i\mathbf{x}_i\mathbf{x}_i' = \sigma^2\mathbf{Q}^*$$

    [see (11-1)]. The optimal weighting matrix would be $[\sigma^2\mathbf{Q}^*]^{-1}$. But recall that this minimization problem is an exactly identified case, so the weighting matrix is irrelevant to the solution. You can see that in the moment equation: that equation is simply the normal equations for least squares. We can solve the moment equations exactly, so there is no need for the weighting matrix. Regardless of the covariance matrix of the moments, the GMM estimator for the heteroscedastic regression model is ordinary least squares. (This is Case 2 analyzed in Section 10.4.) We can use the results we have already obtained to find its asymptotic covariance matrix. The result appears in Section 11.2. The implied estimator is the White estimator in (11-9). [Once again, see Theorem 10.6.] The conclusion to be drawn at this point is that until we make some specific assumptions about the variances, we do not have a more efficient estimator than least squares, but we do have to modify the estimated asymptotic covariance matrix.


    11.4 TESTING FOR HETEROSCEDASTICITY

    Heteroscedasticity poses potentially severe problems for inferences based on least squares. One can rarely be certain that the disturbances are heteroscedastic, however, and unfortunately, what form the heteroscedasticity takes if they are. As such, it is useful to be able to test for homoscedasticity and, if necessary, modify our estimation procedures accordingly.^8 Several types of tests have been suggested. They can be roughly grouped in descending order in terms of their generality and, as might be expected, in ascending order in terms of their power.^9 Most of the tests for heteroscedasticity are based on the following strategy. Ordinary least squares is a consistent estimator of $\boldsymbol{\beta}$ even in the presence of heteroscedasticity. As such, the ordinary least squares residuals will mimic, albeit imperfectly because of sampling variability, the heteroscedasticity of the true disturbances. Therefore, tests designed to detect heteroscedasticity will, in most cases, be applied to the ordinary least squares residuals.
    11.4.1 WHITE'S GENERAL TEST

    To formulate most of the available tests, it is necessary to specify, at least in rough terms, the nature of the heteroscedasticity. It would be desirable to be able to test a general hypothesis of the form

    $$H_0: \sigma_i^2 = \sigma^2 \text{ for all } i, \qquad H_1: \text{Not } H_0.$$

    In view of our earlier findings on the difficulty of estimation in a model with n unknown parameters, this is rather ambitious. Nonetheless, such a test has been devised by White (1980b). The correct covariance matrix for the least squares estimator is

    $$\operatorname{Var}[\mathbf{b} \mid \mathbf{X}] = \sigma^2[\mathbf{X}'\mathbf{X}]^{-1}[\mathbf{X}'\boldsymbol{\Omega}\mathbf{X}][\mathbf{X}'\mathbf{X}]^{-1}, \tag{11-11}$$

    which, as we have seen, can be estimated using (11-9). The conventional estimator is $\mathbf{V} = s^2[\mathbf{X}'\mathbf{X}]^{-1}$. If there is no heteroscedasticity, then V will give a consistent estimator of $\operatorname{Var}[\mathbf{b} \mid \mathbf{X}]$, whereas if there is, then it will not. White has devised a statistical test based on this observation. A simple operational version of his test is carried out by obtaining $nR^2$ in the regression of $e_i^2$ on a constant and all unique variables contained in x and all the squares and cross products of the variables in x. The statistic is asymptotically distributed as chi-squared with $P - 1$ degrees of freedom, where P is the number of regressors in the equation, including the constant. The White test is extremely general. To carry it out, we need not make any specific assumptions about the nature of the heteroscedasticity. Although this characteristic is a virtue, it is, at the same time, a potentially serious shortcoming. The test may reveal
    8 There is the possibility that a preliminary test for heteroscedasticity will incorrectly lead us to use weighted least squares or fail to alert us to heteroscedasticity and lead us improperly to use ordinary least squares. Some limited results on the properties of the resulting estimator are given by Ohtani and Toyoda (1980). Their results suggest that it is best to test first for heteroscedasticity rather than merely to assume that it is present.
    9 A study that examines the power of several tests for heteroscedasticity is Ali and Giaccotto (1984).


    heteroscedasticity, but it may instead simply identify some other specification error (such as the omission of $x^2$ from a simple regression).^10 Except in the context of a specific problem, little can be said about the power of White's test; it may be very low against some alternatives. In addition, unlike some of the other tests we shall discuss, the White test is nonconstructive. If we reject the null hypothesis, then the result of the test gives no indication of what to do next.
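    The operational version of White's test (regress the squared residuals on the unique levels, squares, and cross products, then form $nR^2$) can be sketched as follows. The function name `white_test`, the rank-based rule for discarding redundant columns, and the simulated data are assumptions made for illustration.

```python
import numpy as np
from itertools import combinations_with_replacement


def white_test(X, e):
    """Return (n * R^2, P - 1) for White's general test.

    X : (n, K) regressor matrix whose first column is the constant.
    e : ordinary least squares residuals.
    """
    X = np.asarray(X, dtype=float)
    e = np.asarray(e, dtype=float)
    n, K = X.shape
    # All products x_j * x_k with j <= k. Products involving the constant
    # reproduce the constant itself and the levels of the regressors.
    cols = [X[:, j] * X[:, k] for j, k in combinations_with_replacement(range(K), 2)]
    keep = []
    for c in cols:
        # Keep a column only if it adds rank (a dummy equals its own square).
        trial = np.column_stack(keep + [c])
        if np.linalg.matrix_rank(trial) == len(keep) + 1:
            keep.append(c)
    Z = np.column_stack(keep)
    u = e ** 2
    fitted = Z @ np.linalg.lstsq(Z, u, rcond=None)[0]
    r2 = 1.0 - np.sum((u - fitted) ** 2) / np.sum((u - u.mean()) ** 2)
    return n * r2, Z.shape[1] - 1


# Hypothetical data: constant, a dummy d, and a continuous regressor x.
rng = np.random.default_rng(2)
n = 300
d = (rng.uniform(size=n) < 0.4).astype(float)
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), d, x])
y = 1.0 + 0.5 * d + 2.0 * x + rng.normal(size=n) * x
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
stat, df = white_test(X, e)
print(stat, df)
```

    Here the six candidate products reduce to five unique columns because the square of the dummy duplicates the dummy, so the statistic is referred to the chi-squared distribution with four degrees of freedom.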
    11.4.2 THE GOLDFELD-QUANDT TEST

    By narrowing our focus somewhat, we can obtain a more powerful test. Two tests that are relatively general are the Goldfeld-Quandt (1965) test and the Breusch-Pagan (1979) Lagrange multiplier test. For the Goldfeld-Quandt test, we assume that the observations can be divided into two groups in such a way that under the hypothesis of homoscedasticity, the disturbance variances would be the same in the two groups, whereas under the alternative, the disturbance variances would differ systematically. The most favorable case for this would be the groupwise heteroscedastic model of Section 11.7.2 and Example 11.7 or a model such as $\sigma_i^2 = \sigma^2 x_i^2$ for some variable x. By ranking the observations based on this x, we can separate the observations into those with high and low variances. The test is applied by dividing the sample into two groups with $n_1$ and $n_2$ observations. To obtain statistically independent variance estimators, the regression is then estimated separately with the two sets of observations. The test statistic is

    $$F[n_1 - K,\, n_2 - K] = \frac{\mathbf{e}_1'\mathbf{e}_1/(n_1 - K)}{\mathbf{e}_2'\mathbf{e}_2/(n_2 - K)}, \tag{11-12}$$

    where we assume that the disturbance variance is larger in the first sample. (If not, then reverse the subscripts.) Under the null hypothesis of homoscedasticity, this statistic has an F distribution with $n_1 - K$ and $n_2 - K$ degrees of freedom. The sample value can be referred to the standard F table to carry out the test, with a large value leading to rejection of the null hypothesis. To increase the power of the test, Goldfeld and Quandt suggest that a number of observations in the middle of the sample be omitted. The more observations that are dropped, however, the smaller the degrees of freedom for estimation in each group will be, which will tend to diminish the power of the test. As a consequence, the choice of how many central observations to drop is largely subjective. Evidence by Harvey and Phillips (1974) suggests that no more than a third of the observations should be dropped. If the disturbances are normally distributed, then the Goldfeld-Quandt statistic is exactly distributed as F under the null hypothesis and the nominal size of the test is correct. If not, then the F distribution is only approximate and some alternative method with known large-sample properties, such as White's test, might be preferable.
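    The steps of the Goldfeld-Quandt test (rank, split, optionally drop central observations, estimate separately, form the ratio in (11-12)) can be sketched as below. The function name `goldfeld_quandt`, its `drop` argument, and the simulated data are assumptions for illustration; the sketch also assumes the variance rises with the ranking variable, so the upper group goes in the numerator.

```python
import numpy as np


def goldfeld_quandt(X, y, sort_by, drop=0):
    """Return (F, df_numerator, df_denominator) for the Goldfeld-Quandt test.

    Observations are ranked by `sort_by`; `drop` central observations are
    omitted before the sample is split into a low and a high group.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(sort_by)
    X, y = X[order], y[order]
    n, K = X.shape
    n_low = (n - drop) // 2
    n_high = n - drop - n_low

    def ssr(Xg, yg):
        # Sum of squared residuals from a separate regression on one group.
        bg = np.linalg.lstsq(Xg, yg, rcond=None)[0]
        r = yg - Xg @ bg
        return float(r @ r)

    df_high, df_low = n_high - K, n_low - K
    s2_high = ssr(X[n - n_high:], y[n - n_high:]) / df_high
    s2_low = ssr(X[:n_low], y[:n_low]) / df_low
    return s2_high / s2_low, df_high, df_low


# Hypothetical data with 72 observations and K = 5 regressors.
rng = np.random.default_rng(3)
n = 72
income = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2)), income, income ** 2])
y = X @ np.ones(5) + rng.normal(size=n) * income ** 2
F, df1, df2 = goldfeld_quandt(X, y, sort_by=income)
print(F, df1, df2)
```

    With 72 observations, five regressors, and no central observations dropped, each group has 36 - 5 = 31 degrees of freedom.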
    11.4.3 THE BREUSCH-PAGAN/GODFREY LM TEST

    The Goldfeld-Quandt test has been found to be reasonably powerful when we are able to identify correctly the variable to use in the sample separation. This requirement does limit its generality, however. For example, several of the models we will consider allow
    10 Thursby (1982) considers this issue in detail.


    the disturbance variance to vary with a set of regressors. Breusch and Pagan^11 have devised a Lagrange multiplier test of the hypothesis that $\sigma_i^2 = \sigma^2 f(\alpha_0 + \boldsymbol{\alpha}'\mathbf{z}_i)$, where $\mathbf{z}_i$ is a vector of independent variables.^12 The model is homoscedastic if $\boldsymbol{\alpha} = \mathbf{0}$. The test can be carried out with a simple regression:

    LM = 1/2 explained sum of squares in the regression of $e_i^2/(\mathbf{e}'\mathbf{e}/n)$ on $\mathbf{z}_i$.

    For computational purposes, let Z be the $n \times P$ matrix of observations on $(1, \mathbf{z}_i)$, and let g be the vector of observations of $g_i = e_i^2/(\mathbf{e}'\mathbf{e}/n) - 1$. Then

    $$\text{LM} = \tfrac{1}{2}\left[\mathbf{g}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{g}\right].$$

    Under the null hypothesis of homoscedasticity, LM has a limiting chi-squared distribution with degrees of freedom equal to the number of variables in $\mathbf{z}_i$. This test can be applied to a variety of models, including, for example, those examined in Example 11.3 (3) and in Section 11.7.^13

    It has been argued that the Breusch-Pagan Lagrange multiplier test is sensitive to the assumption of normality. Koenker (1981) and Koenker and Bassett (1982) suggest that the computation of LM be based on a more robust estimator of the variance of $\varepsilon_i^2$,

    $$V = \frac{1}{n}\sum_{i=1}^{n}\left[e_i^2 - \frac{\mathbf{e}'\mathbf{e}}{n}\right]^2.$$

    The variance of $\varepsilon_i^2$ is not necessarily equal to $2\sigma^4$ if $\varepsilon_i$ is not normally distributed. Let u equal $(e_1^2, e_2^2, \ldots, e_n^2)$ and i be an $n \times 1$ column of 1s. Then $\bar{u} = \mathbf{e}'\mathbf{e}/n$. With this change, the computation becomes

    $$\text{LM} = \frac{1}{V}(\mathbf{u} - \bar{u}\,\mathbf{i})'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'(\mathbf{u} - \bar{u}\,\mathbf{i}).$$

    Under normality, this modified statistic will have the same asymptotic distribution as the Breusch-Pagan statistic, but absent normality, there is some evidence that it provides a more powerful test. Waldman (1983) has shown that if the variables in $\mathbf{z}_i$ are the same as those used for the White test described earlier, then the two tests are algebraically the same.
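    Both versions of the statistic are quadratic forms in the same projection, which the following sketch makes explicit. The function name `breusch_pagan` and the simulated data are assumptions for illustration; Z is taken to include the column of ones, as in the text.

```python
import numpy as np


def breusch_pagan(e, Z):
    """Return (LM, Koenker-Bassett variant) from residuals e and Z = [1, z_i]."""
    e = np.asarray(e, dtype=float)
    Z = np.asarray(Z, dtype=float)
    u = e ** 2
    ubar = u.mean()                       # e'e / n

    def explained(v):
        # v' Z (Z'Z)^{-1} Z' v
        Zv = Z.T @ v
        return float(Zv @ np.linalg.solve(Z.T @ Z, Zv))

    g = u / ubar - 1.0                    # g_i = e_i^2 / (e'e/n) - 1
    lm = 0.5 * explained(g)
    V = np.mean((u - ubar) ** 2)          # robust estimator of Var[eps_i^2]
    kb = explained(u - ubar) / V
    return lm, kb


# Hypothetical data: disturbance standard deviation proportional to z.
rng = np.random.default_rng(4)
n = 500
z = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), z])
y = 1.0 + 2.0 * z + rng.normal(size=n) * z
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
lm, kb = breusch_pagan(e, X)
print(lm, kb)
```

    The two statistics differ only in the scale factor: the LM form divides the explained sum of squares of $\mathbf{u} - \bar{u}\mathbf{i}$ by $2\bar{u}^2$, which is the normal-theory value of the variance, whereas the Koenker and Bassett form divides by V.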
    Example 11.3 Testing for Heteroscedasticity

    1. White's Test: For the data used in Example 11.1, there are 15 variables in $\mathbf{x} \otimes \mathbf{x}$ including the constant term. But since OwnRent^2 = OwnRent and Income x Income = Income^2, only 13 are unique. Regression of the squared least squares residuals on these 13 variables produces $R^2 = 0.199013$. The chi-squared statistic is therefore 72(0.199013) = 14.329. The 95 percent critical value of chi-squared with 12 degrees of freedom is 21.03, so despite what might seem to be obvious in Figure 11.1, the hypothesis of homoscedasticity is not rejected by this test.

    2. Goldfeld-Quandt Test: The 72 observations are sorted by Income, and then the regression is computed with the first 36 observations and the second. The two sums of squares are 326,427 and 4,894,130, so the test statistic is F[31, 31] = 4,894,130/326,427 = 15.001. The critical value from this table is 1.79, so this test reaches the opposite conclusion.

    11 Breusch and Pagan (1979).
    12 Lagrange multiplier tests are discussed in Section 17.5.3.
    13 The model $\sigma_i^2 = \sigma^2\exp(\boldsymbol{\alpha}'\mathbf{z}_i)$ is one of these cases. In analyzing this model specifically, Harvey (1976) derived the same test statistic.


    3. Breusch-Pagan Test: This test requires a specific alternative hypothesis. For this purpose, we specify the test based on z = [1, Income, IncomeSq]. Using the least squares residuals, we compute $g_i = e_i^2/(\mathbf{e}'\mathbf{e}/72) - 1$; then $\text{LM} = \tfrac{1}{2}\mathbf{g}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{g}$. The sum of squares is 5,432,562.033. The computation produces LM = 41.920. The critical value for the chi-squared distribution with two degrees of freedom is 5.99, so the hypothesis of homoscedasticity is rejected. The Koenker and Bassett variant of this statistic is only 6.187, which is still significant but much smaller than the LM statistic. The wide difference between these two statistics suggests that the assumption of normality is erroneous. Absent any knowledge of the heteroscedasticity, we might use the Bera and Jarque (1981, 1982) and Kiefer and Salmon (1983) test for normality,

    $$\chi^2[2] = n\left[\frac{(m_3/s^3)^2}{6} + \frac{(m_4/s^4 - 3)^2}{24}\right],$$

    where $m_j = (1/n)\sum_i e_i^j$. Under the null hypothesis of homoscedastic and normally distributed disturbances, this statistic has a limiting chi-squared distribution with two degrees of freedom. Based on the least squares residuals, the value is 482.12, which certainly does lead to rejection of the hypothesis. Some caution is warranted here, however. It is unclear what part of the hypothesis should be rejected. We have convincing evidence in Figure 11.1 that the disturbances are heteroscedastic, so the assumption of homoscedasticity underlying this test is questionable. This does suggest the need to examine the data before applying a specification test such as this one.
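    The normality statistic is built from the second, third, and fourth moments of the residuals. The sketch below uses the familiar Jarque-Bera form, with divisors 6 and 24 and with $s^2$ taken to be $m_2$; both choices are assumptions of this illustration, as is the function name `normality_statistic`.

```python
import numpy as np


def normality_statistic(e):
    """n * [skewness^2 / 6 + (kurtosis - 3)^2 / 24] from residual moments."""
    e = np.asarray(e, dtype=float)
    n = e.size
    m2, m3, m4 = (np.mean(e ** j) for j in (2, 3, 4))   # m_j = (1/n) sum e_i^j
    s = np.sqrt(m2)                                     # assumption: s^2 = m_2
    skewness = m3 / s ** 3
    kurtosis = m4 / s ** 4
    return n * (skewness ** 2 / 6.0 + (kurtosis - 3.0) ** 2 / 24.0)


# Tiny worked case: m2 = 1, m3 = 0, m4 = 1, so the statistic is 4 * (4/24) = 2/3.
jb = normality_statistic([-1.0, 1.0, -1.0, 1.0])
print(jb)
```

    The moments are taken about zero, which is appropriate for least squares residuals from a regression that contains a constant term, since those residuals have mean zero.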

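    11.5 WEIGHTED LEAST SQUARES WHEN $\boldsymbol{\Omega}$ IS KNOWN

    Having tested for and found evidence of heteroscedasticity, the logical next step is to revise the estimation technique to account for it. The GLS estimator is

    $$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{y}.$$

    Consider the most general case, $\operatorname{Var}[\varepsilon_i \mid \mathbf{x}_i] = \sigma_i^2 = \sigma^2\omega_i$. Then $\boldsymbol{\Omega}^{-1}$ is a diagonal matrix whose ith diagonal element is $1/\omega_i$. The GLS estimator is obtained by regressing

    $$\mathbf{P}\mathbf{y} = \begin{bmatrix} y_1/\sqrt{\omega_1} \\ y_2/\sqrt{\omega_2} \\ \vdots \\ y_n/\sqrt{\omega_n} \end{bmatrix} \quad\text{on}\quad \mathbf{P}\mathbf{X} = \begin{bmatrix} \mathbf{x}_1'/\sqrt{\omega_1} \\ \mathbf{x}_2'/\sqrt{\omega_2} \\ \vdots \\ \mathbf{x}_n'/\sqrt{\omega_n} \end{bmatrix}.$$

    Applying ordinary least squares to the transformed model, we obtain the weighted least squares (WLS) estimator,

    $$\hat{\boldsymbol{\beta}} = \left[\sum_{i=1}^{n} w_i\mathbf{x}_i\mathbf{x}_i'\right]^{-1}\left[\sum_{i=1}^{n} w_i\mathbf{x}_i y_i\right], \tag{11-13}$$

    where $w_i = 1/\omega_i$.^14 The logic of the computation is that observations with smaller variances receive a larger weight in the computations of the sums and therefore have greater influence in the estimates obtained.
    14 The weights are often denoted $w_i = 1/\sigma_i^2$. This expression is consistent with the equivalent $\hat{\boldsymbol{\beta}} = [\mathbf{X}'(\sigma^2\boldsymbol{\Omega})^{-1}\mathbf{X}]^{-1}\mathbf{X}'(\sigma^2\boldsymbol{\Omega})^{-1}\mathbf{y}$. The $\sigma^2$'s cancel, leaving the expression given previously.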
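    The equivalence between the weighted sums in (11-13) and ordinary least squares on the transformed data can be checked directly. The function names `wls` and `wls_by_transformation` and the simulated data below are assumptions for illustration.

```python
import numpy as np


def wls(X, y, omega):
    """Weighted least squares via the weighted sums, with w_i = 1 / omega_i."""
    w = 1.0 / np.asarray(omega, dtype=float)
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)


def wls_by_transformation(X, y, omega):
    """The same estimator from OLS of y_i/sqrt(omega_i) on x_i/sqrt(omega_i)."""
    p = 1.0 / np.sqrt(np.asarray(omega, dtype=float))
    return np.linalg.lstsq(X * p[:, None], y * p, rcond=None)[0]


# Hypothetical data with omega_i proportional to x_i^2.
rng = np.random.default_rng(5)
n = 200
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n) * x
omega = x ** 2
beta_sums = wls(X, y, omega)
beta_transformed = wls_by_transformation(X, y, omega)
print(beta_sums, beta_transformed)
```

    Multiplying every $\omega_i$ by the same positive constant leaves both results unchanged, which is the cancellation of $\sigma^2$ noted in footnote 14.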


    A common specification is that the variance is proportional to one of the regressors or its square. Our earlier example of family expenditures is one in which the relevant variable is usually income. Similarly, in studies of firm profits, the dominant variable is typically assumed to be firm size. If

    $$\sigma_i^2 = \sigma^2 x_{ik}^2,$$

    then the transformed regression model for GLS is

    $$\frac{y}{x_k} = \beta_k + \beta_1\left(\frac{x_1}{x_k}\right) + \beta_2\left(\frac{x_2}{x_k}\right) + \cdots + \frac{\varepsilon}{x_k}. \tag{11-14}$$

    If the variance is proportional to $x_k$ instead of $x_k^2$, then the weight applied to each observation is $1/\sqrt{x_k}$ instead of $1/x_k$. In (11-14), the coefficient on $x_k$ becomes the constant term. But if the variance is proportional to any power of $x_k$ other than two, then the transformed model will no longer contain a constant, and we encounter the problem of interpreting $R^2$ mentioned earlier. For example, no conclusion should be drawn if the $R^2$ in the regression of $y/z$ on $1/z$ and $x/z$ is higher than in the regression of y on a constant and x for any z, including x. The good fit of the weighted regression might be due to the presence of $1/z$ on both sides of the equality.

    It is rarely possible to be certain about the nature of the heteroscedasticity in a regression model. In one respect, this problem is only minor. The weighted least squares estimator

    $$\hat{\boldsymbol{\beta}} = \left[\sum_{i=1}^{n} w_i\mathbf{x}_i\mathbf{x}_i'\right]^{-1}\left[\sum_{i=1}^{n} w_i\mathbf{x}_i y_i\right]$$

    is consistent regardless of the weights used, as long as the weights are uncorrelated with the disturbances. But using the wrong set of weights has two other consequences that may be less benign. First, the improperly weighted least squares estimator is inefficient. This point might be moot if the correct weights are unknown, but the GLS standard errors will also be incorrect. The asymptotic covariance matrix of the estimator

    $$\hat{\boldsymbol{\beta}} = [\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}]^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y} \tag{11-15}$$

    is

    $$\text{Asy.Var}[\hat{\boldsymbol{\beta}}] = \sigma^2[\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}]^{-1}\mathbf{X}'\mathbf{V}^{-1}\boldsymbol{\Omega}\mathbf{V}^{-1}\mathbf{X}[\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}]^{-1}. \tag{11-16}$$

    This result may or may not resemble the usual estimator, which would be the matrix in brackets, and underscores the usefulness of the White estimator in (11-9). The standard approach in the literature is to use OLS with the White estimator or some variant for the asymptotic covariance matrix. One could argue both flaws and virtues in this approach. In its favor, robustness to unknown heteroscedasticity is a compelling virtue. In the clear presence of heteroscedasticity, however, least squares can be extremely inefficient. The question becomes whether using the wrong weights is better than using no weights at all. There are several layers to the question. If we use one of the models discussed earlier (Harvey's, for example, is a versatile and flexible candidate), then we may use the wrong set of weights and, in addition, estimation of


    the variance parameters introduces a new source of variation into the slope estimators for the model. A heteroscedasticity robust estimator for weighted least squares can be formed by combining (11-16) with the White estimator. The weighted least squares estimator in (11-15) is consistent with any set of weights $\mathbf{V} = \operatorname{diag}[v_1, v_2, \ldots, v_n]$. Its asymptotic covariance matrix can be estimated with

    $$\text{Est.Asy.Var}[\hat{\boldsymbol{\beta}}] = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\left[\sum_{i=1}^{n}\left(\frac{e_i^2}{v_i^2}\right)\mathbf{x}_i\mathbf{x}_i'\right](\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}. \tag{11-17}$$

    Any consistent estimator can be used to form the residuals. The weighted least squares estimator is a natural candidate.
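    The robust covariance estimator for weighted least squares can be sketched in the same sandwich form as the White estimator. The function name `wls_robust_cov`, the deliberately misspecified weights, and the simulated data are assumptions for illustration; the residuals are formed from the weighted least squares estimates.

```python
import numpy as np


def wls_robust_cov(X, y, v):
    """WLS with weights 1/v_i and a heteroscedasticity-robust covariance estimate."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    v = np.asarray(v, dtype=float)
    Xv = X / v[:, None]                           # V^{-1} X
    A_inv = np.linalg.inv(Xv.T @ X)               # (X'V^{-1}X)^{-1}
    beta = A_inv @ (Xv.T @ y)
    e = y - X @ beta                              # residuals from the WLS fit
    meat = (X * (e ** 2 / v ** 2)[:, None]).T @ X # sum (e_i^2 / v_i^2) x_i x_i'
    return beta, A_inv @ meat @ A_inv


# Hypothetical data: true variance proportional to x^2, weights use x instead.
rng = np.random.default_rng(7)
n = 300
x = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n) * x
beta, cov = wls_robust_cov(X, y, v=x)
print(beta, np.sqrt(np.diag(cov)))
```

    Setting every $v_i = 1$ reduces the computation to ordinary least squares with the White estimator, which is a convenient check on an implementation.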

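    11.6 ESTIMATION WHEN $\boldsymbol{\Omega}$ CONTAINS UNKNOWN PARAMETERS

    The general form of the heteroscedastic regression model has too many parameters to estimate by ordinary methods. Typically, the model is restricted by formulating $\sigma^2\boldsymbol{\Omega}$ as a function of a few parameters, as in $\sigma_i^2 = \sigma^2 x_i^{\alpha}$ or $\sigma_i^2 = \sigma^2[\mathbf{x}_i'\boldsymbol{\alpha}]^2$. Write this as $\boldsymbol{\Omega}(\boldsymbol{\alpha})$. FGLS based on a consistent estimator of $\boldsymbol{\Omega}(\boldsymbol{\alpha})$ (meaning a consistent estimator of $\boldsymbol{\alpha}$) is asymptotically equivalent to full GLS, and FGLS based on a maximum likelihood estimator of $\boldsymbol{\Omega}(\boldsymbol{\alpha})$ will produce a maximum likelihood estimator of $\boldsymbol{\beta}$ if $\boldsymbol{\Omega}(\boldsymbol{\alpha})$ does not contain any elements of $\boldsymbol{\beta}$. The new problem is that we must first find consistent estimators of the unknown parameters in $\boldsymbol{\Omega}(\boldsymbol{\alpha})$. Two methods are typically used, two-step GLS and maximum likelihood.
    11.6.1 TWO-STEP ESTIMATION

    For the heteroscedastic model, the GLS estimator is

    $$\hat{\boldsymbol{\beta}} = \left[\sum_{i=1}^{n}\left(\frac{1}{\sigma_i^2}\right)\mathbf{x}_i\mathbf{x}_i'\right]^{-1}\left[\sum_{i=1}^{n}\left(\frac{1}{\sigma_i^2}\right)\mathbf{x}_i y_i\right]. \tag{11-18}$$

    The two-step estimators are computed by first obtaining estimates $\hat{\sigma}_i^2$, usually using some function of the ordinary least squares residuals. Then, $\hat{\hat{\boldsymbol{\beta}}}$ uses (11-18) and $\hat{\sigma}_i^2$. The ordinary least squares estimator of $\boldsymbol{\beta}$, although inefficient, is still consistent. As such, statistics computed using the ordinary least squares residuals, $e_i = (y_i - \mathbf{x}_i'\mathbf{b})$, will have the same asymptotic properties as those computed using the true disturbances, $\varepsilon_i = (y_i - \mathbf{x}_i'\boldsymbol{\beta})$. This result suggests a regression approach for the true disturbances and variables $\mathbf{z}_i$ that may or may not coincide with $\mathbf{x}_i$. Now $E[\varepsilon_i^2 \mid \mathbf{z}_i] = \sigma_i^2$, so $\varepsilon_i^2 = \sigma_i^2 + v_i$, where $v_i$ is just the difference between $\varepsilon_i^2$ and its conditional expectation. Since $\varepsilon_i$ is unobservable, we would use the least squares residual, for which $e_i = \varepsilon_i - \mathbf{x}_i'(\mathbf{b} - \boldsymbol{\beta}) = \varepsilon_i + u_i$. Then, $e_i^2 = \varepsilon_i^2 + u_i^2 + 2\varepsilon_i u_i$. But, in large samples, as $\mathbf{b} \xrightarrow{p} \boldsymbol{\beta}$, terms in $u_i$ will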


    become negligible, so that, at least approximately,^15 $e_i^2 = \sigma_i^2 + v_i^*$. The procedure suggested is to treat the variance function as a regression and use the squares or some other functions of the least squares residuals as the dependent variable.^16 For example, if $\sigma_i^2 = \mathbf{z}_i'\boldsymbol{\alpha}$, then a consistent estimator of $\boldsymbol{\alpha}$ will be the least squares slopes, a, in the "model" $e_i^2 = \mathbf{z}_i'\boldsymbol{\alpha} + v_i^*$. In this model, $v_i^*$ is both heteroscedastic and autocorrelated, so a is consistent but inefficient. But consistency is all that is required for asymptotically efficient estimation of $\boldsymbol{\beta}$ using $\boldsymbol{\Omega}(\hat{\boldsymbol{\alpha}})$. It remains to be settled whether improving the estimator of $\boldsymbol{\alpha}$ in this and the other models we will consider would improve the small-sample properties of the two-step estimator of $\boldsymbol{\beta}$.^17 The two-step estimator may be iterated by recomputing the residuals after computing the FGLS estimates and then reentering the computation. The asymptotic properties of the iterated estimator are the same as those of the two-step estimator, however. In some cases, this sort of iteration will produce the maximum likelihood estimator at convergence. Yet none of the estimators based on regression of squared residuals on other variables satisfy the requirement. Thus, iteration in this context provides little additional benefit, if any.
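    The two-step procedure for the linear variance function $\sigma_i^2 = \mathbf{z}_i'\boldsymbol{\alpha}$ can be sketched as follows. The function name `two_step_fgls`, the guard against nonpositive fitted variances, and the simulated data are assumptions of this illustration; the text itself does not say how to handle fitted variances that turn out negative.

```python
import numpy as np


def two_step_fgls(X, y, Z):
    """Two-step FGLS when the variance function is sigma_i^2 = z_i'alpha."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Z = np.asarray(Z, dtype=float)
    # Step 1: OLS, then regress the squared residuals on z_i.
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    a = np.linalg.lstsq(Z, e ** 2, rcond=None)[0]
    sigma2_hat = Z @ a
    if np.any(sigma2_hat <= 0):
        # Assumption of this sketch: refuse to proceed rather than patch the weights.
        raise ValueError("fitted variances must be positive")
    # Step 2: weighted least squares with weights 1 / sigma2_hat, as in (11-18).
    Xw = X / sigma2_hat[:, None]
    beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)
    return beta, a


# Hypothetical data: Var[eps_i] = 4 + 2 z_i, true beta = (1, 2).
rng = np.random.default_rng(6)
n = 5000
z = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), z])
Z = X
y = 1.0 + 2.0 * z + rng.normal(size=n) * np.sqrt(4.0 + 2.0 * z)
beta_fgls, a = two_step_fgls(X, y, Z)
print(beta_fgls, a)
```

    Iterating the estimator would amount to recomputing e from `beta_fgls` and repeating both steps, which, as noted above, does not change the asymptotic properties.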
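    11.6.2 MAXIMUM LIKELIHOOD ESTIMATION^18

    The log-likelihood function for a sample of normally distributed observations is

    $$\ln L = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\sum_{i=1}^{n}\left[\ln\sigma_i^2 + \frac{1}{\sigma_i^2}(y_i - \mathbf{x}_i'\boldsymbol{\beta})^2\right].$$

    For simplicity, let

    $$\sigma_i^2 = \sigma^2 f_i(\boldsymbol{\alpha}), \tag{11-19}$$

    where $\boldsymbol{\alpha}$ is the vector of unknown parameters in $\boldsymbol{\Omega}(\boldsymbol{\alpha})$ and $f_i(\boldsymbol{\alpha})$ is indexed by i to indicate that it is a function of $\mathbf{z}_i$; note that $\boldsymbol{\Omega}(\boldsymbol{\alpha}) = \operatorname{diag}[f_i(\boldsymbol{\alpha})]$, so it is also. Assume as well that no elements of $\boldsymbol{\beta}$ appear in $\boldsymbol{\alpha}$. The log-likelihood function is

    $$\ln L = -\frac{n}{2}\left[\ln(2\pi) + \ln\sigma^2\right] - \frac{1}{2}\sum_{i=1}^{n}\left[\ln f_i(\boldsymbol{\alpha}) + \frac{1}{\sigma^2}\left(\frac{1}{f_i(\boldsymbol{\alpha})}\right)(y_i - \mathbf{x}_i'\boldsymbol{\beta})^2\right].$$

    For convenience in what follows, substitute $\varepsilon_i$ for $(y_i - \mathbf{x}_i'\boldsymbol{\beta})$, denote $f_i(\boldsymbol{\alpha})$ as simply $f_i$, and denote the vector of derivatives $\partial f_i(\boldsymbol{\alpha})/\partial\boldsymbol{\alpha}$ as $\mathbf{g}_i$. Then, the derivatives of the
    15 See Amemiya (1985) for formal analysis.
    16 See, for example, Jobson and Fuller (1980).
    17 Fomby, Hill, and Johnson (1984, pp. 177-186) and Amemiya (1985, pp. 203-207; 1977a) examine this model.
    18 The method of maximum likelihood estimation is developed in Chapter 17.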


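    log-likelihood function are

    $$\frac{\partial \ln L}{\partial\boldsymbol{\beta}} = \sum_{i=1}^{n}\mathbf{x}_i\frac{\varepsilon_i}{\sigma^2 f_i},$$

    $$\frac{\partial \ln L}{\partial\sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}\frac{\varepsilon_i^2}{f_i} = \sum_{i=1}^{n}\left(\frac{1}{2\sigma^2}\right)\left(\frac{\varepsilon_i^2}{\sigma^2 f_i} - 1\right), \tag{11-20}$$

    $$\frac{\partial \ln L}{\partial\boldsymbol{\alpha}} = \sum_{i=1}^{n}\left(\frac{1}{2}\right)\left(\frac{\varepsilon_i^2}{\sigma^2 f_i} - 1\right)\left(\frac{1}{f_i}\right)\mathbf{g}_i.$$

    Since $E[\varepsilon_i \mid \mathbf{x}_i, \mathbf{z}_i] = 0$ and $E[\varepsilon_i^2 \mid \mathbf{x}_i, \mathbf{z}_i] = \sigma^2 f_i$, it is clear that all derivatives have expectation zero as required. The maximum likelihood estimators are those values of $\boldsymbol{\beta}$, $\sigma^2$, and $\boldsymbol{\alpha}$ that simultaneously equate these derivatives to zero. The likelihood equations are generally highly nonlinear and will usually require an iterative solution.

    Let G be the $n \times M$ matrix with ith row equal to $\partial f_i/\partial\boldsymbol{\alpha}' = \mathbf{g}_i'$ and let i denote an $n \times 1$ column vector of 1s. The asymptotic covariance matrix for the maximum likelihood estimator in this model is

    $$\left(-E\left[\frac{\partial^2 \ln L}{\partial\boldsymbol{\gamma}\,\partial\boldsymbol{\gamma}'}\right]\right)^{-1} = \begin{bmatrix} (1/\sigma^2)\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X} & \mathbf{0} & \mathbf{0} \\ \mathbf{0}' & n/(2\sigma^4) & (1/(2\sigma^2))\mathbf{i}'\boldsymbol{\Omega}^{-1}\mathbf{G} \\ \mathbf{0}' & (1/(2\sigma^2))\mathbf{G}'\boldsymbol{\Omega}^{-1}\mathbf{i} & (1/2)\mathbf{G}'\boldsymbol{\Omega}^{-2}\mathbf{G} \end{bmatrix}^{-1}, \tag{11-21}$$

    where $\boldsymbol{\gamma}' = [\boldsymbol{\beta}', \sigma^2, \boldsymbol{\alpha}']$. (One convenience is that terms involving $\partial^2 f_i/\partial\boldsymbol{\alpha}\,\partial\boldsymbol{\alpha}'$ fall out of the expectations. The proof is considered in the exercises.)

    From the likelihood equations, it is apparent that for a given value of $\boldsymbol{\alpha}$, the solution for $\boldsymbol{\beta}$ is the GLS estimator. The scale parameter, $\sigma^2$, is ultimately irrelevant to this solution. The second likelihood equation shows that for given values of $\boldsymbol{\beta}$ and $\boldsymbol{\alpha}$, $\sigma^2$ will be estimated as the mean of the squared generalized residuals, $\hat{\sigma}^2 = (1/n)\sum_{i=1}^{n}[(y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}})/\sqrt{\hat{f}_i}]^2$. This term is the generalized sum of squares. Finally, there is no general solution to be found for the estimator of $\boldsymbol{\alpha}$; it depends on the model. We will examine two examples. If $\boldsymbol{\alpha}$ is only a single parameter, then it may be simplest just to scan a range of values of $\alpha$ to locate the one that, with the associated FGLS estimator of $\boldsymbol{\beta}$, maximizes the log-likelihood. The fact that the Hessian is block diagonal does provide an additional convenience. The parameter vector $\boldsymbol{\beta}$ may always be estimated conditionally on $[\sigma^2, \boldsymbol{\alpha}]$ and, likewise, if $\boldsymbol{\beta}$ is given, then the solutions for $\sigma^2$ and $\boldsymbol{\alpha}$ can be found conditionally, although this may be a complicated optimization problem. But by going back and forth in this fashion, as suggested by Oberhofer and Kmenta (1974), we may be able to obtain the full solution more easily than by approaching the full set of equations simultaneously.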
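    When there is a single variance parameter, the scan described above can be sketched as a grid search over the concentrated log-likelihood. The variance function $f_i = z_i^{\alpha}$, the grid, the function names `profile_loglik` and `scan_alpha`, and the simulated data are assumptions chosen for illustration. For each $\alpha$ the sketch computes the GLS estimator of $\boldsymbol{\beta}$, then $\hat{\sigma}^2$ as the mean squared generalized residual, and substitutes both into the log-likelihood.

```python
import numpy as np


def profile_loglik(alpha, X, y, z):
    """Log-likelihood concentrated over beta and sigma^2 when f_i = z_i ** alpha."""
    f = z ** alpha
    Xw = X / f[:, None]
    beta = np.linalg.solve(Xw.T @ X, Xw.T @ y)      # GLS given alpha
    eps = y - X @ beta
    n = y.size
    sigma2 = np.mean(eps ** 2 / f)                  # mean squared generalized residual
    ll = -0.5 * n * (1.0 + np.log(2.0 * np.pi) + np.log(sigma2)) - 0.5 * np.sum(np.log(f))
    return ll, beta, sigma2


def scan_alpha(X, y, z, grid):
    """Return the grid value of alpha with the largest log-likelihood."""
    results = [profile_loglik(a, X, y, z) for a in grid]
    best = int(np.argmax([r[0] for r in results]))
    return grid[best], results[best]


# Hypothetical data: Var[eps_i] proportional to z_i^2, so alpha = 2.
rng = np.random.default_rng(9)
n = 500
z = rng.uniform(1.0, 5.0, size=n)
X = np.column_stack([np.ones(n), z])
y = 1.0 + 2.0 * z + rng.normal(size=n) * z
grid = np.linspace(0.0, 4.0, 41)
alpha_hat, (ll_hat, beta_hat, sigma2_hat) = scan_alpha(X, y, z, grid)
ll_0 = profile_loglik(0.0, X, y, z)[0]              # homoscedastic (restricted) value
lr = 2.0 * (ll_hat - ll_0)
print(alpha_hat, beta_hat, lr)
```

    At $\alpha = 0$ the concentrated log-likelihood reduces to the homoscedastic value, so the difference `lr` is the likelihood ratio statistic for homoscedasticity within this particular variance model.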
    11.6.3 MODEL BASED TESTS FOR HETEROSCEDASTICITY

    The tests for heteroscedasticity described in Section 11.4 are based on the behavior of the least squares residuals. The general approach is based on the idea that if heteroscedasticity of any form is present in the disturbances, it will be discernible in the behavior of the residuals. Those residual based tests are robust in the sense that they


    will detect heteroscedasticity of a variety of forms On the other hand their power is a function of the speci c alternative The model considered here is fairly narrow The tradeoff is that within the context of the speci ed model a test of heteroscedasticity will have greater power than the residual based tests To come full circle of course that means that if the model speci cation is incorrect the tests are likely to have limited or no power at all to reveal an incorrect hypothesis of homoscedasticity Testing the hypothesis of homoscedasticity using any of the three standard methods is particularly simple in the model outlined in this section The trio of tests for parametric models is available The model would generally be formulated so that the heteroscedasticity is induced by a nonzero Thus we take the test of H0 0 to be a test against homoscedasticity Wald Test The Wald statistic is computed by extracting from the full parameter vector and its estimated asymptotic covariance matrix the subvector and its asymptotic covariance matrix Then W Est Asy Var
    1



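    Likelihood Ratio Test. The results of the homoscedastic least squares regression are generally used to obtain the initial values for the iterations. The restricted log-likelihood value is a by-product of the initial setup:

    $$\ln L_R = -\frac{n}{2}\left[1 + \ln 2\pi + \ln(e'e/n)\right].$$

    The unrestricted log-likelihood, $\ln L_U$, is obtained as the objective function for the estimation. Then, the statistic for the test is

    $$LR = -2(\ln L_R - \ln L_U).$$

    Lagrange Multiplier Test. To set up the LM test, we refer back to the model in (11-19) to (11-21). At the restricted estimates $\alpha = 0$, $\beta = b$, $\sigma^2 = e'e/n$ (not $n - K$), $f_i = 1$ and $\Omega(0) = I$. Thus, the first derivatives vector evaluated at the least squares estimates is

    $$\frac{\partial \ln L}{\partial\beta}\bigg|_{\beta = b,\ \sigma^2 = e'e/n,\ \hat\alpha = 0} = 0,$$

    $$\frac{\partial \ln L}{\partial\sigma^2}\bigg|_{\beta = b,\ \sigma^2 = e'e/n,\ \hat\alpha = 0} = 0,$$

    $$\frac{\partial \ln L}{\partial\alpha}\bigg|_{\beta = b,\ \sigma^2 = e'e/n,\ \hat\alpha = 0} = \sum_{i=1}^{n}\frac{1}{2}\left(\frac{e_i^2}{e'e/n} - 1\right) g_i = \sum_{i=1}^{n}\frac{1}{2}\, v_i\, g_i.$$

    The negative expected inverse of the Hessian, from (11-21), is

    $$\left(-E\left[\frac{\partial^2 \ln L}{\partial\gamma\,\partial\gamma'}\right]_{\alpha = 0}\right)^{-1} = \begin{bmatrix} (1/\sigma^2)\,X'X & 0 & 0 \\ 0' & n/(2\sigma^4) & [1/(2\sigma^2)]\,g' \\ 0' & [1/(2\sigma^2)]\,g & (1/2)\,G'G \end{bmatrix}^{-1} = \{-E[H]\}^{-1},$$

    where $g = \sum_{i=1}^{n} g_i$ and $G'G = \sum_{i=1}^{n} g_i g_i'$. The LM statistic will be

    $$LM = \left[\frac{\partial \ln L}{\partial\gamma}\bigg|_{\beta = b,\ \sigma^2 = e'e/n,\ \alpha = 0}\right]' \{-E[H]\}^{-1} \left[\frac{\partial \ln L}{\partial\gamma}\bigg|_{\beta = b,\ \sigma^2 = e'e/n,\ \alpha = 0}\right].$$

    With a bit of algebra and using (B-66) for the partitioned inverse, you can show that this reduces to

    $$LM = \frac{1}{2}\left\{\sum_{i=1}^{n} v_i g_i\right\}'\left[\sum_{i=1}^{n}(g_i - \bar g)(g_i - \bar g)'\right]^{-1}\left\{\sum_{i=1}^{n} v_i g_i\right\}.$$

    This result, as given by Breusch and Pagan (1980), is simply one half times the regression sum of squares in the regression of $v_i$ on a constant and $g_i$. This actually simplifies even further if, as in the cases studied by Breusch and Pagan, the variance function is $f_i = f(z_i'\alpha)$ where $f(z_i'0) = 1$. Then, the derivative will be of the form $g_i = r(z_i'\alpha)\,z_i$ and it will follow that $r_i(z_i'0)$ is a constant. In this instance, the same statistic will result from the regression of $v_i$ on a constant and $z_i$, which is the result reported in Section 11.4.3. The remarkable aspect of the result is that the same statistic results regardless of the choice of variance function, so long as it satisfies $f_i = f(z_i'\alpha)$ where $f(z_i'0) = 1$. The model studied by Harvey, for example, has $f_i = \exp(z_i'\alpha)$, so $g_i = z_i$ when $\alpha = 0$.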
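
    As a small illustration of the last result, the LM statistic can be computed as one half of the regression sum of squares from regressing $v_i$ on a constant and $z_i$. The sketch assumes a single variable $z_i$ and takes the least squares residuals as given; the function name `breusch_pagan_lm` and the toy numbers are hypothetical.

```python
def breusch_pagan_lm(e, z):
    """One half of the explained sum of squares in the regression of v_i on (1, z_i)."""
    n = len(e)
    s2 = sum(ei * ei for ei in e) / n               # e'e / n, not divided by n - K
    v = [ei * ei / s2 - 1.0 for ei in e]            # v_i sums to zero by construction
    zbar = sum(z) / n
    szz = sum((zi - zbar) ** 2 for zi in z)
    svz = sum(vi * (zi - zbar) for vi, zi in zip(v, z))
    slope = svz / szz                               # slope in the auxiliary regression
    explained_ss = slope * slope * szz              # regression sum of squares
    return 0.5 * explained_ss

# Toy residuals whose squares rise with z (illustrative values only).
lm = breusch_pagan_lm([1.0, -1.0, 2.0, -2.0], [1.0, 2.0, 3.0, 4.0])
# Residuals with constant squares give a statistic of zero.
lm_flat = breusch_pagan_lm([1.0, -1.0, 1.0, -1.0], [1.0, 2.0, 3.0, 4.0])
```

    With several variables in $z_i$, the same calculation would use a multiple regression; the scalar case is shown because the sums can be written out directly.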
    Example 11.4 Two-Step Estimation of a Heteroscedastic Regression

    Table 11.2 lists weighted least squares and two-step FGLS estimates of the parameters of the regression model in Example 11.1 using various formulations of the scedastic function. The method used to compute the weights for weighted least squares is given below each model formulation. The procedure was iterated to convergence for the model $\sigma_i^2 = \sigma^2 z_i^{\alpha}$; convergence required 13 iterations. (The two-step estimates are those computed by the first iteration.) ML estimates for this model are also shown. As often happens, the iteration produces fairly large changes in the estimates. There is also a considerable amount of variation produced by the different formulations.

    For the model $f_i = z_i^{\alpha}$, the concentrated log-likelihood is simple to compute. We can find the maximum likelihood estimate for this model just by scanning over a range of values for $\alpha$. For any $\alpha$, the maximum likelihood estimator of $\beta$ is weighted least squares, with weights $w_i = 1/z_i^{\alpha}$. For our expenditure model, we use income for $z_i$. Figure 11.2 shows a plot of the log-likelihood function. The maximum occurs at $\alpha = 3.65$. This value, with the FGLS estimates of $\beta$, is shown in Table 11.2.

    TABLE 11.2  Two-Step and Weighted Least Squares Estimates

                  Constant    Age       OwnRent    Income    Income^2

    $\sigma_i^2 = \sigma^2$ (OLS)
        est.      237.15      3.0818    27.941     234.35    14.997
        s.e.      199.35      5.5147    82.922     80.366    7.4693

    $\sigma_i^2 = \sigma^2 I_i$ (WLS)
        est.      181.87      2.9350    50.494     202.17    12.114
        s.e.      165.52      4.6033    69.879     76.781    8.2731

    $\sigma_i^2 = \sigma^2 I_i^2$ (WLS)
        est.      114.11      2.6942    60.449     158.43    7.2492
        s.e.      139.69      3.8074    58.551     76.392    9.7243

    $\sigma_i^2 = \sigma^2 \exp(z_i'\alpha)$ ($\ln e_i^2$ on $z_i = (1, \ln I_i)$)
        est.      117.88      1.2337    50.950     145.30    7.9383
        s.e.      101.39      2.5512    52.814     46.363    3.7367

    $\sigma_i^2 = \sigma^2 z_i^{\alpha}$ (2 Step) ($\ln e_i^2$ on $(1, \ln z_i)$)
        est.      193.33      2.9579    47.357     208.86    12.769
        s.e.      171.08      4.7627    72.139     77.198    8.0838

    (iterated) ($\alpha = 1.7623$)
        est.      130.38      2.7754    59.126     169.74    8.5995
        s.e.      145.03      3.9817    61.0434    76.180    9.3133

    (ML) ($\alpha = 3.6513$)
        est.      19.929      1.7058    58.102     75.970    4.3915
        s.e.      113.06      2.7581    43.5084    81.040    13.433

    FIGURE 11.2  Plot of Log-Likelihood Function. (Vertical axis: LOGLHREG; horizontal axis: ALPHA.)

    Note that this value of $\alpha$ is very different from the value we obtained by iterative regression of the logs of the squared residuals on log income. In this model, $g_i = f_i \ln z_i$. If we insert this into the expression for $\partial \ln L/\partial\alpha$ and manipulate it a bit, we obtain the implicit solution

    $$\sum_{i=1}^{n}\left(\frac{\varepsilon_i^2}{\sigma^2 z_i^{\alpha}} - 1\right)\ln z_i = 0.$$

    (The $\frac{1}{2}$ disappears from the solution.) For given values of $\sigma^2$ and $\beta$, this result provides only an implicit solution for $\alpha$. In the next section, we examine a method for finding a solution. At this point, we note that the solution to this equation is clearly not obtained by regression of the logs of the squared residuals on $\log z_i$. Hence, the strategy we used for the two-step estimator does not seek the maximum likelihood estimator.
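
    One way to see what "implicit solution" means in practice is to solve the equation numerically for given $\sigma^2$ and squared disturbances. The sketch below uses bisection, which is a choice made here and not the method of the text; it relies on the left-hand side being decreasing in $\alpha$ (its derivative is $-\sum_i [\varepsilon_i^2/(\sigma^2 z_i^{\alpha})](\ln z_i)^2$). The function name `implicit_alpha`, the bracket, and the toy data are hypothetical.

```python
import math

def implicit_alpha(eps2, z, sigma2, lo=-5.0, hi=10.0, tol=1e-10):
    """Solve sum_i (eps_i^2 / (sigma^2 z_i^alpha) - 1) ln z_i = 0 for alpha by bisection."""
    def h(alpha):
        return sum((e2 / (sigma2 * zi ** alpha) - 1.0) * math.log(zi)
                   for e2, zi in zip(eps2, z))
    if h(lo) < 0.0 or h(hi) > 0.0:
        raise ValueError("root is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) > 0.0:        # h is decreasing, so the root lies to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy check: squared disturbances that equal sigma^2 * z_i^1.7 exactly.
z = [1.5, 2.0, 3.0, 4.0, 5.0]
sigma2 = 2.0
eps2 = [sigma2 * zi ** 1.7 for zi in z]
alpha_hat = implicit_alpha(eps2, z, sigma2)
```

    In an actual estimation, $\sigma^2$ and $\beta$ would themselves be updated, so a root-finder like this would be one step inside a larger iteration.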

    11.7 APPLICATIONS

    This section will present two common applications of the heteroscedastic regression model: Harvey's model of multiplicative heteroscedasticity and a model of groupwise heteroscedasticity that extends to the disturbance variance some concepts that are usually associated with variation in the regression function.

    11.7.1 MULTIPLICATIVE HETEROSCEDASTICITY

    Harvey's (1976) model of multiplicative heteroscedasticity is a very flexible, general model that includes most of the useful formulations as special cases. The general formulation is

    $$\sigma_i^2 = \sigma^2 \exp(z_i'\alpha).$$

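    The model examined in Example 11.4 has $z_i = \ln \text{income}_i$. More generally, a model with heteroscedasticity of the form

    $$\sigma_i^2 = \sigma^2 \prod_{m=1}^{M} z_{im}^{\alpha_m}$$

    results if the logs of the variables are placed in $z_i$. The groupwise heteroscedasticity model described below is produced by making $z_i$ a set of group dummy variables (one must be omitted). In this case, $\sigma^2$ is the disturbance variance for the base group, whereas for the other groups, $\sigma_g^2 = \sigma^2 \exp(\alpha_g)$.

    We begin with a useful simplification. Let $z_i$ include a constant term so that $z_i' = [1, q_i']$, where $q_i$ is the original set of variables, and let $\gamma' = [\ln\sigma^2, \alpha']$. Then, the model is simply $\sigma_i^2 = \exp(\gamma' z_i)$. Once the full parameter vector is estimated, $\exp(\gamma_1)$ provides the estimator of $\sigma^2$. (This estimator uses the invariance result for maximum likelihood estimation. See Section 17.4.5.d.)

    The log-likelihood is

    $$\ln L = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\sum_{i=1}^{n}\ln\sigma_i^2 - \frac{1}{2}\sum_{i=1}^{n}\frac{\varepsilon_i^2}{\sigma_i^2} = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\sum_{i=1}^{n} z_i'\gamma - \frac{1}{2}\sum_{i=1}^{n}\frac{\varepsilon_i^2}{\exp(z_i'\gamma)}.$$

    The likelihood equations are

    $$\frac{\partial \ln L}{\partial\beta} = \sum_{i=1}^{n} x_i\,\frac{\varepsilon_i}{\exp(z_i'\gamma)} = X'\Omega^{-1}\varepsilon = 0,$$

    $$\frac{\partial \ln L}{\partial\gamma} = \frac{1}{2}\sum_{i=1}^{n} z_i\left(\frac{\varepsilon_i^2}{\exp(z_i'\gamma)} - 1\right) = 0.$$

    For this model, the method of scoring turns out to be a particularly convenient way to maximize the log-likelihood function. The terms in the Hessian are

    $$\frac{\partial^2 \ln L}{\partial\beta\,\partial\beta'} = -\sum_{i=1}^{n}\frac{1}{\exp(z_i'\gamma)}\,x_i x_i' = -X'\Omega^{-1}X,$$

    $$\frac{\partial^2 \ln L}{\partial\beta\,\partial\gamma'} = -\sum_{i=1}^{n}\frac{\varepsilon_i}{\exp(z_i'\gamma)}\,x_i z_i',$$

    $$\frac{\partial^2 \ln L}{\partial\gamma\,\partial\gamma'} = -\frac{1}{2}\sum_{i=1}^{n}\frac{\varepsilon_i^2}{\exp(z_i'\gamma)}\,z_i z_i'.$$

    The expected value of $\partial^2 \ln L/\partial\beta\,\partial\gamma'$ is 0 since $E[\varepsilon_i \mid x_i, z_i] = 0$. The expected value of the fraction in $\partial^2 \ln L/\partial\gamma\,\partial\gamma'$ is $E[\varepsilon_i^2/\sigma_i^2 \mid x_i, z_i] = 1$. Let $\delta = [\beta, \gamma]$. Then

    $$-E\left[\frac{\partial^2 \ln L}{\partial\delta\,\partial\delta'}\right] = \begin{bmatrix} X'\Omega^{-1}X & 0 \\ 0' & \frac{1}{2}Z'Z \end{bmatrix} = -H.$$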

    The scoring method is

    $$\delta_{t+1} = \delta_t - H_t^{-1} g_t,$$

    where $\delta_t$ (i.e., $\beta_t$, $\gamma_t$, and $\Omega_t$) is the estimate at iteration $t$, $g_t$ is the two-part vector of first derivatives $[\partial \ln L/\partial\beta_t', \partial \ln L/\partial\gamma_t']'$, and $H_t$ is partitioned likewise. Since $H_t$ is block diagonal, the iteration can be written as separate equations:

    $$\beta_{t+1} = \beta_t + (X'\Omega_t^{-1}X)^{-1}(X'\Omega_t^{-1}\varepsilon_t) = \beta_t + (X'\Omega_t^{-1}X)^{-1}X'\Omega_t^{-1}(y - X\beta_t) = (X'\Omega_t^{-1}X)^{-1}X'\Omega_t^{-1}y \quad \text{(of course)}.$$

    Therefore, the updated coefficient vector $\beta_{t+1}$ is computed by FGLS using the previously computed estimate of $\gamma$ to compute $\Omega$. We use the same approach for $\gamma$:

    $$\gamma_{t+1} = \gamma_t + \left[2(Z'Z)^{-1}\right]\left[\frac{1}{2}\sum_{i=1}^{n} z_i\left(\frac{\varepsilon_i^2}{\exp(z_i'\gamma)} - 1\right)\right].$$

    The 2 and $\frac{1}{2}$ cancel. The updated value of $\gamma$ is computed by adding the vector of slopes in the least squares regression of $[\varepsilon_i^2/\exp(z_i'\gamma) - 1]$ on $z_i$ to the old one. Note that the correction is $2(Z'Z)^{-1}(\partial \ln L/\partial\gamma)$, so convergence occurs when the derivative is zero.

    The remaining detail is to determine the starting value for the iteration. Since any consistent estimator will do, the simplest procedure is to use OLS for $\beta$ and the slopes in a regression of the logs of the squares of the least squares residuals on $z_i$ for $\gamma$. Harvey (1976) shows that this method will produce an inconsistent estimator of $\gamma_1 = \ln\sigma^2$, but the inconsistency can be corrected just by adding 1.2704 to the value obtained.[19] Thereafter, the iteration is simply:

    1. Estimate the disturbance variance $\sigma_i^2$ with $\exp(\gamma_t' z_i)$.
    2. Compute $\beta_{t+1}$ by FGLS.[20]
    3. Update $\gamma_t$ using the regression described in the preceding paragraph.
    4. Compute $d_{t+1} = [\beta_{t+1}, \gamma_{t+1}] - [\beta_t, \gamma_t]$. If $d_{t+1}$ is large, then return to step 1.

    If $d_{t+1}$ at step 4 is sufficiently small, then exit the iteration. The asymptotic covariance matrix is simply $-H^{-1}$, which is block diagonal with blocks

    $$\text{Asy.Var}[\hat\beta_{ML}] = (X'\Omega^{-1}X)^{-1},$$

    $$\text{Asy.Var}[\hat\gamma_{ML}] = 2(Z'Z)^{-1}.$$

    If desired, then $\hat\sigma^2 = \exp(\hat\gamma_1)$ can be computed. The asymptotic variance would be $[\exp(\hat\gamma_1)]^2\,(\text{Asy.Var}[\hat\gamma_{1,ML}])$.

    [19] He also presents a correction for the asymptotic covariance matrix for this first step estimator of $\gamma$.

    [20] The two-step estimator obtained by stopping here would be fully efficient if the starting value for $\gamma$ were consistent, but it would not be the maximum likelihood estimator.
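
    The four-step iteration above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the data are simulated, the variance index uses a constant and one variable, the residuals used to update the variance parameters are those from the current (not yet updated) coefficient vector, and all function names (`solve`, `wls`, `dot`, `harvey_mle`) are hypothetical.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            fac = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= fac * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def wls(X, y, w):
    """Weighted least squares coefficients (X'WX)^{-1} X'Wy with W = diag(w)."""
    k = len(X[0])
    xtx = [[sum(wi * r[a] * r[b] for wi, r in zip(w, X)) for b in range(k)]
           for a in range(k)]
    xty = [sum(wi * r[a] * yi for wi, r, yi in zip(w, X, y)) for a in range(k)]
    return solve(xtx, xty)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def harvey_mle(X, Z, y, tol=1e-8, max_iter=200):
    """Scoring iteration for sigma_i^2 = exp(z_i'gamma); returns (beta, gamma)."""
    ones = [1.0] * len(y)
    beta = wls(X, y, ones)                                    # OLS starting value
    e = [yi - dot(xi, beta) for xi, yi in zip(X, y)]
    gamma = wls(Z, [math.log(ei * ei) for ei in e], ones)     # log squared residuals on z
    gamma[0] += 1.2704                                        # correction to the constant
    for _ in range(max_iter):
        s2 = [math.exp(dot(zi, gamma)) for zi in Z]           # step 1
        new_beta = wls(X, y, [1.0 / s for s in s2])           # step 2: FGLS
        e = [yi - dot(xi, beta) for xi, yi in zip(X, y)]      # residuals at current beta
        v = [ei * ei / s - 1.0 for ei, s in zip(e, s2)]
        step = wls(Z, v, ones)                                # step 3: slopes of v on z
        new_gamma = [g + d for g, d in zip(gamma, step)]
        change = max(abs(a - b) for a, b in zip(new_beta + new_gamma, beta + gamma))
        beta, gamma = new_beta, new_gamma
        if change < tol:                                      # step 4
            break
    return beta, gamma

# Simulated data (assumed for illustration): beta = (1, 2), gamma = (0.5, 1.0).
rng = random.Random(42)
n = 1000
X = [[1.0, rng.uniform(0.0, 10.0)] for _ in range(n)]
Z = [[1.0, rng.uniform(-1.0, 1.0)] for _ in range(n)]
y = [dot(xi, [1.0, 2.0]) + math.sqrt(math.exp(dot(zi, [0.5, 1.0]))) * rng.gauss(0.0, 1.0)
     for xi, zi in zip(X, Z)]
beta_hat, gamma_hat = harvey_mle(X, Z, y)
```

    The small hand-written solver keeps the sketch free of external libraries; with a matrix library available, the two regressions inside the loop would each be a single least squares call.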

    TABLE 11.3  Multiplicative Heteroscedasticity Model

                        Constant    Age        OwnRent    Income    Income^2

    Ordinary Least Squares Estimates
    Coefficient         237.15      3.0818     27.941     234.35    14.997
    Standard error      199.35      5.5147     82.922     80.366    7.469
    t ratio             1.1         0.559      0.337      2.916     2.008
    R^2 = 0.243578, s = 284.75080, Ln-L = 506.488

    Maximum Likelihood Estimates (standard errors for estimates of $\gamma$ in parentheses)
    Coefficient         58.437      0.37607    33.358     96.823    3.3008
    Standard error      62.098      0.55000    37.135     31.798    2.6248
    t ratio             0.941       0.684      0.898      3.045     1.448
    $[\exp(c_1)]^{1/2}$ = 0.9792 (0.79115), $c_2$ = 5.355 (0.37504), $c_3$ = 0.56315 (0.036122)
    Ln-L = 465.9817, Wald = 251.423, LR = 81.0142, LM = 115.899
    Example 11.5 Multiplicative Heteroscedasticity

    Estimates of the regression model of Example 11.1 based on Harvey's model are shown in Table 11.3 with the ordinary least squares results. The scedastic function is

    $$\sigma_i^2 = \exp(\gamma_1 + \gamma_2\,\text{income}_i + \gamma_3\,\text{income}_i^2).$$

    The estimates are consistent with the earlier results in suggesting that Income and its square significantly explain variation in the disturbance variances across observations. The 95 percent critical value for a chi-squared test with two degrees of freedom is 5.99, so all three test statistics lead to rejection of the hypothesis of homoscedasticity.
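
    The comparison in this example can be reproduced directly. With two degrees of freedom, the chi-squared distribution function is $1 - \exp(-x/2)$, so the critical value has a closed form; the test statistics below are the values reported in Table 11.3. The function name `chi2_critical_2df` is hypothetical.

```python
import math

def chi2_critical_2df(level=0.95):
    """Critical value of chi-squared(2): the cdf is 1 - exp(-x/2), so x = -2 ln(1 - level)."""
    return -2.0 * math.log(1.0 - level)

statistics = {"Wald": 251.423, "LR": 81.0142, "LM": 115.899}   # from Table 11.3
critical_value = chi2_critical_2df()
rejected = {name: value > critical_value for name, value in statistics.items()}
```

    The closed form is specific to two degrees of freedom; other cases would need a table or a statistical library.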
    11.7.2 GROUPWISE HETEROSCEDASTICITY

    A groupwise heteroscedastic regression has structural equations

    $$y_i = x_i'\beta + \varepsilon_i, \quad i = 1, \ldots, n,$$

    $$E[\varepsilon_i \mid x_i] = 0, \quad i = 1, \ldots, n.$$

    The $n$ observations are grouped into $G$ groups, each with $n_g$ observations. The slope vector is the same in all groups, but within group $g$:

    $$\text{Var}[\varepsilon_{ig} \mid x_{ig}] = \sigma_g^2, \quad i = 1, \ldots, n_g.$$

    If the variances are known, then the GLS estimator is

    $$\hat\beta = \left[\sum_{g=1}^{G}\left(\frac{1}{\sigma_g^2}\right)X_g'X_g\right]^{-1}\left[\sum_{g=1}^{G}\left(\frac{1}{\sigma_g^2}\right)X_g'y_g\right]. \qquad (11\text{-}22)$$

    Since $X_g'y_g = X_g'X_g b_g$, where $b_g$ is the OLS estimator in the $g$th subset of observations,

    $$\hat\beta = \left[\sum_{g=1}^{G}\left(\frac{1}{\sigma_g^2}\right)X_g'X_g\right]^{-1}\left[\sum_{g=1}^{G}\left(\frac{1}{\sigma_g^2}\right)X_g'X_g b_g\right] = \left[\sum_{g=1}^{G}V_g\right]^{-1}\left[\sum_{g=1}^{G}V_g b_g\right] = \sum_{g=1}^{G}W_g b_g.$$

    This result is a matrix weighted average of the $G$ least squares estimators. The weighting matrices are $W_g = \left[\sum_{g=1}^{G}(\text{Var}[b_g])^{-1}\right]^{-1}(\text{Var}[b_g])^{-1}$. The estimator with the smaller covariance matrix therefore receives the larger weight. (If $X_g$ is the same in every group, then the matrix $W_g$ reduces to the simple scalar $w_g = h_g/\sum_g h_g$, where $h_g = 1/\sigma_g^2$.)

    The preceding is a useful construction of the estimator, but it relies on an algebraic result that might be unusable. If the number of observations in any group is smaller than the number of regressors, then the group specific OLS estimator cannot be computed. But, as can be seen in (11-22), that is not what is needed to proceed; what is needed are the weights. As always, pooled least squares is a consistent estimator, which means that using the group specific subvectors of the OLS residuals,

    $$\hat\sigma_g^2 = \frac{e_g'e_g}{n_g} \qquad (11\text{-}23)$$

    provides the needed estimator for the group specific disturbance variance. Thereafter, (11-22) is the estimator and the inverse matrix in that expression gives the estimator of the asymptotic covariance matrix.

    Continuing this line of reasoning, one might consider iterating the estimator by returning to (11-23) with the two-step FGLS estimator, recomputing the weights, then returning to (11-22) to recompute the slope vector. This can be continued until convergence. It can be shown [see Oberhofer and Kmenta (1974)] that so long as (11-23) is used without a degrees of freedom correction, then if this does converge, it will do so at the maximum likelihood estimator (with normally distributed disturbances). Another method of estimating this model is to treat it as a form of Harvey's model of multiplicative heteroscedasticity where $z_i$ is a set (minus one) of group dummy variables.

    For testing the homoscedasticity assumption in this model, one can use a likelihood ratio test. The log-likelihood function, assuming homoscedasticity, is

    $$\ln L_0 = -\frac{n}{2}\left[1 + \ln 2\pi + \ln(e'e/n)\right],$$

    where $n = \sum_g n_g$ is the total number of observations. Under the alternative hypothesis of heteroscedasticity across $G$ groups, the log-likelihood function is

    $$\ln L_1 = -\frac{n}{2}\ln(2\pi) - \frac{1}{2}\sum_{g=1}^{G} n_g \ln\sigma_g^2 - \frac{1}{2}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\frac{\varepsilon_{ig}^2}{\sigma_g^2}. \qquad (11\text{-}24)$$

    The maximum likelihood estimators of $\sigma^2$ and $\sigma_g^2$ are $e'e/n$ and $\hat\sigma_g^2$ from (11-23), respectively. The OLS and maximum likelihood estimators of $\beta$ are used for the slope vector under the null and alternative hypothesis, respectively. If we evaluate $\ln L_0$ and $\ln L_1$ at these estimates, then the likelihood ratio test statistic for homoscedasticity is

    $$-2(\ln L_0 - \ln L_1) = n \ln s^2 - \sum_{g=1}^{G} n_g \ln s_g^2.$$

    Under the null hypothesis, the statistic has a limiting chi-squared distribution with $G - 1$ degrees of freedom.
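
    The likelihood ratio statistic is simple to compute once the residuals are available. The sketch below takes the pooled OLS residuals and the group-specific residuals from the heteroscedastic fit as inputs; in the toy example the same residuals are used for both purely for illustration. The function name `groupwise_lr` and the numbers are hypothetical.

```python
import math

def groupwise_lr(ols_residuals, ml_residual_groups):
    """n ln s^2 - sum_g n_g ln s_g^2, with no degrees of freedom corrections."""
    n = len(ols_residuals)
    s2 = sum(e * e for e in ols_residuals) / n            # e'e / n under the null
    stat = n * math.log(s2)
    for group in ml_residual_groups:
        s2_g = sum(e * e for e in group) / len(group)     # e_g'e_g / n_g
        stat -= len(group) * math.log(s2_g)
    return stat

# Two groups with variances 1 and 4 (toy residuals, assumed for illustration).
groups = [[1.0, -1.0, 1.0, -1.0], [2.0, -2.0, 2.0, -2.0]]
pooled = [e for group in groups for e in group]
lr = groupwise_lr(pooled, groups)

# Equal group variances give a statistic of zero.
same = [[1.0, -1.0, 1.0, -1.0], [1.0, -1.0, 1.0, -1.0]]
lr_equal = groupwise_lr([e for group in same for e in group], same)
```

    The result would be compared with a chi-squared critical value with $G - 1$ degrees of freedom, here one.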
    Example 11.6 Heteroscedastic Cost Function for Airline Production

    To illustrate the computations for the groupwise heteroscedastic model, we will reexamine the cost model for the total cost of production in the airline industry that was fit in Example 7.2.

    TABLE 11.4  Least Squares and Maximum Likelihood Estimates of a Groupwise Heteroscedasticity Model

                    Least Squares: Homoscedastic         Maximum Likelihood
                    Estimate    Std. Error   t Ratio     Estimate     Std. Error   t Ratio
    $\beta_1$       9.706       0.193        50.25       10.057       0.134        74.86
    $\beta_2$       0.418       0.0152       27.47       0.400        0.0108       37.12
    $\beta_3$       1.070       0.202        5.30        1.129        0.164        7.87
    $\beta_4$       0.919       0.0299       30.76       0.928        0.0228       40.86
    $\delta_2$      0.0412      0.0252       1.64        0.0487       0.0237       2.06
    $\delta_3$      0.209       0.0428       4.88        0.200        0.0308       6.49
    $\delta_4$      0.185       0.0608       3.04        0.192        0.0499       3.852
    $\delta_5$      0.0241      0.0799       0.30        0.0419       0.0594       0.71
    $\delta_6$      0.0871      0.0842       1.03        0.0963       0.0631       1.572
    $\gamma_1$                                           7.088        0.365        19.41
    $\gamma_2$                                           2.007        0.516        3.89
    $\gamma_3$                                           0.758        0.516        1.47
    $\gamma_4$                                           2.239        0.516        4.62
    $\gamma_5$                                           0.530        0.516        1.03
    $\gamma_6$                                           1.053        0.516        2.04
    $\sigma_1^2$    0.001479                             0.0008349
    $\sigma_2^2$    0.004935                             0.006212
    $\sigma_3^2$    0.001888                             0.001781
    $\sigma_4^2$    0.005834                             0.009071
    $\sigma_5^2$    0.002338                             0.001419
    $\sigma_6^2$    0.003032                             0.002393
    Least squares: R^2 = 0.997, s^2 = 0.003613, ln L = 130.0862
    Maximum likelihood: ln L = 140.7591

    A description of the data appears in the earlier example. For a sample of six airlines observed annually for 15 years, we fit the cost function

    $$\ln \text{cost}_{it} = \beta_1 + \beta_2 \ln \text{output}_{it} + \beta_3\,\text{load factor}_{it} + \beta_4 \ln \text{fuel price}_{it} + \delta_2\,\text{Firm}_2 + \delta_3\,\text{Firm}_3 + \delta_4\,\text{Firm}_4 + \delta_5\,\text{Firm}_5 + \delta_6\,\text{Firm}_6 + \varepsilon_{it}.$$

    Output is measured in "revenue passenger miles." The load factor is a rate of capacity utilization; it is the average rate at which seats on the airline's planes are filled. More complete models of costs include other factor prices (materials, capital) and, perhaps, a quadratic term in log output to allow for variable economies of scale. The "firm $j$" terms are firm specific dummy variables.

    Ordinary least squares regression produces the set of results at the left side of Table 11.4. The variance estimates shown at the bottom of the table are the firm specific variance estimates in (11-23). The results so far are what one might expect. There are substantial economies of scale; $\text{e.s.}_{it} = (1/0.919) - 1 = 0.088$. The fuel price and load factors affect costs in the predictable fashions as well. (Fuel prices differ because of different mixes of types and regional differences in supply characteristics.) The second set of results shows the model of groupwise heteroscedasticity. From the least squares variance estimates in the first set of results, which are quite different, one might guess that a test of homoscedasticity would lead to rejection of the hypothesis. The easiest computation is the likelihood ratio test. Based on the log-likelihood functions in the last row of the table, the test statistic, which has a limiting chi-squared distribution with 5 degrees of freedom, equals 21.3458. The critical value from the table is 11.07, so the hypothesis of homoscedasticity is rejected.
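
    The two calculations quoted in this example follow directly from the reported values. The short sketch below repeats the arithmetic, using the log-likelihoods from Table 11.4, the coefficient 0.919 quoted in the text, and the critical value 11.07 as given; the variable names are hypothetical.

```python
# Log-likelihoods reported in Table 11.4.
ln_l_homoscedastic = 130.0862
ln_l_groupwise = 140.7591

# Likelihood ratio statistic: twice the gain in the log-likelihood.
lr_statistic = 2.0 * (ln_l_groupwise - ln_l_homoscedastic)

# Critical value for 5 degrees of freedom, as quoted in the text.
critical_value = 11.07
reject_homoscedasticity = lr_statistic > critical_value

# Economies of scale measure quoted in the text.
economies_of_scale = 1.0 / 0.919 - 1.0
```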

    11.8 AUTOREGRESSIVE CONDITIONAL HETEROSCEDASTICITY

    Heteroscedasticity is often associated with cross-sectional data, whereas time series are usually studied in the context of homoscedastic processes. In analyses of macroeconomic data, Engle (1982, 1983) and Cragg (1982) found evidence that for some kinds of data, the disturbance variances in time-series models were less stable than usually assumed. Engle's results suggested that in models of inflation, large and small forecast errors appeared to occur in clusters, suggesting a form of heteroscedasticity in which the variance of the forecast error depends on the size of the previous disturbance. He suggested the autoregressive, conditionally heteroscedastic, or ARCH, model as an alternative to the usual time-series process. More recent studies of financial markets suggest that the phenomenon is quite common. The ARCH model has proven to be useful in studying the volatility of inflation [Coulson and Robins (1985)], the term structure of interest rates [Engle, Hendry, and Trumbull (1985)], the volatility of stock market returns [Engle, Lilien, and Robins (1987)], and the behavior of foreign exchange markets [Domowitz and Hakkio (1985) and Bollerslev and Ghysels (1996)], to name but a few. This section will describe specification, estimation, and testing in the basic formulations of the ARCH model and some extensions.[21]

    Example 11.7 Stochastic Volatility

    Figure 11.3 shows Bollerslev and Ghysels's 1,974 observations on the daily percentage nominal return for the Deutschmark/Pound exchange rate. (These data are given in Appendix Table F11.1.) The variation in the series appears to be fluctuating, with several clusters of large and small movements.
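    11.8.1 THE ARCH(1) MODEL

    The simplest form of this model is the ARCH(1) model,

    $$y_t = x_t'\beta + \varepsilon_t,$$

    $$\varepsilon_t = u_t\sqrt{\alpha_0 + \alpha_1\varepsilon_{t-1}^2}, \qquad (11\text{-}25)$$

    where $u_t$ is distributed as standard normal.[22] It follows that $E[\varepsilon_t \mid x_t, \varepsilon_{t-1}] = 0$, so that $E[\varepsilon_t \mid x_t] = 0$ and $E[y_t \mid x_t] = x_t'\beta$. Therefore, this model is a classical regression model. But

    $$\text{Var}[\varepsilon_t \mid \varepsilon_{t-1}] = E[\varepsilon_t^2 \mid \varepsilon_{t-1}] = E[u_t^2]\left[\alpha_0 + \alpha_1\varepsilon_{t-1}^2\right] = \alpha_0 + \alpha_1\varepsilon_{t-1}^2,$$

    so $\varepsilon_t$ is conditionally heteroscedastic, not with respect to $x_t$ as we considered in the preceding sections, but with respect to $\varepsilon_{t-1}$. The unconditional variance of $\varepsilon_t$ is

    $$\text{Var}[\varepsilon_t] = \text{Var}\left[E(\varepsilon_t \mid \varepsilon_{t-1})\right] + E\left[\text{Var}(\varepsilon_t \mid \varepsilon_{t-1})\right] = \alpha_0 + \alpha_1 E[\varepsilon_{t-1}^2] = \alpha_0 + \alpha_1\text{Var}[\varepsilon_{t-1}].$$

    [21] Engle and Rothschild (1992) give a recent survey of this literature which describes many extensions. Mills (1993) also presents several applications. See, as well, Bollerslev (1986) and Li, Ling, and McAleer (2001). See McCullough and Renfro (1999) for discussion of estimation of this model.

    [22] The assumption that $u_t$ has unit variance is not a restriction. The scaling implied by any other variance would be absorbed by the other parameters.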

    FIGURE 11.3  Nominal Exchange Rate Returns. (Vertical axis: Y; horizontal axis: Observ.)

    If the process generating the disturbances is weakly (covariance) stationary (see Definition 12.2),[23] then the unconditional variance is not changing over time, so

    $$\text{Var}[\varepsilon_t] = \text{Var}[\varepsilon_{t-1}] = \alpha_0 + \alpha_1\text{Var}[\varepsilon_{t-1}] = \frac{\alpha_0}{1 - \alpha_1}.$$

    For this ratio to be finite and positive, $|\alpha_1|$ must be less than 1. Then, unconditionally, $\varepsilon_t$ is distributed with mean zero and variance $\sigma^2 = \alpha_0/(1 - \alpha_1)$. Therefore, the model obeys the classical assumptions, and ordinary least squares is the most efficient linear unbiased estimator of $\beta$.

    But there is a more efficient nonlinear estimator. The log-likelihood function for this model is given by Engle (1982). Conditioned on starting values $y_0$ and $x_0$ (and $\varepsilon_0$), the conditional log-likelihood for observations $t = 1, \ldots, T$ is the one we examined in Section 11.6.2 for the general heteroscedastic regression model [see (11-19)],

    $$\ln L = -\frac{T}{2}\ln(2\pi) - \frac{1}{2}\sum_{t=1}^{T}\ln\left(\alpha_0 + \alpha_1\varepsilon_{t-1}^2\right) - \frac{1}{2}\sum_{t=1}^{T}\frac{\varepsilon_t^2}{\alpha_0 + \alpha_1\varepsilon_{t-1}^2}, \quad \varepsilon_t = y_t - x_t'\beta. \qquad (11\text{-}26)$$

    Maximization of $\ln L$ can be done with the conventional methods, as discussed in Appendix E.[24]

    [23] This discussion will draw on the results and terminology of time series analysis in Section 12.3 and Chapter 20. The reader may wish to peruse this material at this point.

    [24] Engle (1982) and Judge et al. (1985, pp. 441-444) suggest a four-step procedure based on the method of scoring that resembles the two-step method we used for the multiplicative heteroscedasticity model in Section 11.6. However, the full MLE is now incorporated in most modern software, so the simple regression based methods, which are difficult to generalize, are less attractive in the current literature. But, see McCullough and Renfro (1999) and Fiorentini, Calzolari, and Panattoni (1996) for commentary and some cautions related to maximum likelihood estimation.
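
    To make the ARCH(1) results concrete, the following sketch simulates the disturbance process in (11-25), compares the sample variance with $\alpha_0/(1 - \alpha_1)$, and evaluates the conditional log-likelihood (11-26) for a series of disturbances. It assumes there are no regressors (so the disturbances are observed directly), a starting value $\varepsilon_0 = 0$, and illustrative parameter values; the function names are hypothetical.

```python
import math
import random

def simulate_arch1(T, a0, a1, seed=0):
    """Draw eps_t = u_t * sqrt(a0 + a1 * eps_{t-1}^2) with u_t standard normal."""
    rng = random.Random(seed)
    eps, prev = [], 0.0
    for _ in range(T):
        prev = rng.gauss(0.0, 1.0) * math.sqrt(a0 + a1 * prev * prev)
        eps.append(prev)
    return eps

def arch1_loglik(eps, a0, a1, eps0=0.0):
    """Conditional log-likelihood (11-26) given the disturbances and the start value eps0."""
    total, prev = 0.0, eps0
    for e in eps:
        s2 = a0 + a1 * prev * prev                  # conditional variance at time t
        total += -0.5 * (math.log(2.0 * math.pi) + math.log(s2) + e * e / s2)
        prev = e
    return total

# Illustrative parameter values (assumed): alpha0 = 1, alpha1 = 0.3.
eps = simulate_arch1(50000, 1.0, 0.3)
sample_variance = sum(e * e for e in eps) / len(eps)
implied_variance = 1.0 / (1.0 - 0.3)                # alpha0 / (1 - alpha1)

# Log-likelihood at the ARCH parameters versus a homoscedastic fit.
loglik_arch = arch1_loglik(eps, 1.0, 0.3)
loglik_homoscedastic = arch1_loglik(eps, sample_variance, 0.0)
```

    Setting $\alpha_1 = 0$ reduces the function to the ordinary normal log-likelihood with constant variance $\alpha_0$, which is a convenient check on the code.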

    11.8.2 ARCH(q), ARCH-IN-MEAN AND GENERALIZED ARCH MODELS

    The natural extension of the ARCH(1) model presented before is a more general model with longer lags. The ARCH($q$) process,

    $$\sigma_t^2 = \alpha_0 + \alpha_1\varepsilon_{t-1}^2 + \alpha_2\varepsilon_{t-2}^2 + \cdots + \alpha_q\varepsilon_{t-q}^2,$$

    is a $q$th order moving average [MA($q$)] process. (Much of the analysis of the model parallels the results in Chapter 20 for more general time series models.) [Once again, see Engle (1982).] This section will generalize the ARCH($q$) model, as suggested by Bollerslev (1986), in the direction of the autoregressive-moving average (ARMA) models of Section 20.2.1. The discussion will parallel his development, although many details are omitted for brevity. The reader is referred to that paper for background and for some of the less critical details.

    The capital asset pricing model (CAPM) is discussed briefly in Chapter 14. Among the many variants of this model is an intertemporal formulation by Merton (1980) that suggests an approximate linear relationship between the return and variance of the market portfolio. One of the possible flaws in this model is its assumption of a constant variance of the market portfolio. In this connection, then, the ARCH-in-Mean, or ARCH-M, model suggested by Engle, Lilien, and Robins (1987) is a natural extension. The model states that

    $$y_t = x_t'\beta + \delta\sigma_t^2 + \varepsilon_t,$$

    $$\text{Var}[\varepsilon_t \mid \Psi_t] = \text{ARCH}(q).$$

    Among the interesting implications of this modification of the standard model is that under certain assumptions, $\delta$ is the coefficient of relative risk aversion. The ARCH-M model has been applied in a wide variety of studies of volatility in asset returns, including the daily Standard and Poor's Index [French, Schwert, and Stambaugh (1987)] and weekly New York Stock Exchange returns [Chou (1988)]. A lengthy list of applications is given in Bollerslev, Chou, and Kroner (1992).

    The ARCH-M model has several noteworthy statistical characteristics. Unlike the standard regression model, misspecification of the variance function does affect the consistency of estimators of the parameters of the mean. [See Pagan and Ullah (1988) for formal analysis of this point.] Recall that in the classical regression setting, weighted least squares is consistent even if the weights are misspecified as long as the weights are uncorrelated with the disturbances. That is not true here. If the ARCH part of the model is misspecified, then conventional estimators of $\beta$ and $\delta$ will not be consistent. Bollerslev, Chou, and Kroner (1992) list a large number of studies that called into question the specification of the ARCH-M model, and they subsequently obtained quite different results after respecifying the model. A closely related practical problem is that the mean and variance parameters in this model are no longer uncorrelated. In analysis up to this point, we made quite profitable use of the block diagonality of the Hessian of the log-likelihood function for the model of heteroscedasticity. But the Hessian for the ARCH-M model is not block diagonal. In practical terms, the estimation problem cannot be segmented as we have done previously with the heteroscedastic regression model. All the parameters must be estimated simultaneously.

    The model of generalized autoregressive conditional heteroscedasticity (GARCH) is defined as follows.[25] The underlying regression is the usual one in (11-25). Conditioned on an information set at time $t$, denoted $\Psi_t$, the distribution of the disturbance is assumed to be

    $$\varepsilon_t \mid \Psi_t \sim N[0, \sigma_t^2],$$

    where the conditional variance is

    $$\sigma_t^2 = \alpha_0 + \delta_1\sigma_{t-1}^2 + \delta_2\sigma_{t-2}^2 + \cdots + \delta_p\sigma_{t-p}^2 + \alpha_1\varepsilon_{t-1}^2 + \alpha_2\varepsilon_{t-2}^2 + \cdots + \alpha_q\varepsilon_{t-q}^2. \qquad (11\text{-}27)$$

    Define

    $$z_t = [1, \sigma_{t-1}^2, \sigma_{t-2}^2, \ldots, \sigma_{t-p}^2, \varepsilon_{t-1}^2, \varepsilon_{t-2}^2, \ldots, \varepsilon_{t-q}^2]'$$

    and

    $$\gamma = [\alpha_0, \delta_1, \delta_2, \ldots, \delta_p, \alpha_1, \ldots, \alpha_q]' = [\alpha_0, \delta', \alpha']'.$$

    Then

    $$\sigma_t^2 = \gamma' z_t.$$

    Notice that the conditional variance is defined by an autoregressive-moving average [ARMA($p$, $q$)] process in the innovations $\varepsilon_t^2$, exactly as in Section 20.2.1. The difference here is that the mean of the random variable of interest $y_t$ is described completely by a heteroscedastic, but otherwise ordinary, regression model. The conditional variance, however, evolves over time in what might be a very complicated manner, depending on the parameter values and on $p$ and $q$. The model in (11-27) is a GARCH($p$, $q$) model, where $p$ refers, as before, to the order of the autoregressive part.[26] As Bollerslev (1986) demonstrates with an example, the virtue of this approach is that a GARCH model with a small number of terms appears to perform as well as or better than an ARCH model with many.

    The stationarity conditions discussed in Section 20.2.2 are important in this context to ensure that the moments of the normal distribution are finite. The reason is that higher moments of the normal distribution are finite powers of the variance. A normal distribution with variance $\sigma_t^2$ has fourth moment $3\sigma_t^4$, sixth moment $15\sigma_t^6$, and so on. [The precise relationship of the even moments of the normal distribution to the variance is $\mu_{2k} = (\sigma^2)^k (2k)!/(k!\,2^k)$.] Simply ensuring that $\sigma_t^2$ is stable does not ensure that higher powers are as well.[27] Bollerslev presents a useful figure that shows the conditions needed to ensure stability for moments up to order 12 for a GARCH(1, 1) model and gives some additional discussion. For example, for a GARCH(1, 1) process, for the fourth moment to exist, $3\alpha_1^2 + 2\alpha_1\delta_1 + \delta_1^2$ must be less than 1.

    [25] As have most areas in time-series econometrics, the line of literature on GARCH models has progressed rapidly in recent years and will surely continue to do so. We have presented Bollerslev's model in some detail, despite many recent extensions, not only to introduce the topic as a bridge to the literature, but also because it provides a convenient and interesting setting in which to discuss several related topics such as double-length regression and pseudo-maximum likelihood estimation.

    [26] We have changed Bollerslev's notation slightly so as not to conflict with our previous presentation. He used $\beta$ instead of our $\delta$ in (11-27) and $b$ instead of our $\beta$ in (11-25).

    [27] The conditions cannot be imposed a priori. In fact, there is no nonzero set of parameters that guarantees stability of all moments, even though the normal distribution has finite moments of all orders. As such, the normality assumption must be viewed as an approximation.
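
    The moment formula and the fourth-moment condition quoted above are easy to check numerically. The sketch below evaluates $\mu_{2k} = (\sigma^2)^k (2k)!/(k!\,2^k)$ and the GARCH(1, 1) condition $3\alpha_1^2 + 2\alpha_1\delta_1 + \delta_1^2 < 1$; the function names and the parameter values are hypothetical.

```python
import math

def normal_even_moment(k, variance):
    """mu_{2k} = variance^k (2k)! / (k! 2^k) for a mean-zero normal variable."""
    return variance ** k * math.factorial(2 * k) / (math.factorial(k) * 2 ** k)

def garch11_fourth_moment_exists(alpha1, delta1):
    """Condition quoted in the text for a finite fourth moment in a GARCH(1, 1) process."""
    return 3.0 * alpha1 ** 2 + 2.0 * alpha1 * delta1 + delta1 ** 2 < 1.0

fourth = normal_even_moment(2, 1.0)     # coefficient on sigma^4
sixth = normal_even_moment(3, 1.0)      # coefficient on sigma^6

# alpha1 + delta1 < 1 in both cases, but only the first has a finite fourth moment.
mild = garch11_fourth_moment_exists(0.1, 0.8)
persistent = garch11_fourth_moment_exists(0.3, 0.69)
```

    The second parameter pair illustrates the point in the text: the variance process is stable, since $\alpha_1 + \delta_1 = 0.99$, yet the fourth-moment condition fails.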

    It is convenient to write (11-27) in terms of polynomials in the lag operator:

    $$\sigma_t^2 = \alpha_0 + D(L)\sigma_t^2 + A(L)\varepsilon_t^2.$$

    As discussed in Section 20.2.2, the stationarity condition for such an equation is that the roots of the characteristic equation, $1 - D(z) = 0$, must lie outside the unit circle. For the present, we will assume that this case is true for the model we are considering and that $A(1) + D(1) < 1$. [This assumption is stronger than that needed to ensure stationarity in a higher-order autoregressive model, which would depend only on $D(L)$.] The implication is that the GARCH process is covariance stationary with $E[\varepsilon_t] = 0$ (unconditionally), $\text{Var}[\varepsilon_t] = \alpha_0/[1 - A(1) - D(1)]$, and $\text{Cov}[\varepsilon_t, \varepsilon_s] = 0$ for all $t \ne s$. Thus, unconditionally the model is the classical regression model that we examined in Chapters 2-8.

    The usefulness of the GARCH specification is that it allows the variance to evolve over time in a way that is much more general than the simple specification of the ARCH model. The comparison between simple finite-distributed lag models and the dynamic regression model discussed in Chapter 19 is analogous. For the example discussed in his paper, Bollerslev reports that although Engle and Kraft's (1983) ARCH(8) model for the rate of inflation in the GNP deflator appears to remove all ARCH effects, a closer look reveals GARCH effects at several lags. By fitting a GARCH(1, 1) model to the same data, Bollerslev finds that the ARCH effects out to the same eight-period lag as fit by Engle and Kraft and his observed GARCH effects are all satisfactorily accounted for.
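
    The unconditional variance formula can be illustrated for a GARCH(1, 1) process. Taking expectations of the variance equation and using $E[\varepsilon_{t-1}^2] = E[\sigma_{t-1}^2]$ gives the recursion $E[\sigma_t^2] = \alpha_0 + (\alpha_1 + \delta_1)E[\sigma_{t-1}^2]$, which converges to $\alpha_0/[1 - A(1) - D(1)]$ when $A(1) + D(1) < 1$. The sketch below iterates that recursion; the function names and parameter values are hypothetical.

```python
def garch_unconditional_variance(alpha0, alphas, deltas):
    """alpha0 / [1 - A(1) - D(1)], defined only when A(1) + D(1) < 1."""
    persistence = sum(alphas) + sum(deltas)
    if persistence >= 1.0:
        raise ValueError("A(1) + D(1) must be less than 1")
    return alpha0 / (1.0 - persistence)

def expected_variance_path(alpha0, alpha1, delta1, start, steps):
    """Iterate E[sigma_t^2] = alpha0 + (alpha1 + delta1) E[sigma_{t-1}^2] for a GARCH(1, 1)."""
    path = [start]
    for _ in range(steps):
        path.append(alpha0 + (alpha1 + delta1) * path[-1])
    return path

# Illustrative values (assumed): alpha0 = 0.2, alpha1 = 0.1, delta1 = 0.8.
long_run = garch_unconditional_variance(0.2, [0.1], [0.8])
path = expected_variance_path(0.2, 0.1, 0.8, start=10.0, steps=500)
```

    Starting from any initial value, the path settles at the long-run variance; the closer $A(1) + D(1)$ is to 1, the slower the adjustment.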
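    11.8.3 MAXIMUM LIKELIHOOD ESTIMATION OF THE GARCH MODEL

    Bollerslev describes a method of estimation based on the BHHH algorithm. As he shows, the method is relatively simple, although with the line search and first derivative method that he suggests, it probably involves more computation and more iterations than necessary. Following the suggestions of Harvey (1976), it turns out that there is a simpler way to estimate the GARCH model that is also very illuminating. This model is actually very similar to the more conventional model of multiplicative heteroscedasticity that we examined in Section 11.7.1.

    For normally distributed disturbances, the log-likelihood for a sample of $T$ observations is

    $$\ln L = \sum_{t=1}^{T} -\frac{1}{2}\left[\ln(2\pi) + \ln\sigma_t^2 + \frac{\varepsilon_t^2}{\sigma_t^2}\right] = \sum_{t=1}^{T}\ln f_t(\theta) = \sum_{t=1}^{T} l_t(\theta),\ [28]$$

    where $\varepsilon_t = y_t - x_t'\beta$ and $\theta = (\beta', \alpha_0, \alpha', \delta')' = (\beta', \gamma')'$. Derivatives of $\ln L$ are obtained by summation. Let $l_t$ denote $\ln f_t(\theta)$. The first derivatives with respect to the variance parameters are

    $$\frac{\partial l_t}{\partial\gamma} = -\frac{1}{2}\left[\frac{1}{\sigma_t^2} - \frac{\varepsilon_t^2}{(\sigma_t^2)^2}\right]\frac{\partial\sigma_t^2}{\partial\gamma} = \frac{1}{2}\left(\frac{1}{\sigma_t^2}\right)\frac{\partial\sigma_t^2}{\partial\gamma}\left(\frac{\varepsilon_t^2}{\sigma_t^2} - 1\right) = \frac{1}{2}\left(\frac{1}{\sigma_t^2}\right) g_t v_t = b_t v_t. \qquad (11\text{-}28)$$

    [28] There are three minor errors in Bollerslev's derivation that we note here to avoid the apparent inconsistencies. In his (22), $\frac{1}{2}h_t$ should be $\frac{1}{2}h_t^{-1}$. In (23), $-2h_t^{-2}$ should be $-h_t^{-2}$. In (28), $h\,\partial h/\partial\omega$ should, in each case, be $(1/h)\,\partial h/\partial\omega$. [In his (8), $\alpha_0\alpha_1$ should be $\alpha_0 + \alpha_1$, but this has no implications for our derivation.]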

    Note that $E[v_t] = 0$. Suppose, for now, that there are no regression parameters. Newton's method for estimating the variance parameters would be

    $$\hat\gamma^{i+1} = \hat\gamma^{i} - H^{-1}g, \qquad (11\text{-}29)$$

    where $H$ indicates the Hessian and $g$ is the first derivatives vector. Following Harvey's suggestion (see Section 11.7.1), we will use the method of scoring instead. To do this, we make use of $E[v_t] = 0$ and $E[\varepsilon_t^2/\sigma_t^2] = 1$. After taking expectations in (11-28), the iteration reduces to a linear regression of $v_{*t} = (1/\sqrt{2})\,v_t$ on regressors $w_{*t} = (1/\sqrt{2})\,g_t/\sigma_t^2$. That is,

    $$\hat\gamma^{i+1} = \hat\gamma^{i} + [W_*'W_*]^{-1}W_*'v_* = \hat\gamma^{i} + [W_*'W_*]^{-1}\left(\frac{\partial \ln L}{\partial\gamma}\right), \qquad (11\text{-}30)$$

    where row $t$ of $W_*$ is $w_{*t}'$. The iteration has converged when the slope vector is zero, which happens when the first derivative vector is zero. When the iterations are complete, the estimated asymptotic covariance matrix is simply

    $$\text{Est.Asy.Var}[\hat\gamma] = [\hat W_*'\hat W_*]^{-1}$$

    based on the estimated parameters.

    The usefulness of the result just given is that $E[\partial^2 \ln L/\partial\gamma\,\partial\beta']$ is, in fact, zero. Since the expected Hessian is block diagonal, applying the method of scoring to the full parameter vector can proceed in two parts, exactly as it did in Section 11.7.1 for the multiplicative heteroscedasticity model. That is, the updates for the mean and variance parameter vectors can be computed separately. Consider then the slope parameters, $\beta$. The same type of modified scoring method as used earlier produces the iteration
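
    For the simplest case, an ARCH(1) variance with no regression parameters, the regression form of the scoring step in (11-30) can be sketched as follows. Here $g_t = [1, \varepsilon_{t-1}^2]'$, and the factors of $1/\sqrt{2}$ cancel in the normal equations. The sketch assumes simulated disturbances and adds a step-halving safeguard that is not part of the iteration described in the text; all function names are hypothetical.

```python
import math
import random

def loglik(eps, a0, a1):
    """ARCH(1) log-likelihood for observations 2..T, conditioning on the first one."""
    total = 0.0
    for prev, e in zip(eps[:-1], eps[1:]):
        s2 = a0 + a1 * prev * prev
        if s2 <= 0.0:
            return -math.inf
        total += -0.5 * (math.log(2.0 * math.pi) + math.log(s2) + e * e / s2)
    return total

def scoring_step(eps, a0, a1):
    """Slopes in the regression of v_t on g_t / sigma_t^2, as in (11-30)."""
    s11 = s12 = s22 = r1 = r2 = 0.0
    for prev, e in zip(eps[:-1], eps[1:]):
        s2 = a0 + a1 * prev * prev
        w1, w2 = 1.0 / s2, prev * prev / s2          # g_t / sigma_t^2
        v = e * e / s2 - 1.0
        s11 += w1 * w1
        s12 += w1 * w2
        s22 += w2 * w2
        r1 += w1 * v
        r2 += w2 * v
    det = s11 * s22 - s12 * s12
    return (s22 * r1 - s12 * r2) / det, (s11 * r2 - s12 * r1) / det

def fit_arch1(eps, tol=1e-8, max_iter=100):
    a0 = sum(e * e for e in eps) / len(eps)          # start at s^2 and zero
    a1 = 0.0
    for _ in range(max_iter):
        d0, d1 = scoring_step(eps, a0, a1)
        lam, base = 1.0, loglik(eps, a0, a1)
        # Step halving: an added safeguard, not part of the iteration in the text.
        while loglik(eps, a0 + lam * d0, a1 + lam * d1) < base and lam > 1e-8:
            lam *= 0.5
        a0, a1 = a0 + lam * d0, a1 + lam * d1
        if max(abs(lam * d0), abs(lam * d1)) < tol:
            break
    return a0, a1

# Simulated disturbances (assumed for illustration): alpha0 = 1, alpha1 = 0.4.
rng = random.Random(1)
eps, prev = [], 0.0
for _ in range(5000):
    prev = rng.gauss(0.0, 1.0) * math.sqrt(1.0 + 0.4 * prev * prev)
    eps.append(prev)

start_loglik = loglik(eps, sum(e * e for e in eps) / len(eps), 0.0)
a0_hat, a1_hat = fit_arch1(eps)
final_loglik = loglik(eps, a0_hat, a1_hat)
```

    The first step from the starting values is essentially a least squares regression of the squared disturbance on a constant and its own lag, which is why that regression is a natural source of starting values.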
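    $$\hat\beta^{i+1} = \hat\beta^{i} + \left[\sum_{t=1}^{T}\frac{x_t x_t'}{\sigma_t^2} + \frac{1}{2}\sum_{t=1}^{T}\left(\frac{d_t}{\sigma_t^2}\right)\left(\frac{d_t}{\sigma_t^2}\right)'\right]^{-1}\left[\sum_{t=1}^{T}\frac{x_t\varepsilon_t}{\sigma_t^2} + \frac{1}{2}\sum_{t=1}^{T}\left(\frac{d_t}{\sigma_t^2}\right)v_t\right]$$

    $$= \hat\beta^{i} + \left[\sum_{t=1}^{T}\frac{x_t x_t'}{\sigma_t^2} + \frac{1}{2}\sum_{t=1}^{T}\left(\frac{d_t}{\sigma_t^2}\right)\left(\frac{d_t}{\sigma_t^2}\right)'\right]^{-1}\left(\frac{\partial\ln L}{\partial\beta}\right) = \hat\beta^{i} + h^{i}, \tag{11-31}$$

    which has been referred to as a double-length regression. [See Orme (1990) and Davidson and MacKinnon (1993, Chapter 14).] The update vector $h^i$ is the vector of slopes in an augmented or double-length generalized regression,

    $$h^{i} = [C'\Omega^{-1}C]^{-1}[C'\Omega^{-1}a], \tag{11-32}$$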

    where $C$ is a $2T \times K$ matrix whose first $T$ rows are the $X$ from the original regression model and whose next $T$ rows are $(1/\sqrt{2})\,d_t'/\sigma_t^2$, $t = 1, \ldots, T$; $a$ is a $2T \times 1$ vector whose first $T$ elements are $\varepsilon_t$ and whose next $T$ elements are $(1/\sqrt{2})\,v_t$, $t = 1, \ldots, T$; and $\Omega$ is a diagonal matrix with $\sigma_t^2$ in positions $1, \ldots, T$ and ones below observation $T$, so that $C'\Omega^{-1}C$ and $C'\Omega^{-1}a$ reproduce the two bracketed terms in (11-31). At convergence, $[C'\Omega^{-1}C]^{-1}$ provides the asymptotic covariance matrix for the MLE. The resemblance to the familiar result for the generalized regression model is striking, but note that this result is based on the double-length regression.
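
    The double-length form can be checked by writing the two products in (11-32) block by block. The sketch below assumes that the first $T$ diagonal elements of $\Omega^{-1}$ are $1/\sigma_t^2$ with ones thereafter, that the lower $T$ rows of $C$ are $(1/\sqrt{2})\,d_t'/\sigma_t^2$, and that the lower $T$ elements of $a$ are $(1/\sqrt{2})\,v_t$. The symbol $D$ for the lower block of $C$ is introduced here only for illustration.

```latex
\begin{aligned}
C &= \begin{bmatrix} X \\ D \end{bmatrix}, \qquad
\Omega^{-1} = \begin{bmatrix} \operatorname{diag}(1/\sigma_1^2, \ldots, 1/\sigma_T^2) & 0 \\ 0 & I_T \end{bmatrix}, \\[4pt]
C'\Omega^{-1}C &= \sum_{t=1}^{T} \frac{x_t x_t'}{\sigma_t^2}
  + \frac{1}{2}\sum_{t=1}^{T} \left(\frac{d_t}{\sigma_t^2}\right)\left(\frac{d_t}{\sigma_t^2}\right)', \\[4pt]
C'\Omega^{-1}a &= \sum_{t=1}^{T} \frac{x_t \varepsilon_t}{\sigma_t^2}
  + \frac{1}{2}\sum_{t=1}^{T} \left(\frac{d_t}{\sigma_t^2}\right) v_t .
\end{aligned}
```

    Under these assumptions the two products are exactly the inverse matrix and the gradient vector that appear in (11-31).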


    The iteration is done simply by computing the update vectors to the current parameters as defined above.²⁹ An important consideration is that to apply the scoring method, the estimates of $\beta$ and $\gamma$ are updated simultaneously. That is, one does not use the updated estimate of $\gamma$ in (11-30) to update the weights for the GLS regression to compute the new $\beta$ in (11-31). The same estimates (the results of the prior iteration) are used on the right-hand sides of both (11-30) and (11-31). The remaining problem is to obtain starting values for the iterations. One obvious choice is $b$, the OLS estimator, for $\beta$, $e'e/T = s^2$ for $\alpha_0$, and zero for all the remaining parameters. The OLS slope vector will be consistent under all specifications. A useful alternative in this context would be to start $\alpha$ at the vector of slopes in the least squares regression of $e_t^2$, the squared OLS residual, on a constant and $q$ lagged values.³⁰ As discussed below, an LM test for the presence of GARCH effects is then a by-product of the first iteration. In principle, the updated result of the first iteration is an efficient two-step estimator of all the parameters. But having gone to the full effort to set up the iterations, nothing is gained by not iterating to convergence. One virtue of allowing the procedure to iterate to convergence is that the resulting log-likelihood function can be used in likelihood ratio tests.
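
    The scoring iteration for the variance parameters can be sketched for the simplest case, an ARCH(1) model with no regression parameters, where $\sigma_t^2 = \alpha_0 + \alpha_1\varepsilon_{t-1}^2$ and $g_t = (1, \varepsilon_{t-1}^2)'$. The code below is an illustrative sketch, not a production estimator: the function names, the simulated data, the true values 0.5 and 0.4, and the step-halving safeguard are all assumptions made for the example. The starting values are $s^2$ for $\alpha_0$ and zero for $\alpha_1$, as suggested above.

```python
import random

def simulate_arch1(T, a0, a1, seed=0):
    """Draw eps_t = sqrt(a0 + a1*eps_{t-1}^2) * z_t with z_t ~ N(0, 1)."""
    rng = random.Random(seed)
    eps, prev = [], 0.0
    for _ in range(T + 200):                 # the first 200 draws are burn-in
        prev = (a0 + a1 * prev * prev) ** 0.5 * rng.gauss(0.0, 1.0)
        eps.append(prev)
    return eps[200:]

def scoring_arch1(eps, max_iter=100, tol=1e-8):
    """Method of scoring for (a0, a1): regress v_t on g_t / sigma_t^2."""
    T = len(eps)
    sq = [e * e for e in eps]
    lo, hi = min(sq), max(sq)
    a0, a1 = sum(sq) / T, 0.0                # starting values: s^2 and zero
    for _ in range(max_iter):
        # Accumulate W'W and W'v; the factors of 1/2 cancel in the update.
        s11 = s12 = s22 = r1 = r2 = 0.0
        for t in range(1, T):
            sig2 = a0 + a1 * sq[t - 1]
            v = sq[t] / sig2 - 1.0           # v_t = eps_t^2 / sigma_t^2 - 1
            w1, w2 = 1.0 / sig2, sq[t - 1] / sig2
            s11 += w1 * w1
            s12 += w1 * w2
            s22 += w2 * w2
            r1 += w1 * v
            r2 += w2 * v
        det = s11 * s22 - s12 * s12
        d0 = (s22 * r1 - s12 * r2) / det     # slope vector of the regression
        d1 = (s11 * r2 - s12 * r1) / det
        step = 1.0
        # Assumed safeguard: halve the step until every variance stays positive.
        while (a0 + step * d0 + (a1 + step * d1) * lo <= 0.0
               or a0 + step * d0 + (a1 + step * d1) * hi <= 0.0):
            step *= 0.5
        a0, a1 = a0 + step * d0, a1 + step * d1
        if abs(d0) + abs(d1) < tol:          # slope vector is (numerically) zero
            break
    return a0, a1

eps = simulate_arch1(10000, a0=0.5, a1=0.4, seed=42)
a0_hat, a1_hat = scoring_arch1(eps)
print(round(a0_hat, 3), round(a1_hat, 3))
```

    With constant starting variances, the first pass is just the least squares regression of $e_t^2$ on a constant and its lag, which is why the LM statistic comes out of the first iteration at no extra cost.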
    11.8.4 TESTING FOR GARCH EFFECTS

    The preceding development appears fairly complicated. In fact, it is not, since at each step, nothing more than a linear least squares regression is required. The intricate part of the computation is setting up the derivatives. On the other hand, it does take a fair amount of programming to get this far.³¹ As Bollerslev suggests, it might be useful to test for GARCH effects first. The simplest approach is to examine the squares of the least squares residuals. The autocorrelations (correlations with lagged values) of the squares of the residuals provide evidence about ARCH effects. An LM test of ARCH($q$) against the hypothesis of no ARCH effects [ARCH(0), the classical model] can be carried out by computing $\chi^2 = TR^2$ in the regression of $e_t^2$ on a constant and $q$ lagged values. Under the null hypothesis of no ARCH effects, the statistic has a limiting chi-squared distribution with $q$ degrees of freedom. Values larger than the critical table value give evidence of the presence of ARCH (or GARCH) effects. Bollerslev suggests a Lagrange multiplier statistic that is, in fact, surprisingly simple to compute. The LM test for GARCH($p$, 0) against GARCH($p$, $q$) can be carried out by referring $T$ times the $R^2$ in the linear regression defined in (11-30) to the chi-squared critical value with $q$ degrees of freedom. There is, unfortunately, an indeterminacy in this test procedure. The test for ARCH($q$) against GARCH($p$, $q$) is exactly the same as that for ARCH($p$) against ARCH($p + q$). For carrying out the test, one can use as

    29 See Fiorentini et al. (1996) on computation of derivatives in GARCH models.

    30 A test for the presence of $q$ ARCH effects against none can be carried out by carrying $TR^2$ from this regression into a table of critical values for the chi-squared distribution. But in the presence of GARCH effects, this procedure loses its validity.

    31 Since this procedure is available as a preprogrammed procedure in many computer programs, including TSP, EViews, Stata, RATS, LIMDEP, and Shazam, this warning might itself be overstated.


    TABLE 11.5  Maximum Likelihood Estimates of a GARCH(1, 1) Model³²

                   $\mu$        $\alpha_0$   $\alpha_1$   $\delta$   $\alpha_0/(1 - \alpha_1 - \delta)$
    Estimate       -0.006190    0.01076      0.1531       0.8060     0.2631
    Std. Error      0.00873     0.00312      0.0273       0.0302     0.594
    t ratio        -0.709       3.445        5.605        26.731     0.443

    $\ln L = -1106.61$, $\ln L_{OLS} = -1311.09$, $\bar{y} = -0.01642$, $s^2 = 0.221128$

    starting values a set of estimates that includes $\delta = 0$ and any consistent estimators for $\beta$ and $\alpha$. Then $TR^2$ for the regression at the initial iteration provides the test statistic.³³ A number of recent papers have questioned the use of test statistics based solely on normality. Wooldridge (1991) is a useful summary with several examples.
    Example 11.8 GARCH Model for Exchange Rate Volatility

    Bollerslev and Ghysels analyzed the exchange rate data in Example 11.7 using a GARCH(1, 1) model,

    $$y_t = \mu + \varepsilon_t, \qquad E[\varepsilon_t \mid \varepsilon_{t-1}] = 0, \qquad \text{Var}[\varepsilon_t \mid \varepsilon_{t-1}] = \sigma_t^2 = \alpha_0 + \alpha_1\varepsilon_{t-1}^2 + \delta\sigma_{t-1}^2.$$

    The least squares residuals for this model are simply $e_t = y_t - \bar{y}$. Regression of the squares of these residuals on a constant and 10 lagged squared values using observations 11 to 1974 produces an $R^2 = 0.025255$. With $T = 1964$, the chi-squared statistic is 49.60, which is larger than the critical value from the table of 18.31. We conclude that there is evidence of GARCH effects in these residuals. The maximum likelihood estimates of the GARCH model are given in Table 11.5. Note the resemblance between the OLS unconditional variance (0.221128) and the estimated equilibrium variance from the GARCH model, 0.2631.
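
    The $TR^2$ computation used in this example can be sketched as follows. The exchange rate data are not reproduced here, so the sketch applies the test to a simulated ARCH(1) series; the helper names `solve` and `arch_lm`, the simulation settings, and the seed are assumptions for illustration only. The last lines reproduce the arithmetic reported in the example from its own $T$ and $R^2$.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x

def arch_lm(resid, q):
    """Return (T, R2, T*R2) from regressing e_t^2 on a constant and q lags."""
    sq = [e * e for e in resid]
    y = sq[q:]
    X = [[1.0] + [sq[t - j] for j in range(1, q + 1)] for t in range(q, len(sq))]
    k = q + 1
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    b = solve(XtX, Xty)
    fitted = [sum(bi * xi for bi, xi in zip(b, row)) for row in X]
    ybar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - ybar) ** 2 for yi in y)
    r2 = 1.0 - sse / sst
    return len(y), r2, len(y) * r2

# Simulated ARCH(1) residuals (illustrative parameter values).
rng = random.Random(7)
e, prev = [], 0.0
for _ in range(2000):
    prev = (0.2 + 0.5 * prev * prev) ** 0.5 * rng.gauss(0.0, 1.0)
    e.append(prev)

T_used, r2, lm_stat = arch_lm(e, q=10)
print(T_used, round(r2, 4), round(lm_stat, 2))   # compare lm_stat with 18.31

# Arithmetic reported in Example 11.8: T = 1964, R^2 = 0.025255.
example_stat = 1964 * 0.025255
print(round(example_stat, 2))
```

    Ten lags cost ten observations, which is why the example's regression uses 1964 of the 1974 observations.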
    11.8.5 PSEUDO-MAXIMUM LIKELIHOOD ESTIMATION

    We now consider an implication of nonnormality of the disturbances. Suppose that the assumption of normality is weakened to only

    $$E[\varepsilon_t \mid \Psi_t] = 0, \qquad E\left[\frac{\varepsilon_t^2}{\sigma_t^2}\,\Big|\,\Psi_t\right] = 1, \qquad E\left[\frac{\varepsilon_t^4}{\sigma_t^4}\,\Big|\,\Psi_t\right] = \kappa < \infty,$$

    where $\sigma_t^2$ is as defined earlier. Now the normal log-likelihood function is inappropriate. In this case, the nonlinear (ordinary or weighted) least squares estimator would have the properties discussed in Chapter 9. It would be more difficult to compute than the MLE discussed earlier, however. It has been shown [see White (1982a) and Weiss (1982)] that the pseudo-MLE obtained by maximizing the same log-likelihood as if it were
    32 These data have become a standard data set for the evaluation of software for estimating GARCH models. The values given are the benchmark estimates. Standard errors differ substantially from one method to the next. Those given are the Bollerslev and Wooldridge (1992) results. See McCullough and Renfro (1999).

    33 Bollerslev argues that in view of the complexity of the computations involved in estimating the GARCH model, it is useful to have a test for GARCH effects. This case is one (as are many other maximum likelihood problems) in which the apparatus for carrying out the test is the same as that for estimating the model, however. Having computed the LM statistic for GARCH effects, one can proceed to estimate the model just by allowing the program to iterate to convergence. There is no additional cost beyond waiting for the answer.


    correct produces a consistent estimator despite the misspecification.³⁴ The asymptotic covariance matrices for the parameter estimators must be adjusted, however. The general result for cases such as this one [see Gourieroux, Monfort, and Trognon (1984)] is that the appropriate asymptotic covariance matrix for the pseudo-MLE of a parameter vector $\theta$ would be

    $$\text{Asy.Var}[\hat\theta] = H^{-1}FH^{-1}, \tag{11-33}$$

    where

    $$H = -E\left[\frac{\partial^2\ln L}{\partial\theta\,\partial\theta'}\right] \qquad \text{and} \qquad F = E\left[\left(\frac{\partial\ln L}{\partial\theta}\right)\left(\frac{\partial\ln L}{\partial\theta'}\right)\right]$$

    (that is, the BHHH estimator), and $\ln L$ is the used but inappropriate log-likelihood function. For current purposes, $H$ and $F$ are still block diagonal, so we can treat the mean and variance parameters separately. In addition, $E[v_t]$ is still zero, so the second derivative terms in both blocks are quite simple. (The parts involving $\partial^2\sigma_t^2/\partial\gamma\,\partial\gamma'$ and $\partial^2\sigma_t^2/\partial\beta\,\partial\beta'$ fall out of the expectation.) Taking expectations and inserting the parts produces the corrected asymptotic covariance matrix for the variance parameters:

    $$\text{Asy.Var}[\hat\gamma_{PMLE}] = [W_*'W_*]^{-1}B'B[W_*'W_*]^{-1},$$

    where the rows of $W_*$ are defined in (11-30) and those of $B$ are in (11-28). For the slope parameters, the adjusted asymptotic covariance matrix would be
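    $$\text{Asy.Var}[\hat\beta_{PMLE}] = [C'\Omega^{-1}C]^{-1}\left[\sum_{t=1}^{T} b_t b_t'\right][C'\Omega^{-1}C]^{-1},$$

    where the outer matrix is defined in (11-31) and, from the first derivatives given in (11-29) and (11-31),

    $$b_t = \frac{x_t\varepsilon_t}{\sigma_t^2} + \frac{1}{2}\left(\frac{v_t}{\sigma_t^2}\right)d_t.\text{³⁵}$$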
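
    The effect of the correction in (11-33) can be seen in a deliberately small case. The sketch below is an assumption-laden illustration rather than the GARCH computation itself: it takes a single variance parameter $\sigma^2$, uses the normal log-likelihood $l_t = -\frac{1}{2}[\ln\sigma^2 + \varepsilon_t^2/\sigma^2]$ as the pseudo-likelihood, and draws the data from a Laplace distribution, so normality is false by construction. The function name and the seed are hypothetical.

```python
import random

def pmle_sigma2_variances(eps):
    """Pseudo-MLE of sigma^2 with the naive and sandwich variance estimates."""
    T = len(eps)
    s2 = sum(e * e for e in eps) / T                 # maximizes the normal ln L
    # Score of l_t with respect to sigma^2: (1/2)(1/s2)(e^2/s2 - 1).
    scores = [0.5 / s2 * (e * e / s2 - 1.0) for e in eps]
    H = T / (2.0 * s2 * s2)                          # minus the expected Hessian
    F = sum(s * s for s in scores)                   # BHHH outer product
    naive = 1.0 / H                                  # valid only under normality
    sandwich = F / (H * H)                           # H^{-1} F H^{-1}
    return s2, naive, sandwich

# Laplace(0, 1) draws: the difference of two unit exponentials (variance 2).
rng = random.Random(2024)
eps = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(20000)]

s2, naive_var, sandwich_var = pmle_sigma2_variances(eps)
print(round(s2, 3), naive_var, sandwich_var)
```

    In this scalar case the sandwich reduces to $(\hat\kappa - 1)\hat\sigma^4/T$, where $\hat\kappa$ is the sample counterpart to $\kappa$, so it coincides with the naive $2\hat\sigma^4/T$ only when $\hat\kappa = 3$, the normal value.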

    11.9 SUMMARY AND CONCLUSIONS

    This chapter has analyzed one form of the generalized regression model, the model of heteroscedasticity. We first considered least squares estimation. The primary result for
    34 White (1982a) gives some additional requirements for the true underlying density of $\varepsilon_t$. Gourieroux, Monfort, and Trognon (1984) also consider the issue. Under the assumptions given, the expectations of the matrices in (18-27) and (18-32) remain the same as under normality. The consistency and asymptotic normality of the pseudo-MLE can be argued under the logic of GMM estimators.

    35 McCullough and Renfro (1999) examined several approaches to computing an appropriate asymptotic covariance matrix for the GARCH model, including the conventional Hessian and BHHH estimators and three sandwich-style estimators, including the one suggested above and two based on the method of scoring suggested by Bollerslev and Wooldridge (1992). None stand out as obviously better, but the Bollerslev and QMLE estimator based on an actual Hessian appears to perform well in Monte Carlo studies.


    least squares estimation is that it retains its consistency and asymptotic normality, but some correction to the estimated asymptotic covariance matrix may be needed for appropriate inference. The White estimator is the standard approach for this computation. These two results also constitute the GMM estimator for this model. After examining some general tests for heteroscedasticity, we then narrowed the model to some specific parametric forms, and considered weighted (generalized) least squares and maximum likelihood estimation. If the form of the heteroscedasticity is known but involves unknown parameters, then it remains uncertain whether FGLS corrections are better than OLS. Asymptotically, the comparison is clear, but in small or moderately sized samples, the additional variation incorporated by the estimated variance parameters may offset the gains to GLS. The final section of this chapter examined a model of stochastic volatility, the GARCH model. This model has proved especially useful for analyzing financial data such as exchange rates, inflation, and market returns.

    Key Terms and Concepts
    • ARCH model
    • ARCH-in-mean
    • Breusch-Pagan test
    • Double-length regression
    • Efficient two-step estimator
    • GARCH model
    • Generalized least squares
    • Generalized sum of squares
    • GMM estimator
    • Goldfeld-Quandt test
    • Groupwise heteroscedasticity
    • Heteroscedasticity
    • Lagrange multiplier test
    • Likelihood ratio test
    • Maximum likelihood estimators
    • Model based test
    • Moving average
    • Multiplicative heteroscedasticity
    • Nonconstructive test
    • Residual based test
    • Robust estimator
    • Robustness to unknown heteroscedasticity
    • Stationarity condition
    • Stochastic volatility
    • Two-step estimator
    • Wald test
    • Weighted least squares
    • White estimator
    • White's test

    Exercises

    1. Suppose that the regression model is $y_i = \mu + \varepsilon_i$, where $E[\varepsilon_i \mid x_i] = 0$, $\text{Cov}[\varepsilon_i, \varepsilon_j \mid x_i, x_j] = 0$ for $i \neq j$, but $\text{Var}[\varepsilon_i \mid x_i] = \sigma^2 x_i^2$, $x_i > 0$.
       a. Given a sample of observations on $y_i$ and $x_i$, what is the most efficient estimator of $\mu$? What is its variance?
       b. What is the OLS estimator of $\mu$, and what is the variance of the ordinary least squares estimator?
       c. Prove that the estimator in part a is at least as efficient as the estimator in part b.

    2. For the model in the previous exercise, what is the probability limit of $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$? Note that $s^2$ is the least squares estimator of the residual variance. It is also $n$ times the conventional estimator of the variance of the OLS estimator,

    $$\text{Est.Var}[\bar{y}] = s^2(X'X)^{-1} = \frac{s^2}{n}.$$

       How does this equation compare with the true value you found in part b of Exercise 1? Does the conventional estimator produce the correct estimate of the true asymptotic variance of the least squares estimator?


    3. Two samples of 50 observations each produce the following moment matrices. (In each case, $X$ is a constant and one variable.)

       Sample 1: $X'X = \begin{bmatrix} 50 & 300 \\ 300 & 2100 \end{bmatrix}$, $\quad y'X = [300 \;\; 2000]$, $\quad y'y = 2100$.

       Sample 2: $X'X = \begin{bmatrix} 50 & 300 \\ 300 & 2100 \end{bmatrix}$, $\quad y'X = [300 \;\; 2200]$, $\quad y'y = 2800$.

       a. Compute the least squares regression coefficients and the residual variances $s^2$ for each data set. Compute the $R^2$ for each regression.
       b. Compute the OLS estimate of the coefficient vector assuming that the coefficients and disturbance variance are the same in the two regressions. Also compute the estimate of the asymptotic covariance matrix of the estimate.
       c. Test the hypothesis that the variances in the two regressions are the same without assuming that the coefficients are the same in the two regressions.
       d. Compute the two-step FGLS estimator of the coefficients in the regressions, assuming that the constant and slope are the same in both regressions. Compute the estimate of the covariance matrix and compare it with the result of part b.

    4. Using the data in Exercise 3, use the Oberhofer-Kmenta method to compute the maximum likelihood estimate of the common coefficient vector.

    5. This exercise is based on the following data set.
    50 Observations on Y 1 42 0 26 0 62 1 26 5 51 0 35 1 65 0 63 1 78 0 80 0 02 0 18 0 67 0 74 0 61 1 77 1 87 2 01 2 75 4 87 7 01 0 15 15 22 0 48 1 48 0 34 1 25 1 32 0 33 1 62 0 70 1 87 2 32 2 92 3 45 1 26 2 10 5 94 26 14 3 41 1 47 1 24 5 08 2 21 7 39 5 45 1 48 0 69 1 49 6 87 0 79 1 31 6 66 1 91 1 00 0 90 1 93 1 52 1 78 0 16 1 61 1 97 2 04 2 62 1 11 2 11 23 17 3 00 5 16 1 66 3 82 2 52 6 31 4 71

    50 Observations on X1 0 77 0 35 0 22 0 16 1 99 0 39 0 67 0 79 1 25 1 06 0 70 0 17 0 68 0 77 0 12 0 60 0 17 1 02 0 19 2 07 1 51 1 50 1 42 2 23 0 23 1 04 0 66 0 79 0 33 0 40 0 28 1 06 0 86 0 48 1 13 0 58 0 66 2 04 1 90 0 15 0 41 1 18 0 51 0 18

    50 Observations on X2 0 32 1 56 4 38 1 94 0 88 2 02 2 88 0 37 2 16 2 09 1 53 1 91 1 28 1 20 0 30 0 46 2 70 2 72 0 26 0 17 0 19 1 77 0 70 1 34 7 82 0 39 1 89 1 55 2 10 1 15 1 54 1 85

       a. Compute the ordinary least squares regression of Y on a constant, X1, and X2. Be sure to compute the conventional estimator of the asymptotic covariance matrix of the OLS estimator as well.


       b. Compute the White estimator of the appropriate asymptotic covariance matrix for the OLS estimates.
       c. Test for the presence of heteroscedasticity using White's general test. Do your results suggest the nature of the heteroscedasticity?
       d. Use the Breusch-Pagan Lagrange multiplier test to test for heteroscedasticity.
       e. Sort the data keying on X1 and use the Goldfeld-Quandt test to test for heteroscedasticity. Repeat the procedure, using X2. What do you find?

    6. Using the data of Exercise 5, reestimate the parameters, using a two-step FGLS estimator. Try the estimator used in Example 11.4.

    7. For the model in Exercise 1, suppose that $\varepsilon$ is normally distributed, with mean zero and variance $\sigma^2[1 + (\gamma x)^2]$. Show that $\sigma^2$ and $\gamma^2$ can be consistently estimated by a regression of the least squares residuals on a constant and $x^2$. Is this estimator efficient?

    8. Derive the log-likelihood function, first-order conditions for maximization, and information matrix for the model $y_i = x_i'\beta + \varepsilon_i$, $\varepsilon_i \sim N[0, \sigma^2(\gamma' z_i)^2]$.

    9. Suppose that $y$ has the pdf $f(y \mid x) = (1/\beta'x)\,e^{-y/(\beta'x)}$, $y > 0$. Then $E[y \mid x] = \beta'x$ and $\text{Var}[y \mid x] = (\beta'x)^2$. For this model, prove that GLS and MLE are the same, even though this distribution involves the same parameters in the conditional mean function and the disturbance variance.

    10. In the discussion of Harvey's model in Section 11.7, it is noted that the initial estimator of $\gamma_1$, the constant term in the regression of $\ln e_i^2$ on a constant and $z_i$, is inconsistent by the amount 1.2704. Harvey points out that if the purpose of this initial regression is only to obtain starting values for the iterations, then the correction is not necessary. Explain why this statement would be true.

    11. (This exercise requires appropriate computer software. The computations required can be done with RATS, EViews, Stata, TSP, LIMDEP, and a variety of other software using only preprogrammed procedures.) Quarterly data on the consumer price index for 1950.1 to 2000.4 are given in Appendix Table F5.1. Use these data to fit the model proposed by Engle and Kraft (1983). The model is

    $$\pi_t = \beta_0 + \beta_1\pi_{t-1} + \beta_2\pi_{t-2} + \beta_3\pi_{t-3} + \beta_4\pi_{t-4} + \varepsilon_t,$$

        where $\pi_t = 100\ln[p_t/p_{t-1}]$ and $p_t$ is the price index.
        a. Fit the model by ordinary least squares, then use the tests suggested in the text to see if ARCH effects appear to be present.
        b. The authors fit an ARCH(8) model with declining weights,

    $$\sigma_t^2 = \alpha_0 + \sum_{i=1}^{8}\left(\frac{9 - i}{36}\right)\varepsilon_{t-i}^2.$$

           Fit this model. If the software does not allow constraints on the coefficients, you can still do this with a two-step least squares procedure, using the least squares residuals from the first step. What do you find?
        c. Bollerslev (1986) recomputed this model as a GARCH(1, 1). Use the GARCH(1, 1) form and refit your model.


    12 SERIAL CORRELATION

    12.1 INTRODUCTION

    Time-series data often display autocorrelation, or serial correlation of the disturbances across periods. Consider, for example, the plot of the least squares residuals in the following example.

    Example 12.1 Money Demand Equation

    Table F5.1 contains quarterly data from 1950.1 to 2000.4 on the U.S. money stock (M1) and output (real GDP) and the price level (CPI_U). Consider an (extremely) simple model of money demand,¹

    $$\ln M1_t = \beta_1 + \beta_2\ln GDP_t + \beta_3\ln CPI_t + \varepsilon_t.$$

    A plot of the least squares residuals is shown in Figure 12.1. The pattern in the residuals suggests that knowledge of the sign of a residual in one period is a good indicator of the sign of the residual in the next period. This knowledge suggests that the effect of a given disturbance is carried, at least in part, across periods. This sort of "memory" in the disturbances creates the long, slow swings from positive values to negative ones that are evident in Figure 12.1. One might argue that this pattern is the result of an obviously naive model, but that is one of the important points in this discussion. Patterns such as this usually do not arise spontaneously; to a large extent, they are, indeed, a result of an incomplete or flawed model specification.

    One explanation for autocorrelation is that relevant factors omitted from the time-series regression, like those included, are correlated across periods. This fact may be due to serial correlation in factors that should be in the regression model. It is easy to see why this situation would arise. Example 12.2 shows an obvious case.

    Example 12.2 Autocorrelation Induced by Misspecification of the Model

    In Examples 2.3 and 7.6, we examined yearly time-series data on the U.S. gasoline market from 1960 to 1995. The evidence in the examples was convincing that a regression model of variation in $\ln(G/pop)$ should include, at a minimum, a constant, $\ln P_G$ and $\ln(income/pop)$. Other price variables and a time trend also provide significant explanatory power, but these two are a bare minimum. Moreover, we also found on the basis of a Chow test of structural change that apparently this market changed structurally after 1974. Figure 12.2 displays plots of four sets of least squares residuals. Parts (a) through (c) show clearly that as the specification of the regression is expanded, the autocorrelation in the "residuals" diminishes. Part (c) shows the effect of forcing the coefficients in the equation to be the same both before and after the structural shift. In part (d), the residuals in the two subperiods 1960 to 1974 and 1975 to 1995 are produced by separate unrestricted regressions. This latter set of residuals is almost nonautocorrelated. (Note also that the range of variation of the residuals falls as

    1 Since this chapter deals exclusively with time-series data, we shall use the index $t$ for observations and $T$ for the sample size throughout.


    FIGURE 12.1 Autocorrelated Residuals. [Plot of least squares residuals (vertical axis: Residual) by quarter (horizontal axis: Quarter, 1950 to 2002).]

    the model is improved, i.e., as its fit improves.) The full equation is

    $$\ln\frac{G_t}{pop_t} = \beta_1 + \beta_2\ln PG_t + \beta_3\ln\frac{I_t}{pop_t} + \beta_4\ln PNC_t + \beta_5\ln PUC_t + \beta_6\ln PPT_t + \beta_7\ln PN_t + \beta_8\ln PD_t + \beta_9\ln PS_t + \beta_{10}\,t + \varepsilon_t.$$

    Finally, we consider an example in which serial correlation is an anticipated part of the model.

    Example 12.3 Negative Autocorrelation in the Phillips Curve

    The Phillips curve [Phillips (1957)] has been one of the most intensively studied relationships in the macroeconomics literature. As originally proposed, the model specifies a negative relationship between wage inflation and unemployment in the United Kingdom over a period of 100 years. Recent research has documented a similar relationship between unemployment and price inflation. It is difficult to justify the model when cast in simple levels; labor market theories of the relationship rely on an uncomfortable proposition that markets persistently fall victim to money illusion, even when the inflation can be anticipated. Current research [e.g., Staiger et al. (1996)] has reformulated a short-run (disequilibrium) "expectations augmented Phillips curve" in terms of unexpected inflation and unemployment that deviates from a long-run equilibrium or "natural rate." The expectations-augmented Phillips curve can be written as

    $$\Delta p_t - E[\Delta p_t \mid \Psi_{t-1}] = \beta[u_t - u^*] + \varepsilon_t,$$

    where $\Delta p_t$ is the rate of inflation in year $t$, $E[\Delta p_t \mid \Psi_{t-1}]$ is the forecast of $\Delta p_t$ made in period $t - 1$ based on information available at time $t - 1$, $\Psi_{t-1}$, $u_t$ is the unemployment rate and $u^*$ is the natural, or equilibrium rate. (Whether $u^*$ can be treated as an unchanging parameter, as we are about to do, is controversial.) By construction, $[u_t - u^*]$ is disequilibrium, or cyclical unemployment. In this formulation, $\varepsilon_t$ would be the supply shock (i.e., the stimulus that produces the disequilibrium situation). To complete the model, we require a model for the expected inflation. We will revisit this in some detail in Example 19.2. For the present, we'll

    FIGURE 12.2 Residual Plots for Misspecified Models. [Four panels of residuals plotted against Year, 1959 to 1999; bars mark the mean residual and ±2 s(e): (a) Regression on log PG; (b) Regression on log PG, log I/Pop; (c) Full Regression; (d) Full Regression, Separate Coefficients.]

    assume that economic agents are rank empiricists. The forecast of next year's inflation is simply this year's value. This produces the estimating equation

    $$\Delta p_t - \Delta p_{t-1} = \beta_1 + \beta_2 u_t + \varepsilon_t,$$

    where $\beta_2 = \beta$ and $\beta_1 = -\beta u^*$. Note that there is an implied estimate of the natural rate of unemployment embedded in the equation. After estimation, $u^*$ can be estimated by $-b_1/b_2$. The equation was estimated with the 1950.1 to 2000.4 data in Table F5.1 that were used in Example 12.1 (minus two quarters for the change in the rate of inflation). Least squares estimates (with standard errors in parentheses) are as follows:

    $$\Delta p_t - \Delta p_{t-1} = \underset{(0.7405)}{0.49189} - \underset{(0.1257)}{0.090136}\,u_t + e_t, \qquad R^2 = 0.002561, \quad T = 201.$$

    The implied estimate of the natural rate of unemployment is 5.46 percent, which is in line with other recent estimates. The estimated asymptotic covariance of $b_1$ and $b_2$ is $-0.08973$. Using the delta method, we obtain a standard error of 2.2062 for this estimate, so a confidence interval for the natural rate is 5.46 percent ± 1.96(2.21 percent) = (1.13 percent, 9.79 percent) (which seems fairly wide, but, again, whether it is reasonable to treat this as a parameter is at least questionable). The regression of the least squares residuals on their past values gives a slope of $-0.4263$ with a highly significant $t$ ratio of $-6.725$. We thus conclude that the


    FIGURE 12.3 Negatively Autocorrelated Residuals. [Phillips curve deviations from expected inflation: residuals (vertical axis: Residual) by quarter (horizontal axis: Quarter, 1950 to 2002).]

    residuals (and, apparently, the disturbances) in this model are highly negatively autocorrelated. This is consistent with the striking pattern in Figure 12.3.
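
    The delta method calculation for the natural rate in Example 12.3 can be reproduced from the reported estimates. The sketch below takes the coefficient on unemployment and the covariance of $b_1$ and $b_2$ to be negative, which is what the reported natural rate of 5.46 percent and standard error of 2.2062 imply; the variable names are illustrative.

```python
import math

# Reported least squares results from Example 12.3.
b1, b2 = 0.49189, -0.090136        # constant and coefficient on unemployment
se1, se2 = 0.7405, 0.1257          # standard errors of b1 and b2
cov12 = -0.08973                   # estimated asymptotic covariance of b1, b2

# Natural rate implied by beta1 = -beta * u_star and beta2 = beta.
u_star = -b1 / b2

# Gradient of u_star = -b1/b2 with respect to (b1, b2).
g1 = -1.0 / b2
g2 = b1 / b2 ** 2

# Delta method: Var[u_star] is approximately g' V g.
var_u = g1 * g1 * se1 ** 2 + g2 * g2 * se2 ** 2 + 2.0 * g1 * g2 * cov12
se_u = math.sqrt(var_u)

lower, upper = u_star - 1.96 * se_u, u_star + 1.96 * se_u
print(round(u_star, 2), round(se_u, 4), round(lower, 2), round(upper, 2))
```

    The two variance terms are each far larger than the final variance; it is the negative covariance term that brings the total down to roughly 4.87, which is why the sign of the covariance matters here.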

    The problems for estimation and inference caused by autocorrelation are similar to (although, unfortunately, more involved than) those caused by heteroscedasticity. As before, least squares is inefficient, and inference based on the least squares estimates is adversely affected. Depending on the underlying process, however, GLS and FGLS estimators can be devised that circumvent these problems. There is one qualitative difference to be noted. In Chapter 11, we examined models in which the generalized regression model can be viewed as an extension of the regression model to the conditional second moment of the dependent variable. In the case of autocorrelation, the phenomenon arises in almost all cases from a misspecification of the model. Views differ on how one should react to this failure of the classical assumptions, from a pragmatic one that treats it as another "problem" in the data to an orthodox methodological view that it represents a major specification issue; see, for example, "A Simple Message to Autocorrelation Correctors: Don't" [Mizon (1995)]. We should emphasize that the models we shall examine here are quite far removed from the classical regression. The exact or small-sample properties of the estimators are rarely known, and only their asymptotic properties have been derived.

    12.2 THE ANALYSIS OF TIME-SERIES DATA

    The treatment in this chapter will be the first structured analysis of time-series data in the text. (We had a brief encounter in Section 5.3, where we established some conditions


    under which moments of time-series data would converge.) Time-series analysis requires some revision of the interpretation of both data generation and sampling that we have maintained thus far. A time-series model will typically describe the path of a variable $y_t$ in terms of contemporaneous (and perhaps lagged) factors $x_t$, disturbances (innovations) $\varepsilon_t$, and its own past, $y_{t-1}, \ldots$ For example,

    $$y_t = \beta_1 + \beta_2 x_t + \beta_3 y_{t-1} + \varepsilon_t.$$

    The time series is a single occurrence of a random event. For example, the quarterly series on real output in the United States from 1950 to 2000 that we examined in Example 12.1 is a single realization of a process, $GDP_t$. The entire history over this period constitutes a realization of the process. At least in economics, the process could not be repeated. There is no counterpart to repeated sampling in a cross section or replication of an experiment involving a time-series process in physics or engineering. Nonetheless, were circumstances different at the end of World War II, the observed history could have been different. In principle, a completely different realization of the entire series might have occurred. The sequence of observations, $\{y_t\}_{t=-\infty}^{t=\infty}$, is a time-series process, which is characterized by its time ordering and its systematic correlation between observations in the sequence. The signature characteristic of a time-series process is that empirically, the data generating mechanism produces exactly one realization of the sequence. Statistical results based on sampling characteristics concern not random sampling from a population, but from distributions of statistics constructed from sets of observations taken from this realization in a time window, $t = 1, \ldots, T$. Asymptotic distribution theory in this context concerns behavior of statistics constructed from an increasingly long window in this sequence. The properties of $y_t$ as a random variable in a cross section are straightforward and are conveniently summarized in a statement about its mean and variance or the probability distribution generating $y_t$. The statement is less obvious here. It is common to assume that innovations are generated independently from one period to the next, with the familiar assumptions

    $$E[\varepsilon_t] = 0, \qquad \text{Var}[\varepsilon_t] = \sigma^2, \qquad \text{and} \qquad \text{Cov}[\varepsilon_t, \varepsilon_s] = 0 \text{ for } t \neq s.$$

    In the current context, this distribution of $\varepsilon_t$ is said to be covariance stationary or weakly stationary. Thus, although the substantive notion of "random sampling" must be extended for the time series $\varepsilon_t$, the mathematical results based on that notion apply here. It can be said, for example, that $\varepsilon_t$ is generated by a time-series process whose mean and variance are not changing over time. As such, by the method we will discuss in this chapter, we could, at least in principle, obtain sample information and use it to characterize the distribution of $\varepsilon_t$. Could the same be said of $y_t$? There is an obvious difference between the series $\varepsilon_t$ and $y_t$; observations on $y_t$ at different points in time are necessarily correlated. Suppose that the $y_t$ series is weakly stationary and that, for


    the moment, $\beta_2 = 0$. Then we could say that

    $$E[y_t] = \beta_1 + \beta_3 E[y_{t-1}] + E[\varepsilon_t] = \beta_1/(1 - \beta_3)$$

    and

    $$\text{Var}[y_t] = \beta_3^2\,\text{Var}[y_{t-1}] + \text{Var}[\varepsilon_t],$$

    or

    $$\gamma_0 = \beta_3^2\gamma_0 + \sigma^2,$$

    so that

    $$\gamma_0 = \frac{\sigma^2}{1 - \beta_3^2}.$$

    Thus, $\gamma_0$, the variance of $y_t$, is a fixed characteristic of the process generating $y_t$. Note how the stationarity assumption, which apparently includes $|\beta_3| < 1$, has been used. The assumption that $|\beta_3| < 1$ is needed to ensure a finite and positive variance.² Finally, the same results can be obtained for nonzero $\beta_2$ if it is further assumed that $x_t$ is a weakly stationary series.³ Alternatively, consider simply repeated substitution of lagged values into the expression for $y_t$:

    $$y_t = \beta_1 + \beta_3(\beta_1 + \beta_3 y_{t-2} + \varepsilon_{t-1}) + \varepsilon_t \tag{12-1}$$

    and so on. We see that, in fact, the current $y_t$ is an accumulation of the entire history of the innovations, $\varepsilon_t$. So if we wish to characterize the distribution of $y_t$, then we might do so in terms of sums of random variables. By continuing to substitute for $y_{t-2}$, then $y_{t-3}, \ldots$ in (12-1), we obtain an explicit representation of this idea,

    $$y_t = \sum_{i=0}^{\infty}\beta_3^i(\beta_1 + \varepsilon_{t-i}).$$
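
    The two moment results above, $E[y_t] = \beta_1/(1 - \beta_3)$ and $\gamma_0 = \sigma^2/(1 - \beta_3^2)$, can be checked against a long simulated realization with $\beta_2 = 0$. The sketch below is illustrative only: the parameter values, the seed, the burn-in length, and the function name are assumptions, and the innovations are drawn as independent normals.

```python
import random

def simulate_ar1(T, b1, b3, sigma, seed=0, burn=500):
    """Generate y_t = b1 + b3*y_{t-1} + eps_t, discarding a burn-in period."""
    rng = random.Random(seed)
    y = b1 / (1.0 - b3)                  # start at the stationary mean
    out = []
    for t in range(T + burn):
        y = b1 + b3 * y + rng.gauss(0.0, sigma)
        if t >= burn:
            out.append(y)
    return out

b1, b3, sigma = 1.0, 0.5, 1.0            # |b3| < 1, so the process is stationary
y = simulate_ar1(100000, b1, b3, sigma, seed=123)

sample_mean = sum(y) / len(y)
sample_var = sum((v - sample_mean) ** 2 for v in y) / len(y)

mean_theory = b1 / (1.0 - b3)            # beta1 / (1 - beta3)
var_theory = sigma ** 2 / (1.0 - b3 ** 2)  # sigma^2 / (1 - beta3^2)

print(round(sample_mean, 3), mean_theory)
print(round(sample_var, 3), round(var_theory, 3))
```

    Setting `b3` at or above one in absolute value removes the fixed mean and variance that the sample moments are being compared with, which is the role the stationarity assumption plays in the derivation.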

    Do sums that reach back into infinite past make any sense? We might view the process as having begun generating data at some remote, effectively "infinite" past. As long as distant observations become progressively less important, the extension to an infinite past is merely a mathematical convenience. The diminishing importance of past observations is implied by $|\beta_3| < 1$. Notice that, not coincidentally, this requirement is the same as that needed to solve for $\gamma_0$ in the preceding paragraphs. A second possibility is to assume that the observation of this time series begins at some time 0 [with $(x_0, \varepsilon_0)$ called the initial conditions], by which time the underlying process has reached a state such that the mean and variance of $y_t$ are not (or are no longer) changing over time. The mathematics are slightly different, but we are led to the same characterization of the random process generating $y_t$. In fact, the same weak stationarity assumption ensures both of them. Except in very special cases, we would expect all the elements in the $T$ component random vector $(y_1, \ldots, y_T)$ to be correlated. In this instance, said correlation is called
    2 The 3 See

    current literature in macroeconometrics and time series analysis is dominated by analysis of cases in which 3 1 or counterparts in different models We will return to this subject in Chapter 20 Section 12 4 1 on the stationarity assumption


    "autocorrelation." As such, the results pertaining to estimation with independent or uncorrelated observations that we used in the previous chapters are no longer usable. In point of fact, we have a sample of but one observation on the multivariate random variable $[y_t, t = 1, \ldots, T]$. There is a counterpart to the cross-sectional notion of parameter estimation, but only under assumptions (e.g., weak stationarity) that establish that parameters in the familiar sense even exist. Even with stationarity, it will emerge that for estimation and inference, none of our earlier finite sample results are usable. Consistency and asymptotic normality of estimators are somewhat more difficult to establish in time-series settings because results that require independent observations, such as the central limit theorems, are no longer usable. Nonetheless, counterparts to our earlier results have been established for most of the estimation problems we consider here and in Chapters 19 and 20.

    12.3 DISTURBANCE PROCESSES

    The preceding section has introduced a bit of the vocabulary and aspects of time series specification. In order to obtain the theoretical results we need to draw some conclusions about autocorrelation and add some details to that discussion.

    12.3.1 CHARACTERISTICS OF DISTURBANCE PROCESSES

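    In the usual time-series setting, the disturbances are assumed to be homoscedastic but correlated across observations, so that

    $$E[\varepsilon\varepsilon' \mid X] = \sigma^2 \Omega,$$

    where $\sigma^2\Omega$ is a full, positive definite matrix with a constant $\sigma^2 = \operatorname{Var}[\varepsilon_t \mid X]$ on the diagonal. As will be clear in the following discussion, we shall also assume that $\Omega_{ts}$ is a function of $|t - s|$, but not of $t$ or $s$ alone, which is a stationarity assumption. (See the preceding section.) It implies that the covariance between observations $t$ and $s$ is a function only of $|t - s|$, the distance apart in time of the observations. We define the autocovariances:

    $$\operatorname{Cov}[\varepsilon_t, \varepsilon_{t-s} \mid X] = \operatorname{Cov}[\varepsilon_{t+s}, \varepsilon_t \mid X] = \sigma^2 \Omega_{t,t-s} = \gamma_s = \gamma_{-s}.$$

    Note that $\sigma^2\Omega_{tt} = \gamma_0$. The correlation between $\varepsilon_t$ and $\varepsilon_{t-s}$ is their autocorrelation,

    $$\operatorname{Corr}[\varepsilon_t, \varepsilon_{t-s} \mid X] = \frac{\operatorname{Cov}[\varepsilon_t, \varepsilon_{t-s} \mid X]}{\sqrt{\operatorname{Var}[\varepsilon_t \mid X]\operatorname{Var}[\varepsilon_{t-s} \mid X]}} = \frac{\gamma_s}{\gamma_0} = \rho_s = \rho_{-s}.$$

    We can then write

    $$E[\varepsilon\varepsilon' \mid X] = \Gamma = \gamma_0 R,$$

    where $\Gamma$ is an autocovariance matrix and $R$ is an autocorrelation matrix: the $ts$ element is an autocorrelation coefficient

    $$\rho_{ts} = \frac{\gamma_{|t-s|}}{\gamma_0}.$$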


    (Note that the matrix $\Gamma = \gamma_0 R$ is the same as $\sigma^2\Omega$. The name change conforms to standard usage in the literature.) We will usually use the abbreviation $\rho_s$ to denote the autocorrelation between observations $s$ periods apart. Different types of processes imply different patterns in $R$. For example, the most frequently analyzed process is a first-order autoregression or AR(1) process,

    $$\varepsilon_t = \rho \varepsilon_{t-1} + u_t,$$

    where $u_t$ is a stationary, nonautocorrelated ("white noise") process and $\rho$ is a parameter. We will verify later that for this process, $\rho_s = \rho^s$. Higher-order autoregressive processes of the form

    $$\varepsilon_t = \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_p \varepsilon_{t-p} + u_t$$

    imply more involved patterns, including, for some values of the parameters, cyclical behavior of the autocorrelations.⁴ Stationary autoregressions are structured so that the influence of a given disturbance fades as it recedes into the more distant past but vanishes only asymptotically. For example, for the AR(1), $\operatorname{Cov}[\varepsilon_t, \varepsilon_{t-s}]$ is never zero, but it does become negligible if $|\rho|$ is less than 1. Moving-average processes, conversely, have a short memory. For the MA(1) process,

    $$\varepsilon_t = u_t - \lambda u_{t-1},$$

    the memory in the process is only one period: $\gamma_0 = \sigma_u^2(1 + \lambda^2)$, $\gamma_1 = -\lambda\sigma_u^2$, but $\gamma_s = 0$ if $s > 1$.
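
    The MA(1) moments quoted above can be checked in a few lines. The following is a sketch of that derivation, assuming the MA(1) is written as $\varepsilon_t = u_t - \lambda u_{t-1}$ with $E[u_t] = 0$, $\operatorname{Var}[u_t] = \sigma_u^2$, and $\operatorname{Cov}[u_t, u_s] = 0$ for $t \neq s$; the implied first autocorrelation $\rho_1$ in the last line is added here for illustration.

```latex
\begin{aligned}
\gamma_0 &= \operatorname{Var}[u_t - \lambda u_{t-1}]
          = \sigma_u^2 + \lambda^2 \sigma_u^2
          = \sigma_u^2 (1 + \lambda^2), \\
\gamma_1 &= E[(u_t - \lambda u_{t-1})(u_{t-1} - \lambda u_{t-2})]
          = -\lambda E[u_{t-1}^2]
          = -\lambda \sigma_u^2, \\
\gamma_s &= E[(u_t - \lambda u_{t-1})(u_{t-s} - \lambda u_{t-s-1})] = 0
          \quad \text{for } s > 1, \\
\rho_1 &= \frac{\gamma_1}{\gamma_0} = \frac{-\lambda}{1 + \lambda^2}.
\end{aligned}
```

    For $s > 1$ the two factors share no common $u$, which is why every cross product has zero expectation and the memory of the process stops after one period.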

    12.3.2 AR(1) DISTURBANCES

    Time-series processes such as the ones listed here can be characterized by their order, the values of their parameters, and the behavior of their autocorrelations.⁵ We shall consider various forms at different points. The received empirical literature is overwhelmingly dominated by the AR(1) model, which is partly a matter of convenience. Processes more involved than this model are usually extremely difficult to analyze. There is, however, a more practical reason. It is very optimistic to expect to know precisely the correct form of the appropriate model for the disturbance in any given situation. The first-order autoregression has withstood the test of time and experimentation as a reasonable model for underlying processes that probably, in truth, are impenetrably complex. AR(1) works as a first pass; higher order models are often constructed as a refinement, as in the example below. The first-order autoregressive disturbance, or AR(1) process, is represented in the autoregressive form as

    $$\varepsilon_t = \rho \varepsilon_{t-1} + u_t \tag{12-2}$$

    where

    $$E[u_t] = 0, \qquad E[u_t^2] = \sigma_u^2,$$

    ⁴ This model is considered in more detail in Chapter 20.
    ⁵ See Box and Jenkins (1984) for an authoritative study.


    and

    $$\operatorname{Cov}[u_t, u_s] = 0 \quad \text{if } t \neq s.$$

    By repeated substitution, we have

    $$\varepsilon_t = u_t + \rho u_{t-1} + \rho^2 u_{t-2} + \cdots. \tag{12-3}$$

    From the preceding moving-average form, it is evident that each disturbance $\varepsilon_t$ embodies the entire past history of the $u$'s, with the most recent observations receiving greater weight than those in the distant past. Depending on the sign of $\rho$, the series will exhibit clusters of positive and then negative observations or, if $\rho$ is negative, regular oscillations of sign (as in Example 12.3). Since the successive values of $u_t$ are uncorrelated, the variance of $\varepsilon_t$ is the variance of the right-hand side of (12-3):

    $$\operatorname{Var}[\varepsilon_t] = \sigma_u^2 + \rho^2\sigma_u^2 + \rho^4\sigma_u^2 + \cdots. \tag{12-4}$$

    To proceed, a restriction must be placed on $\rho$,

    $$|\rho| < 1, \tag{12-5}$$

    because otherwise, the right-hand side of (12-4) will become infinite. This result is the stationarity assumption discussed earlier. With (12-5), which implies that $\lim_{s \to \infty} \rho^s = 0$, $E[\varepsilon_t] = 0$ and

    $$\operatorname{Var}[\varepsilon_t] = \frac{\sigma_u^2}{1 - \rho^2} = \sigma_\varepsilon^2. \tag{12-6}$$

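    With the stationarity assumption, there is an easier way to obtain the variance:

    $$\operatorname{Var}[\varepsilon_t] = \rho^2 \operatorname{Var}[\varepsilon_{t-1}] + \sigma_u^2$$

    as $\operatorname{Cov}[u_t, \varepsilon_s] = 0$ if $t > s$. With stationarity, $\operatorname{Var}[\varepsilon_{t-1}] = \operatorname{Var}[\varepsilon_t]$, which implies (12-6). Proceeding in the same fashion,

    $$\operatorname{Cov}[\varepsilon_t, \varepsilon_{t-1}] = E[\varepsilon_t \varepsilon_{t-1}] = E[\varepsilon_{t-1}(\rho\varepsilon_{t-1} + u_t)] = \rho \operatorname{Var}[\varepsilon_{t-1}] = \frac{\rho\sigma_u^2}{1 - \rho^2}. \tag{12-7}$$

    By repeated substitution in (12-2), we see that for any $s$,

    $$\varepsilon_t = \rho^s \varepsilon_{t-s} + \sum_{i=0}^{s-1} \rho^i u_{t-i}$$

    (e.g., $\varepsilon_t = \rho^3\varepsilon_{t-3} + \rho^2 u_{t-2} + \rho u_{t-1} + u_t$). Therefore, since $\varepsilon_s$ is not correlated with any $u_t$ for which $t > s$ (i.e., any subsequent $u_t$), it follows that

    $$\operatorname{Cov}[\varepsilon_t, \varepsilon_{t-s}] = E[\varepsilon_t \varepsilon_{t-s}] = \frac{\rho^s \sigma_u^2}{1 - \rho^2}. \tag{12-8}$$

    Dividing by $\gamma_0 = \sigma_u^2/(1 - \rho^2)$ provides the autocorrelations:

    $$\operatorname{Corr}[\varepsilon_t, \varepsilon_{t-s}] = \rho_s = \rho^s. \tag{12-9}$$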
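
    The variance in (12-6) and the geometric decay of the autocorrelations in (12-9) can be checked by simulation. The sketch below is only an illustration: the function names `simulate_ar1` and `sample_autocorr`, the parameter values, the normal innovations, and the burn-in length are all assumptions made for this example.

```python
import numpy as np

def simulate_ar1(rho, sigma_u, T, burn_in=500, seed=0):
    """Simulate eps_t = rho * eps_{t-1} + u_t with white-noise u_t."""
    rng = np.random.default_rng(seed)
    n = T + burn_in
    u = rng.normal(0.0, sigma_u, size=n)
    eps = np.empty(n)
    eps[0] = u[0]
    for t in range(1, n):
        eps[t] = rho * eps[t - 1] + u[t]
    # Drop the burn-in so the arbitrary starting value has no visible effect.
    return eps[burn_in:]

def sample_autocorr(x, s):
    """Sample autocorrelation at lag s >= 1, in deviations from the mean."""
    x = x - x.mean()
    return float(np.dot(x[s:], x[:-s]) / np.dot(x, x))

rho, sigma_u = 0.7, 1.0
eps = simulate_ar1(rho, sigma_u, T=200_000)

theory_var = sigma_u**2 / (1.0 - rho**2)   # right-hand side of (12-6)
print("variance: sample", eps.var(), "theory", theory_var)
for s in range(1, 5):
    # (12-9): the lag-s autocorrelation should be close to rho**s
    print("lag", s, "sample", sample_autocorr(eps, s), "theory", rho**s)
```

    With a negative `rho` the same code shows autocorrelations that alternate in sign while shrinking in absolute value.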

    With the stationarity assumption, the autocorrelations fade over time. Depending on the sign of $\rho$, they will either be declining in geometric progression or alternating in


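    sign if $\rho$ is negative. Collecting terms, we have

    $$\sigma^2\Omega = \frac{\sigma_u^2}{1 - \rho^2}
    \begin{bmatrix}
    1 & \rho & \rho^2 & \rho^3 & \cdots & \rho^{T-1} \\
    \rho & 1 & \rho & \rho^2 & \cdots & \rho^{T-2} \\
    \rho^2 & \rho & 1 & \rho & \cdots & \rho^{T-3} \\
    \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
    \rho^{T-1} & \rho^{T-2} & \rho^{T-3} & \cdots & \rho & 1
    \end{bmatrix}. \tag{12-10}$$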
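
    The matrix in (12-10) has $\rho^{|t-s|}$ in position $(t, s)$, scaled by $\sigma_u^2/(1 - \rho^2)$, so it can be built directly from the lag between each pair of observations. The following is a minimal sketch; the function name `ar1_covariance` and the example values are assumptions for illustration.

```python
import numpy as np

def ar1_covariance(rho, sigma_u, T):
    """Return the T x T matrix sigma^2 * Omega of (12-10) for AR(1) disturbances."""
    if abs(rho) >= 1:
        raise ValueError("stationarity requires |rho| < 1")
    idx = np.arange(T)
    lags = np.abs(idx[:, None] - idx[None, :])   # |t - s| for every pair (t, s)
    return sigma_u**2 / (1.0 - rho**2) * rho**lags

V = ar1_covariance(rho=0.5, sigma_u=1.0, T=4)
print(V)
```

    The result is symmetric with a constant diagonal equal to $\sigma_u^2/(1-\rho^2)$, which is the homoscedasticity assumed for the disturbances.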

    12.4 SOME ASYMPTOTIC RESULTS FOR ANALYZING TIME SERIES DATA

    Since $\Omega$ is not equal to $I$, the now familiar complications will arise in establishing the properties of estimators of $\beta$, in particular of the least squares estimator. The finite sample properties of the OLS and GLS estimators remain intact. Least squares will continue to be unbiased; the earlier general proof allows for autocorrelated disturbances. The Aitken theorem and the distributional results for normally distributed disturbances can still be established conditionally on $X$. (However, even these will be complicated when $X$ contains lagged values of the dependent variable.) But, finite sample properties are of very limited usefulness in time-series contexts. Nearly all that can be said about estimators involving time-series data is based on their asymptotic properties. As we saw in our analysis of heteroscedasticity, whether least squares is consistent or not depends on the matrices

    $$Q_T = (1/T) X'X \quad \text{and} \quad Q_T^* = (1/T) X'\Omega X.$$

    In our earlier analyses, we were able to argue for convergence of $Q_T$ to a positive definite matrix of constants, $Q$, by invoking laws of large numbers. But, these theorems assume that the observations in the sums are independent, which, as suggested in Section 12.1, is surely not the case here. Thus, we require a different tool for this result. We can expand the matrix $Q_T^*$ as

    $$Q_T^* = \frac{1}{T} \sum_{t=1}^{T} \sum_{s=1}^{T} \rho_{ts}\, x_t x_s', \tag{12-11}$$

    where $x_t'$ and $x_s'$ are rows of $X$ and $\rho_{ts}$ is the autocorrelation between $\varepsilon_t$ and $\varepsilon_s$. Sufficient conditions for this matrix to converge are that $Q_T$ converge and that the correlations between disturbances die off reasonably rapidly as the observations become further apart in time. For example, if the disturbances follow the AR(1) process described earlier, then $\rho_{ts} = \rho^{|t-s|}$ and if $x_t$ is sufficiently well behaved, $Q_T^*$ will converge to a positive definite matrix $Q^*$ as $T \to \infty$.
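
    The expansion in (12-11) is just the matrix product $(1/T)X'\Omega X$ written out term by term. The sketch below checks that equivalence numerically for AR(1) autocorrelations $\rho^{|t-s|}$; the function names, the simulated regressors, and the value of `rho` are assumptions chosen for this illustration.

```python
import numpy as np

def q_star_double_sum(X, rho):
    """(1/T) * sum_t sum_s rho^|t-s| x_t x_s', computed term by term as in (12-11)."""
    T, K = X.shape
    Q = np.zeros((K, K))
    for t in range(T):
        for s in range(T):
            Q += rho ** abs(t - s) * np.outer(X[t], X[s])
    return Q / T

def q_star_matrix(X, rho):
    """The same quantity computed as (1/T) X' Omega X."""
    T = X.shape[0]
    idx = np.arange(T)
    Omega = rho ** np.abs(idx[:, None] - idx[None, :])
    return X.T @ Omega @ X / T

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
same = np.allclose(q_star_double_sum(X, 0.6), q_star_matrix(X, 0.6))
print(same)
```

    The double loop makes the role of the fading correlations visible: terms with large $|t - s|$ are multiplied by weights close to zero, which is what allows the average to settle down as $T$ grows.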


    Asymptotic normality of the least squares and GLS estimators will depend on the behavior of sums such as

    $$\sqrt{T}\, \bar{w}_T = \sqrt{T} \left( \frac{1}{T} \sum_{t=1}^{T} x_t \varepsilon_t \right) = \sqrt{T} \left( \frac{1}{T} X'\varepsilon \right).$$

    Asymptotic normality of least squares is difficult to establish for this general model. The central limit theorems we have relied on thus far do not extend to sums of dependent observations. The results of Amemiya (1985), Mann and Wald (1943), and Anderson (1971) do carry over to most of the familiar types of autocorrelated disturbances, including those that interest us here, so we shall ultimately conclude that ordinary least squares, GLS, and instrumental variables continue to be consistent and asymptotically normally distributed, and, in the case of OLS, inefficient. This section will provide a brief introduction to some of the underlying principles which are used to reach these conclusions.
    12.4.1 CONVERGENCE OF MOMENTS: THE ERGODIC THEOREM

    The discussion thus far has suggested (appropriately) that stationarity (or its absence) is an important characteristic of a process. The points at which we have encountered this notion concerned requirements that certain sums converge to finite values. In particular, for the AR(1) model, $\varepsilon_t = \rho\varepsilon_{t-1} + u_t$, in order for the variance of the process to be finite, we require $|\rho| < 1$, which is a sufficient condition. However, this result is only a byproduct. Stationarity (at least, the weak stationarity we have examined) is only a characteristic of the sequence of moments of a distribution.

    DEFINITION 12.1 Strong Stationarity. A time series process, $\{z_t\}_{t=-\infty}^{t=\infty}$, is strongly stationary, or "stationary," if the joint probability distribution of any set of $k$ observations in the sequence, $[z_t, z_{t+1}, \ldots, z_{t+k}]$, is the same regardless of the origin, $t$, in the time scale.

    For example, in (12-2), if we add $u_t \sim N[0, \sigma_u^2]$, then the resulting process $\{\varepsilon_t\}_{t=-\infty}^{t=\infty}$ can easily be shown to be strongly stationary.

    DEFINITION 12.2 Weak Stationarity. A time series process, $\{z_t\}_{t=-\infty}^{t=\infty}$, is weakly stationary (or covariance stationary) if $E[z_t]$ is finite and is the same for all $t$ and if the covariances between any two observations (labeled their autocovariance), $\operatorname{Cov}[z_t, z_{t-k}]$, is a finite function only of model parameters and their distance apart in time, $k$, but not of the absolute location of either observation on the time scale.

    Weak stationarity is implied by strong stationarity (given finite second moments), though it requires less since the distribution can, at least in principle, be changing on the time axis. The distinction


    is rarely necessary in applied work. In general, save for narrow theoretical examples, it will be difficult to come up with a process that is weakly but not strongly stationary. The reason for the distinction is that in much of our work, only weak stationarity is required, and, as always, when possible, econometricians will dispense with unnecessary assumptions. As we will discover shortly, stationarity is a crucial characteristic at this point in the analysis. If we are going to proceed to parameter estimation in this context, we will also require another characteristic of a time series, ergodicity. There are various ways to delineate this characteristic, none of them particularly intuitive. We borrow one definition from Davidson and MacKinnon (1993, p. 132) which comes close:

    DEFINITION 12.3 Ergodicity. A time series process, $\{z_t\}_{t=-\infty}^{t=\infty}$, is ergodic if for any two bounded functions that map vectors in the $a$ and $b$ dimensional real vector spaces to real scalars, $f: \mathbb{R}^a \to \mathbb{R}^1$ and $g: \mathbb{R}^b \to \mathbb{R}^1$,

    $$\lim_{k \to \infty} \left| E[f(z_t, z_{t+1}, \ldots, z_{t+a})\, g(z_{t+k}, z_{t+k+1}, \ldots, z_{t+k+b})] \right| = \left| E[f(z_t, z_{t+1}, \ldots, z_{t+a})] \right| \left| E[g(z_{t+k}, z_{t+k+1}, \ldots, z_{t+k+b})] \right|.$$

    The definition states essentially that if events are separated far enough in time, then they are "asymptotically independent." An implication is that in a time series, every observation will contain at least some unique information. Ergodicity is a crucial element of our theory of estimation. When a time series has this property (with stationarity), then we can consider estimation of parameters in a meaningful sense.⁶ The analysis relies heavily on the following theorem:

    THEOREM 12.1 The Ergodic Theorem. If $\{z_t\}_{t=-\infty}^{t=\infty}$ is a time-series process which is stationary and ergodic and $E[|z_t|]$ is a finite constant and $E[z_t] = \mu$, and if $\bar{z}_T = (1/T)\sum_{t=1}^{T} z_t$, then $\bar{z}_T \xrightarrow{a.s.} \mu$. Note that the convergence is almost surely, not in probability (which is implied) or in mean square (which is also implied). [See White (2001, p. 44) and Davidson and MacKinnon (1993, p. 133).]

    What we have in The Ergodic Theorem is, for sums of dependent observations, a counterpart to the laws of large numbers that we have used at many points in the preceding chapters. Note, once again, the need for this extension is that to this point, our laws of

    ⁶ Much of the analysis in later chapters will encounter nonstationary series, which are the focus of most of the current literature; tests for nonstationarity largely dominate the recent study in time series analysis. Ergodicity is a much more subtle and difficult concept. For any process which we will consider, ergodicity will have to be a given, at least at this level. A classic reference on the subject is Doob (1953). Another authoritative treatise is Billingsley (1979). White (2001) provides a concise analysis of many of these concepts as used in econometrics, and some useful commentary.


    large numbers have required sums of independent observations. But, in this context, by design, observations are distinctly not independent. In order for this result to be useful, we will require an extension.

    THEOREM 12.2 Ergodicity of Functions. If $\{z_t\}_{t=-\infty}^{t=\infty}$ is a time series process which is stationary and ergodic and if $y_t = f\{z_t\}$ is a measurable function in the probability space that defines $z_t$, then $y_t$ is also stationary and ergodic. Let $\{z_t\}_{t=-\infty}^{t=\infty}$ define a $K \times 1$ vector valued stochastic process: each element of the vector is an ergodic and stationary series and the characteristics of ergodicity and stationarity apply to the joint distribution of the elements of $\{z_t\}_{t=-\infty}^{t=\infty}$. Then The Ergodic Theorem applies to functions of $\{z_t\}_{t=-\infty}^{t=\infty}$. (See White (2001, pp. 44-45) for discussion.)

    Theorem 12.2 produces the results we need to characterize the least squares (and other) estimators. In particular, our minimal assumptions about the data are

    ASSUMPTION 12.1 Ergodic Data Series. In the regression model, $y_t = x_t'\beta + \varepsilon_t$, $[x_t, \varepsilon_t]_{t=-\infty}^{t=\infty}$ is a jointly stationary and ergodic process.

    By analyzing terms element by element we can use these results directly to assert that averages of $w_t = x_t\varepsilon_t$, $Q_t = x_t x_t'$, and $Q_t^* = \varepsilon_t^2 x_t x_t'$ will converge to their population counterparts, $0$, $Q$, and $Q^*$.

    12.4.2 CONVERGENCE TO NORMALITY: A CENTRAL LIMIT THEOREM

    In order to form a distribution theory for least squares, GLS, ML, and GMM, we will need a counterpart to the central limit theorem. In particular, we need to establish a large sample distribution theory for quantities of the form

    $$\sqrt{T} \left( \frac{1}{T} \sum_{t=1}^{T} x_t \varepsilon_t \right) = \sqrt{T}\, \bar{w}.$$

    As noted earlier, we cannot invoke the familiar central limit theorems (Lindeberg-Levy, Lindeberg-Feller, Liapounov) because the observations in the sum are not independent. But, with the assumptions already made, we do have an alternative result. Some needed preliminaries are as follows:

    DEFINITION 12.4 Martingale Sequence. A vector sequence $z_t$ is a martingale sequence if $E[z_t \mid z_{t-1}, z_{t-2}, \ldots] = z_{t-1}$.


    An important example of a martingale sequence is the random walk,

    $$z_t = z_{t-1} + u_t,$$

    where $\operatorname{Cov}[u_t, u_s] = 0$ for all $t \neq s$. Then

    $$E[z_t \mid z_{t-1}, z_{t-2}, \ldots] = E[z_{t-1} \mid z_{t-1}, z_{t-2}, \ldots] + E[u_t \mid z_{t-1}, z_{t-2}, \ldots] = z_{t-1} + 0 = z_{t-1}.$$

    DEFINITION 12.5 Martingale Difference Sequence. A vector sequence $z_t$ is a martingale difference sequence if $E[z_t \mid z_{t-1}, z_{t-2}, \ldots] = 0$.

    With Definition 12.5, we have the following broadly encompassing result:

    THEOREM 12.3 Martingale Difference Central Limit Theorem. If $z_t$ is a vector valued stationary and ergodic martingale difference sequence, with $E[z_t z_t'] = \Sigma$, where $\Sigma$ is a finite positive definite matrix, and if $\bar{z}_T = (1/T)\sum_{t=1}^{T} z_t$, then $\sqrt{T}\,\bar{z}_T \xrightarrow{d} N[0, \Sigma]$. (For discussion, see Davidson and MacKinnon (1993, Sections 4.7 and 4.8).)⁷

    Theorem 12.3 is a generalization of the Lindeberg-Levy Central Limit Theorem. It is not yet broad enough to cover cases of autocorrelation, but it does go beyond Lindeberg-Levy, for example, in extending to the GARCH model of Section 11.8. [Forms of the theorem which surpass Lindeberg-Feller (D.19) and Liapounov (Theorem D.20) by allowing for different variances at each time, $t$, appear in Ruud (2000, p. 479) and White (2001, p. 133). These variants extend beyond our requirements in this treatment.] But, looking ahead, this result encompasses what will be a very important application. Suppose in the classical linear regression model, $\{x_t\}_{t=-\infty}^{t=\infty}$ is a stationary and ergodic multivariate stochastic process and $\{\varepsilon_t\}_{t=-\infty}^{t=\infty}$ is an i.i.d. process, that is, not autocorrelated and not heteroscedastic. Then, this is the most general case of the classical model which still maintains the assumptions about $\varepsilon_t$ that we made in Chapter 2. In this case, the process $\{w_t\}_{t=-\infty}^{t=\infty} = \{x_t\varepsilon_t\}_{t=-\infty}^{t=\infty}$ is a martingale difference sequence, so that with sufficient assumptions on the moments of $x_t$ we could use this result to establish consistency and asymptotic normality of the least squares estimator. [See, e.g., Hamilton (1994, pp. 208-212).] We now consider a central limit theorem that is broad enough to include the case that interested us at the outset, stochastically dependent observations on $x_t$ and

    ⁷ For convenience, we are bypassing a step in this discussion: establishing multivariate normality requires that the result first be established for the marginal normal distribution of each component, then that every linear combination of the variables also be normally distributed. Our interest at this point is merely to collect the useful end results. Interested users may find the detailed discussions of the many subtleties and narrower points in White (2001) and Davidson and MacKinnon (1993, Chapter 4).


    autocorrelation in $\varepsilon_t$.⁸ Suppose (as before) that $\{z_t\}_{t=-\infty}^{t=\infty}$ is a stationary and ergodic stochastic process. We consider $\sqrt{T}\,\bar{z}_T$. The following conditions are assumed:⁹

    1. Summability of autocovariances: With dependent observations,

    $$\lim_{T \to \infty} \operatorname{Var}[\sqrt{T}\,\bar{z}] = \sum_{t=0}^{\infty} \sum_{s=0}^{\infty} \operatorname{Cov}[z_t, z_s'] = \sum_{k=-\infty}^{\infty} \Gamma_k = \Gamma^*.$$

    To begin, we will need to assume that this matrix is finite, a condition called summability. Note this is the condition needed for convergence of $Q_T^*$ in (12-11). If the sum is to be finite, then the $k = 0$ term must be finite, which gives us a necessary condition

    $$E[z_t z_t'] = \Gamma_0, \text{ a finite matrix.}$$

    2. Asymptotic uncorrelatedness: $E[z_t \mid z_{t-k}, z_{t-k-1}, \ldots]$ converges in mean square to zero as $k \to \infty$. Note that this is similar to the condition for ergodicity. White (2001) demonstrates that a (nonobvious) implication of this assumption is $E[z_t] = 0$.

    3. Asymptotic negligibility of innovations: Let

    $$r_{tk} = E[z_t \mid z_{t-k}, z_{t-k-1}, \ldots] - E[z_t \mid z_{t-k-1}, z_{t-k-2}, \ldots].$$

    An observation $z_t$ may be viewed as the accumulated information that has entered the process since it began up to time $t$. Thus, it can be shown that

    $$z_t = \sum_{s=0}^{\infty} r_{ts}.$$

    The vector $r_{tk}$ can be viewed as the information in this accumulated sum that entered the process at time $t - k$. The condition imposed on the process is that $\sum_{s=0}^{\infty} \sqrt{E[r_{ts}' r_{ts}]}$ be finite. In words, condition (3) states that information eventually becomes negligible as it fades far back in time from the current observation. The AR(1) model (as usual) helps to illustrate this point. If $z_t = \rho z_{t-1} + u_t$, then

    $$r_{t0} = E[z_t \mid z_t, z_{t-1}, \ldots] - E[z_t \mid z_{t-1}, z_{t-2}, \ldots] = z_t - \rho z_{t-1} = u_t,$$

    $$r_{t1} = E[z_t \mid z_{t-1}, z_{t-2}, \ldots] - E[z_t \mid z_{t-2}, z_{t-3}, \ldots] = E[\rho z_{t-1} + u_t \mid z_{t-1}, z_{t-2}, \ldots] - E[\rho(\rho z_{t-2} + u_{t-1}) + u_t \mid z_{t-2}, z_{t-3}, \ldots] = \rho(z_{t-1} - \rho z_{t-2}) = \rho u_{t-1}.$$

    By a similar construction, $r_{tk} = \rho^k u_{t-k}$, from which it follows that $z_t = \sum_{s=0}^{\infty} \rho^s u_{t-s}$, which we saw earlier in (12-3). You can verify that if $|\rho| < 1$, the negligibility condition will be met.
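
    For the scalar AR(1) case, the summability and negligibility conditions can be worked out explicitly. The lines below are a sketch of that calculation under the assumptions $|\rho| < 1$ and $\operatorname{Var}[u_t] = \sigma_u^2$, using $\gamma_k = \rho^{|k|}\sigma_u^2/(1-\rho^2)$ from (12-8); the closed forms are added here for illustration.

```latex
\begin{aligned}
\Gamma^* &= \sum_{k=-\infty}^{\infty} \gamma_k
          = \gamma_0 \Bigl( 1 + 2 \sum_{k=1}^{\infty} \rho^k \Bigr)
          = \gamma_0 \, \frac{1 + \rho}{1 - \rho}
          = \frac{\sigma_u^2}{(1 - \rho)^2} < \infty, \\
\sum_{s=0}^{\infty} \sqrt{E[r_{ts}^2]}
         &= \sum_{s=0}^{\infty} \sqrt{\rho^{2s} \sigma_u^2}
          = \sigma_u \sum_{s=0}^{\infty} |\rho|^s
          = \frac{\sigma_u}{1 - |\rho|} < \infty .
\end{aligned}
```

    Both sums diverge as $|\rho|$ approaches 1, which is the sense in which the stationarity restriction (12-5) is what makes conditions 1 and 3 hold for this process.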
    ⁸ Detailed analysis of this case is quite intricate and well beyond the scope of this book. Some fairly terse analysis may be found in White (2001, pp. 122-133) and Hayashi (2000).
    ⁹ See Hayashi (2000, p. 405) who attributes the results to Gordin (1969).


    With all this machinery in place, we now have the theorem we will need:

    THEOREM 12.4 Gordin's Central Limit Theorem. If conditions (1)-(3) listed above are met, then $\sqrt{T}\,\bar{z}_T \xrightarrow{d} N[0, \Gamma^*]$.

    We will be able to employ these tools when we consider the least squares, IV, and GLS estimators in the discussion to follow.

    12.5 LEAST SQUARES ESTIMATION

    The least squares estimator is

    $$b = (X'X)^{-1}X'y = \beta + \left( \frac{X'X}{T} \right)^{-1} \left( \frac{X'\varepsilon}{T} \right).$$

    Unbiasedness follows from the results in Chapter 4; no modification is needed. We know from Chapter 10 that the Gauss-Markov Theorem has been lost: assuming it exists (that remains to be established), the GLS estimator is efficient and OLS is not. How much information is lost by using least squares instead of GLS depends on the data. Broadly, least squares fares better in data which have long periods and little cyclical variation, such as aggregate output series. As might be expected, the greater is the autocorrelation in $\varepsilon$, the greater will be the benefit to using generalized least squares (when this is possible). Even if the disturbances are normally distributed, the usual $F$ and $t$ statistics do not have those distributions. So, not much remains of the finite sample properties we obtained in Chapter 4. The asymptotic properties remain to be established.

    12.5.1 ASYMPTOTIC PROPERTIES OF LEAST SQUARES

    The asymptotic properties of $b$ are straightforward to establish given our earlier results. If we assume that the process generating $x_t$ is stationary and ergodic, then by Theorems 12.1 and 12.2, $(1/T)(X'X)$ converges to $Q$ and we can apply the Slutsky theorem to the inverse. If $\varepsilon_t$ is not serially correlated, then $w_t = x_t\varepsilon_t$ is a martingale difference sequence, so $(1/T)(X'\varepsilon)$ converges to zero. This establishes consistency for the simple case. On the other hand, if $[x_t, \varepsilon_t]$ are jointly stationary and ergodic, then we can invoke the Ergodic Theorems 12.1 and 12.2 for both moment matrices and establish consistency. Asymptotic normality is a bit more subtle. For the case without serial correlation in $\varepsilon_t$, we can employ Theorem 12.3 for $\sqrt{T}\,\bar{w}$. The involved case is the one that interested us at the outset of this discussion, that is, where there is autocorrelation in $\varepsilon_t$ and dependence in $x_t$. Theorem 12.4 is in place for this case. Once again, the conditions described in the preceding section must apply and, moreover, the assumptions needed will have to be established both for $x_t$ and $\varepsilon_t$. Commentary on these cases may be found in Davidson and MacKinnon (1993), Hamilton (1994), White (2001), and Hayashi (2000). Formal presentation extends beyond the scope of this text, so at this point, we will proceed, and assume that the conditions underlying Theorem 12.4 are met. The results suggested


    here are quite general, albeit only sketched for the general case. For the remainder of our examination, at least in this chapter, we will confine attention to fairly simple processes in which the necessary conditions for the asymptotic distribution theory will be fairly evident. There is an important exception to the results in the preceding paragraph. If the regression contains any lagged values of the dependent variable, then least squares will no longer be unbiased or consistent. To take the simplest case, suppose that

    $$y_t = \beta y_{t-1} + \varepsilon_t, \qquad \varepsilon_t = \rho\varepsilon_{t-1} + u_t \tag{12-12}$$

    and assume $|\beta| < 1$, $|\rho| < 1$. In this model, the regressor and the disturbance are correlated. There are various ways to approach the analysis. One useful way is to rearrange (12-12) by subtracting $\rho y_{t-1}$ from $y_t$. Then,

    $$y_t = (\beta + \rho) y_{t-1} - \beta\rho\, y_{t-2} + u_t \tag{12-13}$$

    which is a classical regression with stochastic regressors. Since $u_t$ is an innovation in period $t$, it is uncorrelated with both regressors, and least squares regression of $y_t$ on $(y_{t-1}, y_{t-2})$ estimates $\rho_1 = (\beta + \rho)$ and $\rho_2 = -\beta\rho$. What is estimated by regression of $y_t$ on $y_{t-1}$ alone? Let $\gamma_k = \operatorname{Cov}[y_t, y_{t-k}] = \operatorname{Cov}[y_t, y_{t+k}]$. By stationarity, $\operatorname{Var}[y_t] = \operatorname{Var}[y_{t-1}]$, and $\operatorname{Cov}[y_t, y_{t-1}] = \operatorname{Cov}[y_{t-1}, y_{t-2}]$, and so on. These and (12-13) imply the following relationships:

    $$\begin{aligned}
    \gamma_0 &= \rho_1\gamma_1 + \rho_2\gamma_2 + \sigma_u^2, \\
    \gamma_1 &= \rho_1\gamma_0 + \rho_2\gamma_1, \\
    \gamma_2 &= \rho_1\gamma_1 + \rho_2\gamma_0.
    \end{aligned} \tag{12-14}$$

    (These are the Yule-Walker equations for this model. See Section 20.2.3.) The slope in the simple regression estimates $\gamma_1/\gamma_0$, which can be found in the solutions to these three equations. (An alternative approach is to use the left out variable formula, which is a useful way to interpret this estimator.) In this case, we see that the slope in the short regression is an estimator of $(\beta + \rho) - \beta\rho(\gamma_1/\gamma_0)$. In either case, solving the three equations in (12-14) for $\gamma_0$, $\gamma_1$, and $\gamma_2$ in terms of $\rho_1$, $\rho_2$, and $\sigma_u^2$ produces

    $$\operatorname{plim} b = \frac{\beta + \rho}{1 + \beta\rho}. \tag{12-15}$$

    This result is between $\beta$ (when $\rho = 0$) and 1 (when both $\beta$ and $\rho = 1$). Therefore, least squares is inconsistent unless $\rho$ equals zero. The more general case that includes regressors, $x_t$, involves more complicated algebra, but gives essentially the same result. This is a general result; when the equation contains a lagged dependent variable in the presence of autocorrelation, OLS and GLS are inconsistent. The problem can be viewed as one of an omitted variable.
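
    The probability limit in (12-15) is easy to see in a simulation. The sketch below generates data from (12-12) and regresses $y_t$ on $y_{t-1}$ alone, without a constant. The function names, the parameter values $\beta = 0.5$ and $\rho = 0.4$, the standard normal innovations, and the burn-in length are all assumptions made for this illustration.

```python
import numpy as np

def simulate_lagged_dv(beta, rho, T, burn_in=500, seed=0):
    """Simulate y_t = beta*y_{t-1} + eps_t with eps_t = rho*eps_{t-1} + u_t, as in (12-12)."""
    rng = np.random.default_rng(seed)
    n = T + burn_in
    u = rng.normal(size=n)
    eps = np.zeros(n)
    y = np.zeros(n)
    for t in range(1, n):
        eps[t] = rho * eps[t - 1] + u[t]
        y[t] = beta * y[t - 1] + eps[t]
    return y[burn_in:]   # discard the start-up period

def ols_slope_on_lag(y):
    """Least squares slope in the regression of y_t on y_{t-1} (no constant)."""
    y_lag, y_cur = y[:-1], y[1:]
    return float(np.dot(y_lag, y_cur) / np.dot(y_lag, y_lag))

beta, rho = 0.5, 0.4
y = simulate_lagged_dv(beta, rho, T=200_000)
b = ols_slope_on_lag(y)
plim_b = (beta + rho) / (1.0 + beta * rho)   # right-hand side of (12-15)
print("OLS slope:", b, "plim from (12-15):", plim_b, "true beta:", beta)
```

    In a long sample the estimate settles near $(\beta + \rho)/(1 + \beta\rho)$ rather than near $\beta$; setting `rho = 0.0` in the same code brings the estimate back toward $\beta$.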
    12.5.2 ESTIMATING THE VARIANCE OF THE LEAST SQUARES ESTIMATOR

    As usual, $s^2(X'X)^{-1}$ is an inappropriate estimator of $\sigma^2(X'X)^{-1}(X'\Omega X)(X'X)^{-1}$, both because $s^2$ is a biased estimator of $\sigma^2$ and because the matrix is incorrect. Generalities


    TABLE 12.1  Robust Covariance Estimation

    Variable      OLS Estimate    OLS SE    Corrected SE
    Constant          0.7746      0.0335       0.0733
    ln Output         0.2955      0.0190       0.0394
    ln CPI            0.5613      0.0339       0.0708

    R² = 0.99655, d = 0.15388, r = 0.92331.

    are scarce, but in general, for economic time series which are positively related to their past values, the standard errors conventionally estimated by least squares are likely to be too small. For slowly changing, trending aggregates such as output and consumption, this is probably the norm. For highly variable data such as inflation, exchange rates, and market returns, the situation is less clear. Nonetheless, as a general proposition, one would normally not want to rely on $s^2(X'X)^{-1}$ as an estimator of the asymptotic covariance matrix of the least squares estimator. In view of this situation, if one is going to use least squares, then it is desirable to have an appropriate estimator of the covariance matrix of the least squares estimator. There are two approaches. If the form of the autocorrelation is known, then one can estimate the parameters of $\Omega$ directly and compute a consistent estimator. Of course, if so, then it would be more sensible to use feasible generalized least squares instead and not waste the sample information on an inefficient estimator. The second approach parallels the use of the White estimator for heteroscedasticity. Suppose that the form of the autocorrelation is unknown. Then, a direct estimator of $\Omega$ or $\Omega(\theta)$ is not available. The problem is estimation of

    $$\Sigma = \frac{1}{T} \sum_{t=1}^{T} \sum_{s=1}^{T} \rho_{|t-s|}\, x_t x_s'. \tag{12-16}$$

    Following White's suggestion for heteroscedasticity, Newey and West's (1987a) robust, consistent estimator for autocorrelated disturbances with an unspecified structure is

    $$S_* = S_0 + \frac{1}{T} \sum_{j=1}^{L} \sum_{t=j+1}^{T} \left( 1 - \frac{j}{L+1} \right) e_t e_{t-j} \left[ x_t x_{t-j}' + x_{t-j} x_t' \right]. \tag{12-17}$$

    [See (10-16) in Section 10.3.] The maximum lag $L$ must be determined in advance to be large enough that autocorrelations at lags longer than $L$ are small enough to ignore. For a moving-average process, this value can be expected to be a relatively small number. For autoregressive processes or mixtures, however, the autocorrelations are never zero, and the researcher must make a judgment as to how far back it is necessary to go.¹⁰
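
    A direct implementation of (12-17) takes only a few lines. The sketch below is an illustration under stated assumptions, not a reference implementation: the function name `newey_west_cov` is hypothetical, $S_0$ is taken to be the White term $(1/T)\sum_t e_t^2 x_t x_t'$, the covariance matrix of $b$ is formed as $T(X'X)^{-1}S_*(X'X)^{-1}$, and the simulated data (an AR(1) regressor and AR(1) disturbances, both with coefficient 0.8) are invented for the example. The lag length follows the rule of thumb in the footnote to this passage.

```python
import numpy as np

def newey_west_cov(X, e, L):
    """Estimated asymptotic covariance of OLS b: T (X'X)^{-1} S (X'X)^{-1}, S from (12-17)."""
    T = X.shape[0]
    Xe = X * e[:, None]                  # row t holds e_t * x_t'
    S = Xe.T @ Xe / T                    # S_0, assumed to be the White (lag 0) term
    for j in range(1, L + 1):
        w = 1.0 - j / (L + 1.0)          # Bartlett weight 1 - j/(L+1)
        G = Xe[j:].T @ Xe[:-j] / T       # (1/T) sum_{t=j+1}^T e_t e_{t-j} x_t x_{t-j}'
        S += w * (G + G.T)               # adds x_t x_{t-j}' + x_{t-j} x_t'
    XtX_inv = np.linalg.inv(X.T @ X)
    return T * XtX_inv @ S @ XtX_inv

# Hypothetical data: persistent regressor and persistent disturbance.
rng = np.random.default_rng(0)
T = 2000
vx, u = rng.normal(size=T), rng.normal(size=T)
x, eps = np.zeros(T), np.zeros(T)
for t in range(1, T):
    x[t] = 0.8 * x[t - 1] + vx[t]
    eps[t] = 0.8 * eps[t - 1] + u[t]
X = np.column_stack([np.ones(T), x])
y = 1.0 + 0.5 * x + eps

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
s2 = e @ e / (T - X.shape[1])
se_ols = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))   # conventional standard errors

L = int(np.ceil(T ** 0.25))              # smallest integer >= T^(1/4)
V_nw = newey_west_cov(X, e, L)
se_nw = np.sqrt(np.diag(V_nw))
print("L =", L)
print("conventional SE:", se_ols)
print("Newey-West SE:  ", se_nw)
```

    With positively autocorrelated regressors and disturbances, the corrected standard errors come out larger than the conventional ones, which is the pattern Table 12.1 displays. Setting `L = 0` reduces the function to the White estimator.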
    Example 12.4 Autocorrelation Consistent Covariance Estimation

    For the model shown in Example 12.1, the regression results with the uncorrected standard errors and the Newey-West autocorrelation robust covariance matrix for lags of 5 quarters are shown in Table 12.1. The effect of the very high degree of autocorrelation is evident.

    ¹⁰ Davidson and MacKinnon (1993) give further discussion. Current practice is to use the smallest integer greater than or equal to $T^{1/4}$.


    12.6 GMM ESTIMATION

    The GMM estimator in the regression model with autocorrelated disturbances is produced by the empirical moment equations

    $$\frac{1}{T} \sum_{t=1}^{T} x_t \left( y_t - x_t'\hat{\beta}_{GMM} \right) = \frac{1}{T} X'\hat{\varepsilon}(\hat{\beta}_{GMM}) = \bar{m}(\hat{\beta}_{GMM}) = 0. \tag{12-18}$$

    The estimator is obtained by minimizing

    $$q = \bar{m}'(\hat{\beta}_{GMM})\, W\, \bar{m}(\hat{\beta}_{GMM}),$$

    where $W$ is a positive definite weighting matrix. The optimal weighting matrix would be

    $$W = \left\{ \text{Asy. Var}[\sqrt{T}\,\bar{m}(\beta)] \right\}^{-1},$$

    which is the inverse of

    $$\text{Asy. Var}[\sqrt{T}\,\bar{m}(\beta)] = \text{Asy. Var}\left[ \frac{1}{\sqrt{T}} \sum_{t=1}^{T} x_t \varepsilon_t \right] = \operatorname*{plim}_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \sum_{s=1}^{T} \sigma^2 \rho_{ts}\, x_t x_s' = \sigma^2 Q^*.$$

    The optimal weighting matrix would be $[\sigma^2 Q^*]^{-1}$. As in the heteroscedasticity case, this minimization problem is an exactly identified case, so the weighting matrix is irrelevant to the solution. The GMM estimator for the regression model with autocorrelated disturbances is ordinary least squares. We can use the results in Section 12.5.2 to construct the asymptotic covariance matrix. We will require the assumptions in Section 12.4 to obtain convergence of the moments and asymptotic normality. We will wish to extend this simple result in one instance. In the common case in which $x_t$ contains lagged values of $y_t$, we will want to use an instrumental variable estimator. We will return to that estimation problem in Section 12.9.4.

    12.7 TESTING FOR AUTOCORRELATION

    The available tests for autocorrelation are based on the principle that if the true disturbances are autocorrelated, then this fact can be detected through the autocorrelations of the least squares residuals. The simplest indicator is the slope in the artificial regression

    $$e_t = r e_{t-1} + v_t, \qquad e_t = y_t - x_t'b, \qquad r = \frac{\sum_{t=2}^{T} e_t e_{t-1}}{\sum_{t=1}^{T} e_t^2}. \tag{12-19}$$

    If there is autocorrelation, then the slope in this regression will be an estimator of $\rho = \operatorname{Corr}[\varepsilon_t, \varepsilon_{t-1}]$. The complication in the analysis lies in determining a formal means of evaluating when the estimator is "large," that is, on what statistical basis to reject


    the null hypothesis that $\rho$ equals zero. As a first approximation, treating (12-19) as a classical linear model and using a $t$ or $F$ (squared $t$) test to test the hypothesis is a valid way to proceed (based on the Lagrange multiplier principle). We used this device in Example 12.3. The tests we consider here are refinements of this approach.

    12.7.1 LAGRANGE MULTIPLIER TEST

    The Breusch (1978)-Godfrey (1978) test is a Lagrange multiplier test of $H_0$: no autocorrelation versus $H_1$: $\varepsilon_t = \text{AR}(P)$ or $\varepsilon_t = \text{MA}(P)$. The same test is used for either structure. The test statistic is

    $$LM = T \left( \frac{e'X_0 (X_0'X_0)^{-1} X_0'e}{e'e} \right) = T R_0^2, \tag{12-20}$$

    where $X_0$ is the original $X$ matrix augmented by $P$ additional columns containing the lagged OLS residuals, $e_{t-1}, \ldots, e_{t-P}$. The test can be carried out simply by regressing the ordinary least squares residuals $e_t$ on $x_{t0}$ (filling in missing values for lagged residuals with zeros) and referring $TR_0^2$ to the tabled critical value for the chi-squared distribution with $P$ degrees of freedom.¹¹ Since $X'e = 0$, the test is equivalent to regressing $e_t$ on the part of the lagged residuals that is unexplained by $X$. There is therefore a compelling logic to it; if any fit is found, then it is due to correlation between the current and lagged residuals. The test is a joint test of the first $P$ autocorrelations of $\varepsilon_t$, not just the first.
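
    The recipe for the LM statistic in (12-20) translates directly into code. The following sketch is an illustration only: the function name `breusch_godfrey_lm` is hypothetical, missing lagged residuals are filled with zeros (one of the two conventions in use), and the simulated data with AR(1) disturbances ($\rho = 0.5$) are invented for the example. The 5 percent critical value 5.991 for a chi-squared variable with 2 degrees of freedom is hard-coded rather than looked up.

```python
import numpy as np

def breusch_godfrey_lm(X, e, P):
    """LM = T * R0^2 from regressing OLS residuals e on [X, e_{t-1}, ..., e_{t-P}]."""
    T = len(e)
    lagged = np.zeros((T, P))
    for j in range(1, P + 1):
        lagged[j:, j - 1] = e[:-j]       # column j holds e_{t-j}; first j entries stay zero
    X0 = np.column_stack([X, lagged])
    coef = np.linalg.lstsq(X0, e, rcond=None)[0]
    fitted = X0 @ coef
    R2_0 = float(fitted @ fitted / (e @ e))   # e'X0 (X0'X0)^{-1} X0'e / e'e
    return T * R2_0

# Hypothetical data with AR(1) disturbances.
rng = np.random.default_rng(0)
T, rho = 500, 0.5
x, u = rng.normal(size=T), rng.normal(size=T)
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = rho * eps[t - 1] + u[t]
X = np.column_stack([np.ones(T), x])
y = 1.0 + 2.0 * x + eps

b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b
lm = breusch_godfrey_lm(X, e, P=2)
CHI2_CRIT_5PCT_2DF = 5.991
print("LM =", lm, "reject H0:", lm > CHI2_CRIT_5PCT_2DF)
```

    Because software differs on how the first $P$ observations are handled, a statistic computed this way may differ slightly from one that drops those observations instead of zero-filling them.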
    12.7.2 BOX AND PIERCE'S TEST AND LJUNG'S REFINEMENT

    An alternative test which is asymptotically equivalent to the LM test when the null hypothesis, $\rho = 0$, is true and when X does not contain lagged values of y is due to Box and Pierce (1970). The Q test is carried out by referring

    $$Q = T \sum_{j=1}^{P} r_j^2, \tag{12-21}$$

    where $r_j = \left(\sum_{t=j+1}^{T} e_t e_{t-j}\right) / \left(\sum_{t=1}^{T} e_t^2\right)$, to the critical values of the chi-squared table with P degrees of freedom. A refinement suggested by Ljung and Box (1979) is

    $$Q' = T(T+2) \sum_{j=1}^{P} \frac{r_j^2}{T-j}. \tag{12-22}$$

    The essential difference between the Godfrey-Breusch and the Box-Pierce tests is the use of partial correlations (controlling for X and the other variables) in the former and simple correlations in the latter. Under the null hypothesis, there is no autocorrelation in $\varepsilon_t$, and no correlation between $x_t$ and $\varepsilon_s$ in any event, so the two tests are asymptotically equivalent. On the other hand, since it does not condition on $x_t$, the
    11 A warning to practitioners: Current software varies on whether the lagged residuals are filled with zeros or the first P observations are simply dropped when computing this statistic. In the interest of replicability, users should determine which is the case before reporting results.


    Box-Pierce test is less powerful than the LM test when the null hypothesis is false, as intuition might suggest.
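    The two Q statistics in (12-21) and (12-22) can be sketched directly from their definitions. The residual series and function names below are hypothetical; only the formulas come from the text.

```python
def autocorrelations(e, P):
    """r_j = sum_{t=j+1}^T e_t e_{t-j} / sum_{t=1}^T e_t^2 for j = 1, ..., P."""
    den = sum(v * v for v in e)
    return [sum(e[t] * e[t - j] for t in range(j, len(e))) / den
            for j in range(1, P + 1)]

def box_pierce(e, P):
    """Q = T * sum_j r_j^2, as in (12-21)."""
    T = len(e)
    return T * sum(r * r for r in autocorrelations(e, P))

def ljung_box(e, P):
    """Q' = T (T + 2) * sum_j r_j^2 / (T - j), as in (12-22)."""
    T = len(e)
    r = autocorrelations(e, P)
    return T * (T + 2) * sum(r[j - 1] ** 2 / (T - j) for j in range(1, P + 1))

# Hypothetical least squares residuals, for illustration only.
e = [0.5, 0.4, 0.1, -0.2, -0.4, -0.3, 0.0, 0.3, 0.4, 0.2, -0.1, -0.5]
q = box_pierce(e, 3)
q_refined = ljung_box(e, 3)
print(round(q, 3), round(q_refined, 3))
```

    Because each weight $(T+2)/(T-j)$ exceeds one, the refined statistic is never smaller than the Box-Pierce value computed from the same autocorrelations.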
    12.7.3 THE DURBIN-WATSON TEST

    The Durbin-Watson statistic (footnote 12) was the first formal procedure developed for testing for autocorrelation using the least squares residuals. The test statistic is

    $$d = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2} = 2(1 - r) - \frac{e_1^2 + e_T^2}{\sum_{t=1}^{T} e_t^2}, \tag{12-23}$$

    where r is the same first-order autocorrelation which underlies the preceding two statistics. If the sample is reasonably large, then the last term will be negligible, leaving $d \approx 2(1 - r)$. The statistic takes this form because the authors were able to determine the exact distribution of this transformation of the autocorrelation and could provide tables of critical values. Useable critical values which depend only on T and K are presented in tables such as that at the end of this book. The one-sided test for $H_0\colon \rho = 0$ against $H_1\colon \rho > 0$ is carried out by comparing d to values $d_L(T, K)$ and $d_U(T, K)$. If $d < d_L$, the null hypothesis is rejected; if $d > d_U$, the hypothesis is not rejected. If d lies between $d_L$ and $d_U$, then no conclusion is drawn.
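    The identity in (12-23) is easy to confirm numerically. The sketch below computes d directly and again from r and the end-point correction; the residuals are hypothetical and the function name is chosen only for this illustration.

```python
def durbin_watson(e):
    """d = sum_{t=2}^T (e_t - e_{t-1})^2 / sum_{t=1}^T e_t^2, as in (12-23)."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return num / sum(v * v for v in e)

# Hypothetical least squares residuals, for illustration only.
e = [0.5, 0.4, 0.1, -0.2, -0.4, -0.3, 0.0, 0.3, 0.4, 0.2, -0.1, -0.5]
sum_sq = sum(v * v for v in e)
r = sum(e[t] * e[t - 1] for t in range(1, len(e))) / sum_sq

d = durbin_watson(e)
d_from_r = 2.0 * (1.0 - r) - (e[0] ** 2 + e[-1] ** 2) / sum_sq
print(round(d, 4), round(d_from_r, 4), round(2.0 * (1.0 - r), 4))
```

    In a sample this short the end-point term is far from negligible, so $2(1 - r)$ is a poor approximation to d here; the gap shrinks as T grows.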
    12.7.4 TESTING IN THE PRESENCE OF A LAGGED DEPENDENT VARIABLE

    The Durbin-Watson test is not likely to be valid when there is a lagged dependent variable in the equation (footnote 13). The statistic will usually be biased toward a finding of no autocorrelation. Three alternatives have been devised. The LM and Q tests can be used whether or not the regression contains a lagged dependent variable. As an alternative to the standard test, Durbin (1970) derived a Lagrange multiplier test that is appropriate in the presence of a lagged dependent variable. The test may be carried out by referring

    $$h = r \sqrt{\frac{T}{1 - T s_c^2}}, \tag{12-24}$$

    where $s_c^2$ is the estimated variance of the least squares regression coefficient on $y_{t-1}$, to the standard normal tables. Large values of h lead to rejection of $H_0$. The test has the virtues that it can be used even if the regression contains additional lags of $y_t$, and it can be computed using the standard results from the initial regression without any further regressions. If $s_c^2 > 1/T$, however, then it cannot be computed. An alternative is to regress $e_t$ on $x_t, y_{t-1}, \ldots, e_{t-1}$, and any additional lags that are appropriate for $e_t$, and then to test the joint significance of the coefficient(s) on the lagged residual(s) with the standard F test. This method is a minor modification of the Breusch-Godfrey test. Under $H_0$, the coefficients on the remaining variables will be zero, so the tests are the same asymptotically.
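    A minimal sketch of (12-24), including the case in which the statistic cannot be computed. The input values are hypothetical; returning `None` when $T s_c^2 \ge 1$ is a design choice made for this illustration.

```python
import math

def durbin_h(r, T, s_c2):
    """Durbin's h from (12-24); returns None when 1 - T * s_c2 is not positive."""
    denom = 1.0 - T * s_c2
    if denom <= 0.0:
        return None
    return r * math.sqrt(T / denom)

# Hypothetical inputs: r = 0.3, T = 50, estimated variance of the coefficient on y_{t-1} = 0.01.
h = durbin_h(0.3, 50, 0.01)
h_undefined = durbin_h(0.3, 50, 0.03)  # T * s_c2 = 1.5, so h cannot be computed
print(h, h_undefined)
```

    The value of h is referred to the standard normal table; when the function returns `None`, the regression-based alternative described above is the fallback.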

    12 Durbin and Watson (1950, 1951, 1971).

    13 This issue has been studied by Nerlove and Wallis (1966), Durbin (1970), and Dezhbaksh (1990).

    12.7.5 SUMMARY OF TESTING PROCEDURES

    The preceding has examined several testing procedures for locating autocorrelation in the disturbances. In all cases, the procedure examines the least squares residuals. We can summarize the procedures as follows:

    LM Test: $LM = TR^2$ in a regression of the least squares residuals on $[x_t, e_{t-1}, \ldots, e_{t-P}]$. Reject $H_0$ if $LM > \chi^2[P]$. This test examines the covariance of the residuals with lagged values, controlling for the intervening effect of the independent variables.

    Q Test: $Q = T(T+2) \sum_{j=1}^{P} r_j^2/(T-j)$. Reject $H_0$ if $Q > \chi^2[P]$. This test examines the raw correlations between the residuals and P lagged values of the residuals.

    Durbin-Watson Test: $d \approx 2(1 - r)$. Reject $H_0\colon \rho = 0$ if $d < d_L$. This test looks directly at the first-order autocorrelation of the residuals.

    Durbin's Test: $F_D$ = the F statistic for the joint significance of P lags of the residuals in the regression of the least squares residuals on $[x_t, y_{t-1}, \ldots, y_{t-R}, e_{t-1}, \ldots, e_{t-P}]$. Reject $H_0$ if $F_D > F[P, T-K-P]$. This test examines the partial correlations between the residuals and the lagged residuals, controlling for the intervening effect of the independent variables and the lagged dependent variable.

    The Durbin-Watson test has some major shortcomings. The inconclusive region is large if T is small or moderate. The bounding distributions, while free of the parameters $\beta$ and $\sigma$, do depend on the data (and assume that X is nonstochastic). An exact version based on an algorithm developed by Imhof (1980) avoids the inconclusive region, but is rarely used. The LM and Box-Pierce statistics do not share these shortcomings; their limiting distributions are chi-squared independently of the data and the parameters. For this reason, the LM test has become the standard method in applied research.

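    12.8 EFFICIENT ESTIMATION WHEN $\Omega$ IS KNOWN

    As a prelude to deriving feasible estimators for $\beta$ in this model, we consider full generalized least squares estimation assuming that $\Omega$ is known. In the next section, we will turn to the more realistic case in which $\Omega$ must be estimated as well. If the parameters of $\Omega$ are known, then the GLS estimator

    $$\hat{\beta} = (X'\Omega^{-1}X)^{-1}(X'\Omega^{-1}y) \tag{12-25}$$

    and the estimate of its sampling variance,

    $$\text{Est. Var}[\hat{\beta}] = \hat{\sigma}_\varepsilon^2 [X'\Omega^{-1}X]^{-1}, \tag{12-26}$$

    where

    $$\hat{\sigma}_\varepsilon^2 = \frac{(y - X\hat{\beta})'\Omega^{-1}(y - X\hat{\beta})}{T}, \tag{12-27}$$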


    can be computed in one step. For the AR(1) case, data for the transformed model are

    $$y_* = \begin{bmatrix} \sqrt{1-\rho^2}\, y_1 \\ y_2 - \rho y_1 \\ y_3 - \rho y_2 \\ \vdots \\ y_T - \rho y_{T-1} \end{bmatrix}, \qquad X_* = \begin{bmatrix} \sqrt{1-\rho^2}\, x_1 \\ x_2 - \rho x_1 \\ x_3 - \rho x_2 \\ \vdots \\ x_T - \rho x_{T-1} \end{bmatrix}. \tag{12-28}$$

    These transformations are variously labeled partial differences, quasi differences, or pseudodifferences. Note that in the transformed model, every observation except the first contains a constant term. What was the column of 1s in X is transformed to $[(1-\rho^2)^{1/2}, (1-\rho), (1-\rho), \ldots]$. Therefore, if the sample is relatively small, then the problems with measures of fit noted in Section 3.5 will reappear.

    The variance of the transformed disturbance is

    $$\operatorname{Var}[\varepsilon_t - \rho \varepsilon_{t-1}] = \operatorname{Var}[u_t] = \sigma_u^2.$$

    The variance of the first disturbance is also $\sigma_u^2$ [see (12-6)]. This can be estimated using $(1 - \rho^2)\hat{\sigma}_\varepsilon^2$.

    Corresponding results have been derived for higher-order autoregressive processes. For the AR(2) model,

    $$\varepsilon_t = \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + u_t, \tag{12-29}$$

    the transformed data for generalized least squares are obtained by

    $$z_{*1} = \left[\frac{(1+\theta_2)\left[(1-\theta_2)^2 - \theta_1^2\right]}{1-\theta_2}\right]^{1/2} z_1,$$

    $$z_{*2} = (1-\theta_2^2)^{1/2} z_2 - \frac{\theta_1 (1-\theta_2^2)^{1/2}}{1-\theta_2}\, z_1, \tag{12-30}$$

    $$z_{*t} = z_t - \theta_1 z_{t-1} - \theta_2 z_{t-2}, \qquad t > 2,$$

    where $z_t$ is used for $y_t$ or $x_t$. The transformation becomes progressively more complex for higher-order processes (footnote 14).

    Note that in both the AR(1) and AR(2) models, the transformation to $y_*$ and $X_*$ involves starting values for the processes that depend only on the first one or two observations. We can view the process as having begun in the infinite past. Since the sample contains only T observations, however, it is convenient to treat the first one or two (or P) observations as shown and consider them as initial values. Whether we view the process as having begun at time t = 1 or in the infinite past is ultimately immaterial in regard to the asymptotic properties of the estimators.

    The asymptotic properties for the GLS estimator are quite straightforward given the apparatus we assembled in Section 12.4. We begin by assuming that $\{x_t, \varepsilon_t\}$ are
    14 See Box and Jenkins (1984) and Fuller (1976).


    jointly an ergodic, stationary process. Then, after the GLS transformation, $\{x_{*t}, \varepsilon_{*t}\}$ is also stationary and ergodic. Moreover, $\varepsilon_{*t}$ is nonautocorrelated by construction. In the transformed model, then, $\{w_{*t}\} = \{x_{*t}\varepsilon_{*t}\}$ is a stationary and ergodic martingale difference series. We can use the Ergodic Theorem to establish consistency and the Central Limit Theorem for martingale difference sequences to establish asymptotic normality for GLS in this model. Formal arrangement of the relevant results is left as an exercise.
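    The AR(1) transformation in (12-28) can be sketched for a single data column. Applying it to a column of ones shows what happens to the constant term, as noted in the discussion of the transformed model. The function name and the value of $\rho$ are hypothetical.

```python
import math

def ar1_transform(z, rho):
    """Apply (12-28) to one column: scale the first observation, quasi-difference the rest."""
    first = math.sqrt(1.0 - rho ** 2) * z[0]
    return [first] + [z[t] - rho * z[t - 1] for t in range(1, len(z))]

# The column of 1s becomes [(1 - rho^2)^(1/2), 1 - rho, 1 - rho, ...].
ones = [1.0] * 5
const_star = ar1_transform(ones, 0.6)
print(const_star)
```

    The same function would be applied to y and to each column of X before running least squares on the transformed data.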

    12.9 ESTIMATION WHEN $\Omega$ IS UNKNOWN

    For an unknown $\Omega$, there are a variety of approaches. Any consistent estimator of $\Omega(\rho)$ will suffice; recall from Theorem 10.8 in Section 10.5.2, all that is needed for efficient estimation of $\beta$ is a consistent estimator of $\Omega(\rho)$. The complication arises, as might be expected, in estimating the autocorrelation parameter(s).

    12.9.1 AR(1) DISTURBANCES

    The AR(1) model is the one most widely used and studied. The most common procedure is to begin FGLS with a natural estimator of $\rho$, the autocorrelation of the residuals. Since b is consistent, we can use r. Others that have been suggested include Theil's (1971) estimator, $r[(T-K)/(T-1)]$, and Durbin's (1970), the slope on $y_{t-1}$ in a regression of $y_t$ on $y_{t-1}$, $x_t$, and $x_{t-1}$. The second step is FGLS based on (12-25)-(12-28). This is the Prais and Winsten (1954) estimator. The Cochrane and Orcutt (1949) estimator (based on computational ease) omits the first observation.

    It is possible to iterate any of these estimators to convergence. Since the estimator is asymptotically efficient at every iteration, nothing is gained by doing so. Unlike the heteroscedastic model, iterating when there is autocorrelation does not produce the maximum likelihood estimator. The iterated FGLS estimator, regardless of the estimator of $\rho$, does not account for the term $(1/2)\ln(1-\rho^2)$ in the log-likelihood function [see the following (12-31)].

    Maximum likelihood estimators can be obtained by maximizing the log-likelihood with respect to $\beta$, $\sigma_u^2$, and $\rho$. The log-likelihood function may be written

    $$\ln L = -\frac{\sum_{t=1}^{T} u_t^2}{2\sigma_u^2} + \frac{1}{2}\ln(1-\rho^2) - \frac{T}{2}\left(\ln 2\pi + \ln \sigma_u^2\right), \tag{12-31}$$

    where, as before, the first observation is computed differently from the others using (12-28). For a given value of $\rho$, the maximum likelihood estimators of $\beta$ and $\sigma_u^2$ are the usual ones, GLS and the mean squared residual using the transformed data. The problem is estimation of $\rho$. One possibility is to search the range $-1 < \rho < 1$ for the value that, with the implied estimates of the other parameters, maximizes ln L. [This is Hildreth and Lu's (1960) approach.] Beach and MacKinnon (1978a) argue that this way to do the search is very inefficient and have devised a much faster algorithm. Omitting the first observation and adding an approximation at the lower right corner produces


    the standard approximations to the asymptotic variances of the estimators,

    $$\text{Est. Asy. Var}[\hat{\beta}_{ML}] = \hat{\sigma}^2_{\varepsilon,ML} \left[X'\hat{\Omega}^{-1}_{ML} X\right]^{-1},$$

    $$\text{Est. Asy. Var}[\hat{\sigma}^2_{u,ML}] = 2\hat{\sigma}^4_{u,ML}/T, \tag{12-32}$$

    $$\text{Est. Asy. Var}[\hat{\rho}_{ML}] = (1 - \hat{\rho}^2_{ML})/T.$$

    All the foregoing estimators have the same asymptotic properties. The available evidence on their small-sample properties comes from Monte Carlo studies and is, unfortunately, only suggestive. Griliches and Rao (1969) find evidence that if the sample is relatively small and $\rho$ is not particularly large, say less than 0.3, then least squares is as good as or better than FGLS. The problem is the additional variation introduced into the sampling variance by the variance of r. Beyond these, the results are rather mixed. Maximum likelihood seems to perform well in general, but the Prais-Winsten estimator is evidently nearly as efficient. Both estimators have been incorporated in all contemporary software. In practice, Beach and MacKinnon's maximum likelihood estimator is probably the most common choice.
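    The two-step procedure described in this section can be sketched as follows: least squares, then r from the residuals, then least squares on the data transformed as in (12-28). This is an illustrative pure-Python sketch under simplifying assumptions; the function names, the `drop_first` switch (which gives the Cochrane-Orcutt variant), and the data are all hypothetical.

```python
import math

def ols(X, y):
    """Least squares coefficients from the normal equations (Gauss-Jordan elimination)."""
    k = len(X[0])
    a = [[sum(r[i] * r[j] for r in X) for j in range(k)] +
         [sum(r[i] * v for r, v in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(a[i][c]))
        a[c], a[p] = a[p], a[c]
        for i in range(k):
            if i != c:
                f = a[i][c] / a[c][c]
                a[i] = [u - f * w for u, w in zip(a[i], a[c])]
    return [a[i][k] / a[i][i] for i in range(k)]

def two_step_fgls(X, y, drop_first=False):
    """Step 1: OLS and r from the residuals. Step 2: OLS on the data transformed by (12-28)."""
    b = ols(X, y)
    e = [v - sum(bi * xi for bi, xi in zip(b, r)) for r, v in zip(X, y)]
    rho = sum(e[t] * e[t - 1] for t in range(1, len(e))) / sum(v * v for v in e)
    w = math.sqrt(1.0 - rho ** 2)
    Xs = [[w * v for v in X[0]]] + [[a - rho * c for a, c in zip(X[t], X[t - 1])]
                                    for t in range(1, len(X))]
    ys = [w * y[0]] + [y[t] - rho * y[t - 1] for t in range(1, len(y))]
    if drop_first:  # Cochrane-Orcutt omits the first observation
        Xs, ys = Xs[1:], ys[1:]
    return ols(Xs, ys), rho

# Hypothetical data: y = 2 + 3x plus an autocorrelated disturbance.
eps = [0.5, 0.4, 0.1, -0.2, -0.4, -0.3, 0.0, 0.3, 0.4, 0.2, -0.1, -0.5]
X = [[1.0, float(t)] for t in range(len(eps))]
y = [2.0 + 3.0 * t + eps[t] for t in range(len(eps))]
b_pw, rho_hat = two_step_fgls(X, y)
b_co, _ = two_step_fgls(X, y, drop_first=True)
print([round(v, 3) for v in b_pw], round(rho_hat, 3))
```

    Note that the coefficient on the transformed constant column is already the intercept of the original model, because the column of 1s is transformed along with the other regressors.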
    12.9.2 AR(2) DISTURBANCES

    Maximum likelihood procedures for most other disturbance processes are exceedingly complex. Beach and MacKinnon (1978b) have derived an algorithm for AR(2) disturbances. For higher-order autoregressive models, maximum likelihood estimation is presently impractical, but the two-step estimators can easily be extended. For models of the form

    $$\varepsilon_t = \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_p \varepsilon_{t-p} + u_t, \tag{12-33}$$

    a simple approach for estimation of the autoregressive parameters is to use the following method: Regress $e_t$ on $e_{t-1}, \ldots, e_{t-p}$, to obtain consistent estimates of the autoregressive parameters. With the estimates of $\theta_1, \ldots, \theta_p$ in hand, the Cochrane-Orcutt estimator can be obtained. If the model is an AR(2), the full FGLS procedure can be used instead. The least squares computations for the transformed data provide (at least asymptotically) the appropriate estimates of $\sigma_u^2$ and the covariance matrix of $\hat{\beta}$. As before, iteration is possible but brings no gains in efficiency.
    12.9.3 APPLICATION: ESTIMATION OF A MODEL WITH AUTOCORRELATION

    A restricted version of the model for the U.S. gasoline market that appears in Example 12.2 is

    $$\ln\frac{G_t}{pop_t} = \beta_1 + \beta_2 \ln P_{G,t} + \beta_3 \ln\frac{I_t}{pop_t} + \beta_4 \ln P_{NC,t} + \beta_5 \ln P_{UC,t} + \varepsilon_t.$$

    The results in Figure 12.2 suggest that the specification above may be incomplete, and, if so, there may be autocorrelation in the disturbance in this specification. Least squares estimation of the equation produces the results in the first row of Table 12.2. The first 5 autocorrelations of the least squares residuals are 0.674, 0.207, 0.049, 0.159, and 0.158. This produces Box-Pierce and Box-Ljung statistics of 19.816 and 21.788, respectively, both of which are larger than the critical value from the chi-squared table of 11.07. We regressed the least squares residuals on the independent variables and


    TABLE 12.2  Parameter Estimates (Standard Errors in Parentheses)

                            β1         β2         β3         β4         β5         ρ
    OLS                     7.736      0.0591     1.373      0.127      0.119      0.000
    (R2 = 0.95799)         (0.674)    (0.0325)   (0.0756)   (0.127)    (0.0813)   (0.000)
    Prais-Winsten           6.782      0.152      1.267      0.0308     0.0638     0.862
                           (0.955)    (0.0370)   (0.107)    (0.127)    (0.0758)   (0.0855)
    Cochrane-Orcutt         7.147      0.149      1.307      0.0599     0.0563     0.849
                           (1.297)    (0.0382)   (0.144)    (0.146)    (0.0789)   (0.0893)
    Maximum Likelihood      5.159      0.208      1.0828     0.0878     0.0351     0.930
                           (1.132)    (0.0349)   (0.127)    (0.125)    (0.0659)   (0.0620)
    AR(2)                  11.828      0.0310     1.415      0.192      0.114      r1 = 0.760
                           (0.888)    (0.0292)   (0.0682)   (0.133)    (0.0846)

    AR(2) coefficients: θ1 = 0.9936319, θ2 = -0.4620284

    five lags of the residuals. The coefficients on the lagged residuals and the associated t statistics are 1.075 (5.493), 0.712 (2.488), 0.310 (0.968), 0.227 (0.758), 0.000096 (0.000). The $R^2$ in this regression is 0.598223, which produces a chi-squared value of 21.536. The conclusion is the same. Finally, the Durbin-Watson statistic is 0.60470. For four regressors and 36 observations, the critical value of $d_L$ is 1.24, so on this basis as well, the hypothesis $\rho = 0$ would be rejected. The plot of the residuals shown in Figure 12.4 seems consistent with this conclusion. The Prais and Winsten FGLS estimates appear in the second row of Table 12.2, followed by the Cochrane and Orcutt results, then the maximum likelihood estimates.
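    The test statistics quoted in this application can be checked approximately from the reported figures. The sketch below uses T = 36, the five reported autocorrelations (only their squares enter, so their signs do not matter), and the reported $R^2$. The recomputed Q values, about 19.79 and 21.76, differ slightly from the reported 19.816 and 21.788, presumably because the autocorrelations are rounded to three digits; that explanation is an assumption, not something stated in the text.

```python
T = 36
r = [0.674, 0.207, 0.049, 0.159, 0.158]  # reported autocorrelations (magnitudes)

q_box_pierce = T * sum(v ** 2 for v in r)
q_ljung_box = T * (T + 2) * sum(v ** 2 / (T - j) for j, v in enumerate(r, start=1))
lm = T * 0.598223  # T * R^2 from the auxiliary regression

critical_value = 11.07  # chi-squared critical value quoted in the text (5 degrees of freedom)
print(round(q_box_pierce, 3), round(q_ljung_box, 3), round(lm, 3))
```

    All three statistics exceed 11.07, in line with the conclusion drawn above.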

    FIGURE 12.4  Least Squares Residuals. [Plot of the least squares residuals (E) against Year, 1959 to 1999; vertical axis from -.075 to .075.]


    In each of these cases, the autocorrelation coefficient is reestimated using the FGLS residuals. This recomputed value is what appears in the table.

    One might want to examine the residuals after estimation to ascertain whether the AR(1) model is appropriate. In the results above, there are two large autocorrelation coefficients listed with the residual-based tests, and in computing the LM statistic, we found that the first two coefficients were statistically significant. If the AR(1) model is appropriate, then one should find that only the coefficient on the first lagged residual is statistically significant in this auxiliary, second-step regression. Another indicator is provided by the FGLS residuals themselves. After computing the FGLS regression, the estimated residuals, $\hat{\varepsilon}_t = y_t - x_t'\hat{\beta}$, will still be autocorrelated. In our results using the Prais-Winsten estimates, the autocorrelation of the FGLS residuals is 0.865. The associated Durbin-Watson statistic is 0.278. This is to be expected. However, if the model is correct, then the transformed residuals $\hat{u}_t = \hat{\varepsilon}_t - \hat{\rho}\hat{\varepsilon}_{t-1}$ should be at least close to nonautocorrelated. But, for our data, the autocorrelation of the adjusted residuals is 0.438 with a Durbin-Watson statistic of 1.125. It appears on this basis that, in fact, the AR(1) model has not completed the specification.

    The results noted earlier suggest that an AR(2) process might better characterize the disturbances in this model. Simple regression of the least squares residuals on a constant and two lagged values (the two-period counterpart to a method of obtaining r in the AR(1) model) produces slope coefficients of 0.9936319 and -0.4620284 (footnote 15). The GLS transformations for the AR(2) model are given in (12-30). We recomputed the regression using the AR(2) transformation and these two coefficients. These are the final results shown in Table 12.2. They do bring a substantial change in the results. As an additional check on the adequacy of the model, we now computed the corrected FGLS residuals from the AR(2) model, $\hat{u}_t = \hat{\varepsilon}_t - \hat{\theta}_1\hat{\varepsilon}_{t-1} - \hat{\theta}_2\hat{\varepsilon}_{t-2}$. The first five autocorrelations of these residuals are 0.132, 0.134, 0.016, 0.022, and 0.118. The Box-Pierce and Box-Ljung statistics are 1.605 and 1.857, which are far from statistically significant. We thus conclude that the AR(2) model accounts for the autocorrelation in the data.

    The preceding suggests how one might discover the appropriate model for autocorrelation in a regression model. However, it is worth keeping in mind that the source of the autocorrelation might itself be discernible in the data. The finding of an AR(2) process may still suggest that the regression specification is incomplete or inadequate in some way.
    15 In fitting an AR(1) model, the stationarity condition is obvious; $|r|$ must be less than one. For an AR(2) process, the condition is less than obvious. We will examine this issue in Chapter 20. For the present, we merely state the result; the two values $(1/2)\left[\theta_1 \pm (\theta_1^2 + 4\theta_2)^{1/2}\right]$ must be less than one in absolute value. Since the term in parentheses might be negative, the roots might be a complex pair, $a \pm bi$, in which case $a^2 + b^2$ must be less than one. You can verify that the two complex roots for our process above are indeed inside the unit circle.
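    The verification suggested in footnote 15 can be sketched as follows. The second coefficient is taken to be negative here, an assumption that follows from the footnote's statement that the roots are complex, which requires $\theta_1^2 + 4\theta_2 < 0$. The function name is hypothetical.

```python
import cmath

def ar2_roots(theta1, theta2):
    """The two values (1/2)[theta1 +/- (theta1^2 + 4 theta2)^(1/2)], possibly complex."""
    disc = cmath.sqrt(theta1 ** 2 + 4.0 * theta2)
    return (theta1 + disc) / 2.0, (theta1 - disc) / 2.0

# Slope coefficients from the residual regression, with theta2 taken as negative (assumption).
roots = ar2_roots(0.9936319, -0.4620284)
moduli = [abs(z) for z in roots]
print(roots, moduli)
```

    For a complex pair $a \pm bi$, the squared modulus $a^2 + b^2$ equals $-\theta_2$, so the condition reduces to $-\theta_2 < 1$ in that case.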

    12.9.4 ESTIMATION WITH A LAGGED DEPENDENT VARIABLE

    In Section 12.5.1, we considered the problem of estimation by least squares when the model contains both autocorrelation and lagged dependent variable(s). Since the OLS estimator is inconsistent, the residuals on which an estimator of $\rho$ would be based are likewise inconsistent. Therefore, $\hat{\rho}$ will be inconsistent as well. The consequence is that the FGLS estimators described earlier are not usable in this case. There is, however, an alternative way to proceed, based on the method of instrumental variables. The method of instrumental variables was introduced in Section 5.4. To review, the general problem is that in the regression model, if

    $$\operatorname{plim}(1/T)X'\varepsilon \neq 0,$$

    then the least squares estimator is not consistent. A consistent estimator is

    $$b_{IV} = (Z'X)^{-1}(Z'y),$$

    where Z is a set of K variables chosen such that $\operatorname{plim}(1/T)Z'\varepsilon = 0$ but $\operatorname{plim}(1/T)Z'X \neq 0$. For the purpose of consistency only, any such set of instrumental variables will suffice. The relevance of that here is that the obstacle to consistent FGLS, at least for the present, is the lack of a consistent estimator of $\rho$. By using the technique of instrumental variables, we may estimate $\beta$ consistently, then estimate $\rho$ and proceed.

    Hatanaka (1974, 1976) has devised an efficient two-step estimator based on this principle. To put the estimator in the current context, we consider estimation of the model

    $$y_t = x_t'\beta + \gamma y_{t-1} + \varepsilon_t,$$

    $$\varepsilon_t = \rho \varepsilon_{t-1} + u_t.$$

    To get to the second step of FGLS, we require a consistent estimator of the slope parameters. These estimates can be obtained using an IV estimator, where the column of Z corresponding to $y_{t-1}$ is the only one that need be different from that of X. An appropriate instrument can be obtained by using the fitted values in the regression of $y_t$ on $x_t$ and $x_{t-1}$. The residuals from the IV regression are then used to construct

    $$\hat{\rho} = \frac{\sum_{t=3}^{T} \hat{\varepsilon}_t \hat{\varepsilon}_{t-1}}{\sum_{t=3}^{T} \hat{\varepsilon}_t^2},$$

    where

    $$\hat{\varepsilon}_t = y_t - b_{IV}'x_t - c_{IV}\, y_{t-1}.$$

    FGLS estimates may now be computed by regressing $y_{*t} = y_t - \hat{\rho} y_{t-1}$ on

    $$x_{*t} = x_t - \hat{\rho} x_{t-1},$$

    $$y_{*t-1} = y_{t-1} - \hat{\rho} y_{t-2},$$

    $$\hat{\varepsilon}_{t-1} = y_{t-1} - b_{IV}'x_{t-1} - c_{IV}\, y_{t-2}.$$

    Let d be the coefficient on $\hat{\varepsilon}_{t-1}$ in this regression. The efficient estimator of $\rho$ is

    $$\hat{\hat{\rho}} = \hat{\rho} + d.$$

    Appropriate asymptotic standard errors for the estimators, including $\hat{\hat{\rho}}$, are obtained from the $s^2[X_*'X_*]^{-1}$ computed at the second step. Hatanaka shows that these estimators are asymptotically equivalent to maximum likelihood estimators.


    12.10 COMMON FACTORS

    We saw in Example 12.2 that misspecification of an equation could create the appearance of serially correlated disturbances when, in fact, there are none. An orthodox (perhaps somewhat optimistic) purist might argue that autocorrelation is always an artifact of misspecification. Although this view might be extreme [see, e.g., Hendry (1980) for a more moderate, but still strident statement], it does suggest a useful point. It might be useful if we could examine the specification of a model statistically with this consideration in mind. The test for common factors is such a test. [See, as well, the aforementioned paper by Mizon (1995).]

    The assumption that the correctly specified model is

    $$y_t = x_t'\beta + \varepsilon_t, \qquad \varepsilon_t = \rho \varepsilon_{t-1} + u_t, \qquad t = 1, \ldots, T,$$

    implies the reduced form,

    $$M_0\colon\; y_t = \rho y_{t-1} + (x_t - \rho x_{t-1})'\beta + u_t, \qquad t = 2, \ldots, T,$$

    where $u_t$ is free from serial correlation. The second of these is actually a restriction on the model

    $$M_1\colon\; y_t = \rho y_{t-1} + x_t'\beta + x_{t-1}'\alpha + u_t, \qquad t = 2, \ldots, T,$$

    in which, once again, $u_t$ is a classical disturbance. The second model contains 2K + 1 parameters, but if the model is correct, then $\alpha = -\rho\beta$ and there are only K + 1 parameters and K restrictions. Both $M_0$ and $M_1$ can be estimated by least squares, although $M_0$ is a nonlinear model. One might then test the restrictions of $M_0$ using an F test. This test will be valid asymptotically, although its exact distribution in finite samples will not be precisely F. In large samples, KF will converge to a chi-squared statistic, so we use the F distribution as usual to be conservative. There is a minor practical complication in implementing this test. Some elements of $\alpha$ may not be estimable. For example, if $x_t$ contains a constant term, then the one in $\alpha$ is unidentified. If $x_t$ contains both current and lagged values of a variable, then the one-period lagged value will appear twice in $M_1$, once in $x_t$ as the lagged value and once in $x_{t-1}$ as the current value. There are other combinations that will be problematic, so the actual number of restrictions that appear in the test is reduced to the number of identified parameters in $\alpha$.
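    Example 12.5 Tests for Common Factors

    We will examine the gasoline demand model of Example 12.2 and consider a simplified version of the equation

    $$\ln\frac{G_t}{pop_t} = \beta_1 + \beta_2 \ln P_{G,t} + \beta_3 \ln\frac{I_t}{pop_t} + \beta_4 \ln P_{NC,t} + \beta_5 \ln P_{UC,t} + \varepsilon_t.$$

    If the AR(1) model is appropriate for $\varepsilon_t$, then the restricted model,

    $$\ln\frac{G_t}{pop_t} = \beta_1 + \beta_2\left(\ln P_{G,t} - \rho \ln P_{G,t-1}\right) + \beta_3\left(\ln\frac{I_t}{pop_t} - \rho \ln\frac{I_{t-1}}{pop_{t-1}}\right) + \beta_4\left(\ln P_{NC,t} - \rho \ln P_{NC,t-1}\right) + \beta_5\left(\ln P_{UC,t} - \rho \ln P_{UC,t-1}\right) + \rho \ln\frac{G_{t-1}}{pop_{t-1}} + u_t,$$

    with six free coefficients will not significantly degrade the fit of the unrestricted model, which has 10 free coefficients. The F statistic, with 4 and 25 degrees of freedom, for this test equals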


    4.311, which is larger than the critical value of 2.76. Thus, we would conclude that the AR(1) model would not be appropriate for this specification and these data. Note that we reached the same conclusion after a more conventional analysis of the residuals in the application in Section 12.9.3.

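    12.11 FORECASTING IN THE PRESENCE OF AUTOCORRELATION

    For purposes of forecasting, we refer first to the transformed model,

    $$y_{*t} = x_{*t}'\beta + \varepsilon_{*t}.$$

    Suppose that the process generating $\varepsilon_t$ is an AR(1) and that $\rho$ is known. Since this model is a classical regression model, the results of Section 6.6 may be used. The optimal forecast of $y^0_{*T+1}$, given $x^0_{T+1}$ and $x_T$ (i.e., $x^0_{*T+1} = x^0_{T+1} - \rho x_T$), is

    $$\hat{y}^0_{*T+1} = x^{0\prime}_{*T+1}\hat{\beta}.$$

    Disassembling $\hat{y}^0_{*T+1}$, we find that

    $$\hat{y}^0_{T+1} - \rho y_T = x^{0\prime}_{T+1}\hat{\beta} - \rho x_T'\hat{\beta}$$

    or

    $$\hat{y}^0_{T+1} = x^{0\prime}_{T+1}\hat{\beta} + \rho(y_T - x_T'\hat{\beta}) = x^{0\prime}_{T+1}\hat{\beta} + \rho e_T. \tag{12-34}$$

    Thus, we carry forward a proportion $\rho$ of the estimated disturbance in the preceding period. This step can be justified by reference to

    $$E[\varepsilon_{T+1} \mid \varepsilon_T] = \rho \varepsilon_T.$$

    It can also be shown that to forecast n periods ahead, we would use

    $$\hat{y}^0_{T+n} = x^{0\prime}_{T+n}\hat{\beta} + \rho^n e_T.$$

    The extension to higher-order autoregressions is direct. For a second-order model, for example,

    $$\hat{y}^0_{T+n} = \hat{\beta}'x^0_{T+n} + \theta_1 e_{T+n-1} + \theta_2 e_{T+n-2}. \tag{12-35}$$

    For residuals that are outside the sample period, we use the recursion

    $$e_s = \theta_1 e_{s-1} + \theta_2 e_{s-2}, \tag{12-36}$$

    beginning with the last two residuals within the sample.

    Moving average models are somewhat simpler, as the autocorrelation lasts for only Q periods. For an MA(1) model, for the first postsample period,

    $$\hat{y}^0_{T+1} = x^{0\prime}_{T+1}\hat{\beta} + \hat{\varepsilon}_{T+1},$$

    where

    $$\hat{\varepsilon}_{T+1} = \hat{u}_{T+1} - \lambda \hat{u}_T.$$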


    Therefore, a forecast of $\varepsilon_{T+1}$ will use all previous residuals. One way to proceed is to accumulate $\hat{\varepsilon}_{T+1}$ from the recursion

    $$\hat{u}_t = \hat{\varepsilon}_t + \lambda \hat{u}_{t-1}$$

    with $\hat{u}_{T+1} = \hat{u}_0 = 0$ and $\hat{\varepsilon}_t = (y_t - x_t'\hat{\beta})$. After the first postsample period,

    $$\hat{\varepsilon}_{T+n} = \hat{u}_{T+n} - \lambda \hat{u}_{T+n-1} = 0.$$

    If the parameters of the disturbance process are known, then the variances for the forecast errors can be computed using the results of Section 6.6. For an AR(1) disturbance, the estimated variance would be

    $$s_f^2 = \hat{\sigma}_\varepsilon^2 + (x_t - \rho x_{t-1})'\left\{\text{Est. Var}[\hat{\beta}]\right\}(x_t - \rho x_{t-1}). \tag{12-37}$$

    For a higher-order process, it is only necessary to modify the calculation of $x_{*t}$ accordingly. The forecast variances for an MA(1) process are somewhat more involved. Details may be found in Judge et al. (1985) and Hamilton (1994). If the parameters of the disturbance process, $\rho$, $\lambda$, $\theta_j$, and so on, are estimated as well, then the forecast variance will be greater. For an AR(1) model, the necessary correction to the forecast variance of the n-period-ahead forecast error is $\hat{\sigma}_\varepsilon^2 n^2 \rho^{2(n-1)}/T$. [For a one-period-ahead forecast, this merely adds a term, $\hat{\sigma}_\varepsilon^2/T$, in the brackets in (12-37).] Higher-order AR and MA processes are analyzed in Baillie (1979). Finally, if the regressors are stochastic, the expressions become more complex by another order of magnitude.

    If $\rho$ is known, then (12-34) provides the best linear unbiased forecast of $y_{t+1}$ (footnote 16). If, however, $\rho$ must be estimated, then this assessment must be modified. There is information about $\varepsilon_{t+1}$ embodied in $e_t$. Having to estimate $\rho$, however, implies that some or all the value of this information is offset by the variation introduced into the forecast by including the stochastic component $\hat{\rho} e_t$ (footnote 17). Whether (12-34) is preferable to the obvious expedient $\hat{y}^0_{T+n} = \hat{\beta}'x^0_{T+n}$ in a small sample when $\rho$ is estimated remains to be settled.
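    The AR(1) and second-order forecasting rules of this section can be sketched as follows. The regression forecasts $x^{0\prime}_{T+n}\hat{\beta}$ are taken as given inputs, and the function names and numbers are hypothetical.

```python
def ar1_forecast(xb_future, e_T, rho):
    """Add rho**n * e_T to each regression forecast x'b for n = 1, 2, ..."""
    return [xb + rho ** n * e_T for n, xb in enumerate(xb_future, start=1)]

def ar2_forecast(xb_future, e_last_two, theta1, theta2):
    """Carry residuals forward with e_s = theta1 * e_{s-1} + theta2 * e_{s-2}."""
    e_prev2, e_prev1 = e_last_two  # (e_{T-1}, e_T)
    forecasts = []
    for xb in xb_future:
        e_next = theta1 * e_prev1 + theta2 * e_prev2
        forecasts.append(xb + e_next)
        e_prev2, e_prev1 = e_prev1, e_next
    return forecasts

# Hypothetical inputs: regression forecasts of 10.0 in each period, last residual e_T = 2.0.
f_ar1 = ar1_forecast([10.0, 10.0], e_T=2.0, rho=0.5)
f_ar2 = ar2_forecast([10.0, 10.0], e_last_two=(9.9, 2.0), theta1=0.5, theta2=0.0)
print(f_ar1, f_ar2)
```

    With $\theta_2 = 0$ the second-order recursion collapses to the AR(1) rule, which gives a simple consistency check on the two functions.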

    12.12 SUMMARY AND CONCLUSIONS

    This chapter has examined the generalized regression model with serial correlation in the disturbances. We began with some general results on analysis of time-series data. When we consider dependent observations and serial correlation, the laws of large numbers and central limit theorems used to analyze independent observations no longer suffice. We presented some useful tools which extend these results to time-series settings. We then considered estimation and testing in the presence of autocorrelation. As usual, OLS is consistent but inefficient. The Newey-West estimator is a robust estimator for the asymptotic covariance matrix of the OLS estimator. This pair of estimators also constitutes the GMM estimator for the regression model with autocorrelation. We then considered two-step feasible generalized least squares and maximum likelihood estimation for the special case usually analyzed by practitioners, the AR(1) model. The
    16 See Goldberger (1962).

    17 See Baillie (1979).


    model with a correction for autocorrelation is a restriction on a more general model with lagged values of both dependent and independent variables. We considered a means of testing this specification as an alternative to fixing the problem of autocorrelation.

    Key Terms and Concepts

    • AR(1)
    • Asymptotic negligibility
    • Asymptotic normality
    • Autocorrelation
    • Autocorrelation matrix
    • Autocovariance
    • Autocovariance matrix
    • Autoregressive form
    • Cochrane-Orcutt estimator
    • Common factor model
    • Covariance stationarity
    • Durbin-Watson test
    • Ergodicity
    • Ergodic Theorem
    • Expectations-augmented Phillips curve
    • First-order autoregression
    • GMM estimator
    • Initial conditions
    • Innovation
    • Lagrange multiplier test
    • Martingale sequence
    • Martingale difference sequence
    • Moving average form
    • Moving average process
    • Partial difference
    • Prais-Winsten estimator
    • Pseudo differences
    • Q test
    • Quasi differences
    • Stationarity
    • Summability
    • Time-series process
    • Time window
    • Weakly stationary
    • White noise
    • Yule-Walker equations

    Exercises

    1. Does first differencing reduce autocorrelation? Consider the models $y_t = \beta'x_t + \varepsilon_t$, where $\varepsilon_t = \rho\varepsilon_{t-1} + u_t$ and $\varepsilon_t = u_t - \lambda u_{t-1}$. Compare the autocorrelation of $\varepsilon_t$ in the original model with that of $v_t$ in $y_t - y_{t-1} = \beta'(x_t - x_{t-1}) + v_t$, where $v_t = \varepsilon_t - \varepsilon_{t-1}$.

    2. Derive the disturbance covariance matrix for the model

    $$y_t = \beta'x_t + \varepsilon_t, \qquad \varepsilon_t = \rho\varepsilon_{t-1} + u_t - \lambda u_{t-1}.$$

    What parameter is estimated by the regression of the OLS residuals on their lagged values?

    3. The following regression is obtained by ordinary least squares, using 21 observations. (Estimated asymptotic standard errors are shown in parentheses.)

    $$y_t = \underset{(0.3)}{1.3} + \underset{(0.18)}{0.97}\, y_{t-1} + \underset{(1.04)}{2.31}\, x_t, \qquad D\text{-}W = 1.21.$$

    Test for the presence of autocorrelation in the disturbances.

    4. It is commonly asserted that the Durbin-Watson statistic is only appropriate for testing for first-order autoregressive disturbances. What combination of the coefficients of the model is estimated by the Durbin-Watson statistic in each of the following cases: AR(1), AR(2), MA(1)? In each case, assume that the regression model does not contain a lagged dependent variable. Comment on the impact on your results of relaxing this assumption.

    5. The data used to fit the expectations-augmented Phillips curve in Example 12.3 are given in Table F5.1. Using these data, reestimate the model given in the example. Carry out a formal test for first-order autocorrelation using the LM statistic. Then, reestimate the model using an AR(1) model for the disturbance process. Since the sample is large, the Prais-Winsten and Cochrane-Orcutt estimators should


    give essentially the same answer. Do they? After fitting the model, obtain the transformed residuals and examine them for first-order autocorrelation. Does the AR(1) model appear to have adequately fixed the problem?

    6. Data for fitting an improved Phillips curve model can be obtained from many sources, including the Bureau of Economic Analysis's (BEA) own website, Economagic.com, and so on. Obtain the necessary data and expand the model of Example 12.3. Does adding additional explanatory variables to the model reduce the extreme pattern of the OLS residuals that appears in Figure 12.3?

    Greene 50240

    book

    June 18 2002

    15 28

    13

    MODELS FOR PANEL DATA

    13.1 INTRODUCTION

    Data sets that combine time series and cross sections are common in economics. For example, the published statistics of the OECD contain numerous series of economic aggregates observed yearly for many countries. Recently constructed longitudinal data sets contain observations on thousands of individuals or families, each observed at several points in time. Other empirical studies have analyzed time-series data on sets of firms, states, countries, or industries simultaneously. These data sets provide rich sources of information about the economy. Modeling in this setting, however, calls for some complex stochastic specifications. In this chapter, we will survey the most commonly used techniques for time-series cross-section data analyses in single equation models.

    13.2 PANEL DATA MODELS

    Many recent studies have analyzed panel, or longitudinal, data sets. Two very famous ones are the National Longitudinal Survey of Labor Market Experience (NLS) and the Michigan Panel Study of Income Dynamics (PSID). In these data sets, very large cross sections, consisting of thousands of microunits, are followed through time, but the number of periods is often quite small. The PSID, for example, is a study of roughly 6,000 families and 15,000 individuals who have been interviewed periodically from 1968 to the present. Another group of intensively studied panel data sets were those from the negative income tax experiments of the early 1970s, in which thousands of families were followed for 8 or 13 quarters. Constructing long, evenly spaced time series in contexts such as these would be prohibitively expensive, but for the purposes for which these data are typically used, it is unnecessary. Time effects are often viewed as "transitions" or discrete changes of state. They are typically modeled as specific to the period in which they occur and are not carried across periods within a cross-sectional unit.[1] Panel data sets are more oriented toward cross-section analyses; they are wide but typically short. Heterogeneity across units is an integral part, indeed often the central focus, of the analysis.

    [1] Theorists have not been deterred from devising autocorrelation models applicable to panel data sets, though. See, for example, Lee (1978) or Park, Sickles, and Simar (2000). As a practical matter, however, the empirical literature in this field has focused on cross-sectional variation and less intricate time-series models. Formal time-series modeling of the sort discussed in Chapter 12 is somewhat unusual in the analysis of longitudinal data.


    The analysis of panel or longitudinal data is the subject of one of the most active and innovative bodies of literature in econometrics,[2] partly because panel data provide such a rich environment for the development of estimation techniques and theoretical results. In more practical terms, however, researchers have been able to use time-series cross-sectional data to examine issues that could not be studied in either cross-sectional or time-series settings alone. Two examples are as follows.

    1. In a widely cited study of labor supply, Ben-Porath (1973) observes that at a certain point in time, in a cohort of women, 50 percent may appear to be working. It is ambiguous whether this finding implies that, in this cohort, one-half of the women on average will be working or that the same one-half will be working in every period. These have very different implications for policy and for the interpretation of any statistical results. Cross-sectional data alone will not shed any light on the question.

    2. A long-standing problem in the analysis of production functions has been the inability to separate economies of scale and technological change.[3] Cross-sectional data provide information only about the former, whereas time-series data muddle the two effects, with no prospect of separation. It is common, for example, to assume constant returns to scale so as to reveal the technical change.[4] Of course, this practice assumes away the problem. A panel of data on costs or output for a number of firms, each observed over several years, can provide estimates of both the rate of technological change (as time progresses) and economies of scale (for the sample of different sized firms at each point in time).

    In principle, the methods of Chapter 12 can be applied to longitudinal data sets. In the typical panel, however, there are a large number of cross-sectional units and only a few periods. Thus, the time-series methods discussed there may be somewhat problematic. Recent work has generally concentrated on models better suited to these short and wide data sets. The techniques are focused on cross-sectional variation, or heterogeneity. In this chapter, we shall examine in detail the most widely used models and look briefly at some extensions. The fundamental advantage of a panel data set over a cross section is that it will allow the researcher great flexibility in modeling differences in behavior across individuals.
    [2] The panel data literature rivals the received research on unit roots and cointegration in econometrics in its rate of growth. A compendium of the earliest literature is Maddala (1993). Book-length surveys on the econometrics of panel data include Hsiao (1986), Dielman (1989), Matyas and Sevestre (1996), Raj and Baltagi (1992), and Baltagi (1995). There are also lengthy surveys devoted to specific topics, such as limited dependent variable models [Hsiao, Lahiri, Lee, and Pesaran (1999)] and semiparametric methods [Lee (1998)]. An extensive bibliography is given in Baltagi (1995).

    [3] The distinction between these two effects figured prominently in the policy question of whether it was appropriate to break up the AT&T Corporation in the 1980s and, ultimately, to allow competition in the provision of long-distance telephone service.

    [4] In a classic study of this issue, Solow (1957) states: "From time series of $\dot{Q}/Q$, $w_K$, $\dot{K}/K$, $w_L$ and $\dot{L}/L$ or their discrete year-to-year analogues, we could estimate $\dot{A}/A$ and thence $A(t)$ itself. Actually an amusing thing happens here. Nothing has been said so far about returns to scale. But if all factor inputs are classified either as K or L, then the available figures always show $w_K$ and $w_L$ adding up to one. Since we have assumed that factors are paid their marginal products, this amounts to assuming the hypothesis of Euler's theorem. The calculus being what it is, we might just as well assume the conclusion, namely, the F is homogeneous of degree one."


    The basic framework for this discussion is a regression model of the form

    $$y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \mathbf{z}_i'\boldsymbol{\alpha} + \varepsilon_{it}. \tag{13-1}$$

    There are K regressors in $\mathbf{x}_{it}$, not including a constant term. The heterogeneity, or individual effect, is $\mathbf{z}_i'\boldsymbol{\alpha}$, where $\mathbf{z}_i$ contains a constant term and a set of individual or group specific variables, which may be observed, such as race, sex, location, and so on, or unobserved, such as family specific characteristics, individual heterogeneity in skill or preferences, and so on, all of which are taken to be constant over time t. As it stands, this model is a classical regression model. If $\mathbf{z}_i$ is observed for all individuals, then the entire model can be treated as an ordinary linear model and fit by least squares. The various cases we will consider are:

    1. Pooled Regression: If $\mathbf{z}_i$ contains only a constant term, then ordinary least squares provides consistent and efficient estimates of the common $\alpha$ and the slope vector $\boldsymbol{\beta}$.

    2. Fixed Effects: If $\mathbf{z}_i$ is unobserved, but correlated with $\mathbf{x}_{it}$, then the least squares estimator of $\boldsymbol{\beta}$ is biased and inconsistent as a consequence of an omitted variable. However, in this instance, the model

    $$y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \alpha_i + \varepsilon_{it},$$

    where $\alpha_i = \mathbf{z}_i'\boldsymbol{\alpha}$, embodies all the observable effects and specifies an estimable conditional mean. This fixed effects approach takes $\alpha_i$ to be a group-specific constant term in the regression model. It should be noted that the term "fixed" as used here indicates that the term does not vary over time, not that it is nonstochastic, which need not be the case.

    3. Random Effects: If the unobserved individual heterogeneity, however formulated, can be assumed to be uncorrelated with the included variables, then the model may be formulated as

    $$y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + E[\mathbf{z}_i'\boldsymbol{\alpha}] + \{\mathbf{z}_i'\boldsymbol{\alpha} - E[\mathbf{z}_i'\boldsymbol{\alpha}]\} + \varepsilon_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \alpha + u_i + \varepsilon_{it},$$

    that is, as a linear regression model with a compound disturbance that may be consistently, albeit inefficiently, estimated by least squares. This random effects approach specifies that $u_i$ is a group specific random element, similar to $\varepsilon_{it}$ except that for each group, there is but a single draw that enters the regression identically in each period. Again, the crucial distinction between these two cases is whether the unobserved individual effect embodies elements that are correlated with the regressors in the model, not whether these effects are stochastic or not. We will examine this basic formulation, then consider an extension to a dynamic model.

    4. Random Parameters: The random effects model can be viewed as a regression model with a random constant term. With a sufficiently rich data set, we may extend this idea to a model in which the other coefficients vary randomly across individuals as well. The extension of the model might appear as

    $$y_{it} = \mathbf{x}_{it}'(\boldsymbol{\beta} + \mathbf{h}_i) + (\alpha + u_i) + \varepsilon_{it},$$

    where $\mathbf{h}_i$ is a random vector which induces the variation of the parameters across


    individuals. This random parameters model was proposed quite early in this literature, but has only fairly recently enjoyed widespread attention in several fields. It represents a natural extension in which researchers broaden the amount of heterogeneity across individuals while retaining some commonalities: the parameter vectors still share a common mean. Some recent applications have extended this yet another step by allowing the mean value of the parameter distribution to be person-specific, as in

    $$y_{it} = \mathbf{x}_{it}'(\boldsymbol{\beta} + \boldsymbol{\Delta}\mathbf{z}_i + \mathbf{h}_i) + (\alpha + u_i) + \varepsilon_{it},$$

    where $\mathbf{z}_i$ is a set of observable, person specific variables, and $\boldsymbol{\Delta}$ is a matrix of parameters to be estimated. As we will examine later, this hierarchical model is extremely versatile.

    5. Covariance Structures: Lastly, we will reconsider the source of the heterogeneity in the model. In some settings, researchers have concluded that a preferable approach to modeling heterogeneity in the regression model is to layer it into the variation around the conditional mean, rather than in the placement of the mean. In a cross-country comparison of economic performance over time, Alvarez, Garrett, and Lange (1991) estimated a model of the form

    $$y_{it} = f(\text{labor organization}_{it}, \text{political organization}_{it}) + \varepsilon_{it},$$

    in which the regression function was fully specified by the linear part, $\mathbf{x}_{it}'\boldsymbol{\beta} + \alpha$, but the variance of $\varepsilon_{it}$ differed across countries. Beck et al. (1993) found evidence that the substantive conclusions of the study were dependent on the stochastic specification and on the methods used for estimation.
    Example 13.1  Cost Function for Airline Production

    To illustrate the computations for the various panel data models, we will revisit the airline cost data used in Example 7.2. This is a panel data study of a group of U.S. airlines. We will fit a simple model for the total cost of production:

    $$\ln \text{cost}_{it} = \beta_1 + \beta_2 \ln \text{output}_{it} + \beta_3 \ln \text{fuel price}_{it} + \beta_4\,\text{load factor}_{it} + \varepsilon_{it}.$$

    Output is measured in "revenue passenger miles." The load factor is a rate of capacity utilization; it is the average rate at which seats on the airline's planes are filled. More complete models of costs include other factor prices (materials, capital) and, perhaps, a quadratic term in log output to allow for variable economies of scale. We have restricted the cost function to these few variables to provide a straightforward illustration. Ordinary least squares regression produces the following results. Estimated standard errors are given in parentheses.

    $$\ln \text{cost}_{it} = \underset{(0.22924)}{9.5169} + \underset{(0.013255)}{0.88274}\,\ln \text{output}_{it} + \underset{(0.020304)}{0.45398}\,\ln \text{fuel price}_{it} - \underset{(0.34540)}{1.62751}\,\text{load factor}_{it} + \varepsilon_{it},$$

    $$R^2 = 0.9882898, \qquad s^2 = 0.015528, \qquad \mathbf{e}'\mathbf{e} = 1.335442193.$$

    The results so far are what one might expect. There are substantial economies of scale: e.s.$_{it}$ = (1/0.88274) - 1 = 0.1329. The fuel price and load factors affect costs in the predictable fashions as well. (Fuel prices differ because of different mixes of types of planes and regional differences in supply characteristics.)


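    13.3 FIXED EFFECTS

    This formulation of the model assumes that differences across units can be captured in differences in the constant term.[5] Each $\alpha_i$ is treated as an unknown parameter to be estimated. Let $\mathbf{y}_i$ and $\mathbf{X}_i$ be the T observations for the ith unit, $\mathbf{i}$ be a $T \times 1$ column of ones, and let $\boldsymbol{\varepsilon}_i$ be the associated $T \times 1$ vector of disturbances. Then,

    $$\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{i}\alpha_i + \boldsymbol{\varepsilon}_i.$$

    Collecting these terms gives

    $$\begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_n \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \\ \vdots \\ \mathbf{X}_n \end{bmatrix}\boldsymbol{\beta} + \begin{bmatrix} \mathbf{i} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{i} & \cdots & \mathbf{0} \\ \vdots & & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{i} \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n \end{bmatrix} + \begin{bmatrix} \boldsymbol{\varepsilon}_1 \\ \boldsymbol{\varepsilon}_2 \\ \vdots \\ \boldsymbol{\varepsilon}_n \end{bmatrix}$$

    or

    $$\mathbf{y} = [\mathbf{X} \;\; \mathbf{d}_1 \;\; \mathbf{d}_2 \;\; \cdots \;\; \mathbf{d}_n] \begin{bmatrix} \boldsymbol{\beta} \\ \boldsymbol{\alpha} \end{bmatrix} + \boldsymbol{\varepsilon}, \tag{13-2}$$

    where $\mathbf{d}_i$ is a dummy variable indicating the ith unit. Let the $nT \times n$ matrix $\mathbf{D} = [\mathbf{d}_1 \;\; \mathbf{d}_2 \;\; \cdots \;\; \mathbf{d}_n]$. Then, assembling all nT rows gives

    $$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{D}\boldsymbol{\alpha} + \boldsymbol{\varepsilon}. \tag{13-3}$$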

    This model is usually referred to as the least squares dummy variable (LSDV) model (although the "least squares" part of the name refers to the technique usually used to estimate it, not to the model itself).

    This model is a classical regression model, so no new results are needed to analyze it. If n is small enough, then the model can be estimated by ordinary least squares with K regressors in $\mathbf{X}$ and n columns in $\mathbf{D}$, as a multiple regression with K + n parameters. Of course, if n is thousands, as is typical, then this model is likely to exceed the storage capacity of any computer. But, by using familiar results for a partitioned regression, we can reduce the size of the computation.[6] We write the least squares estimator of $\boldsymbol{\beta}$ as

    $$\mathbf{b} = [\mathbf{X}'\mathbf{M}_D\mathbf{X}]^{-1}[\mathbf{X}'\mathbf{M}_D\mathbf{y}], \tag{13-4}$$

    where

    $$\mathbf{M}_D = \mathbf{I} - \mathbf{D}(\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}'.$$

    This amounts to a least squares regression using the transformed data $\mathbf{X}_* = \mathbf{M}_D\mathbf{X}$ and

    [5] It is also possible to allow the slopes to vary across i, but this method introduces some new methodological issues, as well as considerable complexity in the calculations. A study on the topic is Cornwell and Schmidt (1984). Also, the assumption of a fixed T is only for convenience. The more general case in which $T_i$ varies across units is considered later, in the exercises, and in Greene (1995a).

    [6] See Theorem 3.3.


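    $\mathbf{y}_* = \mathbf{M}_D\mathbf{y}$. The structure of $\mathbf{D}$ is particularly convenient; its columns are orthogonal, so

    $$\mathbf{M}_D = \begin{bmatrix} \mathbf{M}^0 & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{M}^0 & \cdots & \mathbf{0} \\ \vdots & & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{M}^0 \end{bmatrix}.$$

    Each matrix on the diagonal is

    $$\mathbf{M}^0 = \mathbf{I}_T - \frac{1}{T}\mathbf{i}\mathbf{i}'. \tag{13-5}$$

    Premultiplying any $T \times 1$ vector $\mathbf{z}_i$ by $\mathbf{M}^0$ creates $\mathbf{M}^0\mathbf{z}_i = \mathbf{z}_i - \bar{z}_i\mathbf{i}$. (Note that the mean is taken over only the T observations for unit i.) Therefore, the least squares regression of $\mathbf{M}_D\mathbf{y}$ on $\mathbf{M}_D\mathbf{X}$ is equivalent to a regression of $[y_{it} - \bar{y}_{i.}]$ on $[\mathbf{x}_{it} - \bar{\mathbf{x}}_{i.}]$, where $\bar{y}_{i.}$ and $\bar{\mathbf{x}}_{i.}$ are the scalar and $K \times 1$ vector of means of $y_{it}$ and $\mathbf{x}_{it}$ over the T observations for group i.[7] The dummy variable coefficients can be recovered from the other normal equation in the partitioned regression:

    $$\mathbf{D}'\mathbf{D}\mathbf{a} + \mathbf{D}'\mathbf{X}\mathbf{b} = \mathbf{D}'\mathbf{y} \quad \text{or} \quad \mathbf{a} = [\mathbf{D}'\mathbf{D}]^{-1}\mathbf{D}'(\mathbf{y} - \mathbf{X}\mathbf{b}).$$

    This implies that for each i,

    $$a_i = \bar{y}_{i.} - \mathbf{b}'\bar{\mathbf{x}}_{i.}. \tag{13-6}$$

    The appropriate estimator of the asymptotic covariance matrix for $\mathbf{b}$ is

    $$\text{Est. Asy. Var}[\mathbf{b}] = s^2[\mathbf{X}'\mathbf{M}_D\mathbf{X}]^{-1}, \tag{13-7}$$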

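    which uses the second moment matrix with x's now expressed as deviations from their respective group means. The disturbance variance estimator is

    $$s^2 = \frac{\sum_{i=1}^{n}\sum_{t=1}^{T}(y_{it} - \mathbf{x}_{it}'\mathbf{b} - a_i)^2}{nT - n - K} = \frac{(\mathbf{M}_D\mathbf{y} - \mathbf{M}_D\mathbf{X}\mathbf{b})'(\mathbf{M}_D\mathbf{y} - \mathbf{M}_D\mathbf{X}\mathbf{b})}{nT - n - K}. \tag{13-8}$$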

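    The itth residual used in this computation is

    $$e_{it} = y_{it} - \mathbf{x}_{it}'\mathbf{b} - a_i = y_{it} - \mathbf{x}_{it}'\mathbf{b} - (\bar{y}_{i.} - \bar{\mathbf{x}}_{i.}'\mathbf{b}) = (y_{it} - \bar{y}_{i.}) - (\mathbf{x}_{it} - \bar{\mathbf{x}}_{i.})'\mathbf{b}.$$

    Thus, the numerator in $s^2$ is exactly the sum of squared residuals using the least squares slopes and the data in group mean deviation form. But, done in this fashion, one might then use nT - K instead of nT - n - K for the denominator in computing $s^2$, so a correction would be necessary. For the individual effects,

    $$\text{Asy. Var}[a_i] = \frac{\sigma^2}{T} + \bar{\mathbf{x}}_{i.}'\{\text{Asy. Var}[\mathbf{b}]\}\bar{\mathbf{x}}_{i.},$$

    so a simple estimator based on $s^2$ can be computed.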
    [7] An interesting special case arises if T = 2. In the two-period case, you can show (we leave it as an exercise) that this least squares regression is done with nT/2 first difference observations, by regressing observation $(y_{i2} - y_{i1})$ (and its negative) on $(\mathbf{x}_{i2} - \mathbf{x}_{i1})$ (and its negative).
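The equivalence between the dummy variable regression and the regression in group-mean deviations can be checked numerically. The following is a minimal sketch on simulated data, assuming a balanced panel; the function name `within_estimator`, the seed, and every parameter value are illustrative choices, not taken from the text.

```python
import numpy as np

def within_estimator(y, X, groups):
    """Fixed effects slopes from group-mean deviations (the M_D transformation)."""
    ids, idx = np.unique(groups, return_inverse=True)
    counts = np.bincount(idx)
    # Group means of y and of each column of X.
    ybar = np.bincount(idx, weights=y) / counts
    Xbar = np.vstack([np.bincount(idx, weights=X[:, k]) / counts
                      for k in range(X.shape[1])]).T
    yd, Xd = y - ybar[idx], X - Xbar[idx]
    b = np.linalg.solve(Xd.T @ Xd, Xd.T @ yd)
    a = ybar - Xbar @ b                      # a_i = ybar_i - b'xbar_i, as in (13-6)
    e = yd - Xd @ b
    n, K = len(ids), X.shape[1]
    s2 = e @ e / (len(y) - n - K)            # denominator nT - n - K, as in (13-8)
    return b, a, s2

# Simulated balanced panel (all values are made up for illustration).
rng = np.random.default_rng(0)
n, T, K = 6, 15, 3
groups = np.repeat(np.arange(n), T)
alpha = rng.normal(size=n)
X = rng.normal(size=(n * T, K)) + alpha[groups, None]   # regressors correlated with alpha_i
y = X @ np.array([1.0, -0.5, 2.0]) + alpha[groups] + 0.1 * rng.normal(size=n * T)

b, a, s2 = within_estimator(y, X, groups)

# Brute-force LSDV: regress y on [X, D] with one dummy column per group.
D = (groups[:, None] == np.arange(n)[None, :]).astype(float)
coef = np.linalg.lstsq(np.hstack([X, D]), y, rcond=None)[0]
print(np.allclose(b, coef[:K]), np.allclose(a, coef[K:]))
```

The within computation never forms the nT x n dummy matrix, which is the point of the partitioned regression when n is large.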

    13.3.1 TESTING THE SIGNIFICANCE OF THE GROUP EFFECTS

    The t ratio for $a_i$ can be used for a test of the hypothesis that $\alpha_i$ equals zero. This hypothesis about one specific group, however, is typically not useful for testing in this regression context. If we are interested in differences across groups, then we can test the hypothesis that the constant terms are all equal with an F test. Under the null hypothesis of equality, the efficient estimator is pooled least squares. The F ratio used for this test is

    $$F(n-1,\; nT-n-K) = \frac{(R^2_{LSDV} - R^2_{Pooled})/(n-1)}{(1 - R^2_{LSDV})/(nT-n-K)}, \tag{13-9}$$

    where LSDV indicates the dummy variable model and Pooled indicates the pooled or restricted model with only a single overall constant term. Alternatively, the model may have been estimated with an overall constant and n - 1 dummy variables instead. All other results (i.e., the least squares slopes, $s^2$, $R^2$) will be unchanged, but rather than estimate $\alpha_i$, each dummy variable coefficient will now be an estimate of $\alpha_i - \alpha_1$, where group "1" is the omitted group. The F test that the coefficients on these n - 1 dummy variables are zero is identical to the one above. It is important to keep in mind, however, that although the statistical results are the same, the interpretation of the dummy variable coefficients in the two formulations is different.[8]
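A short sketch of the F ratio in (13-9) follows. It is an illustration on simulated data, not a reproduction of any result in the text; the helper names `ssr` and `group_effects_F` are hypothetical. The check at the end uses the standard fact that the R-squared form of the statistic equals the form based on restricted and unrestricted sums of squared residuals.

```python
import numpy as np

def ssr(y, Z):
    """Sum of squared least squares residuals from regressing y on Z."""
    e = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return e @ e

def group_effects_F(r2_lsdv, r2_pooled, n, T, K):
    """F[n-1, nT-n-K] statistic for equal constant terms, as in (13-9)."""
    num = (r2_lsdv - r2_pooled) / (n - 1)
    den = (1.0 - r2_lsdv) / (n * T - n - K)
    return num / den

# Simulated balanced panel (made-up values).
rng = np.random.default_rng(5)
n, T, K = 6, 15, 3
g = np.repeat(np.arange(n), T)
X = rng.normal(size=(n * T, K))
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=n)[g] + rng.normal(size=n * T)

D = (g[:, None] == np.arange(n)).astype(float)
tss = np.sum((y - y.mean()) ** 2)
ssr_pooled = ssr(y, np.column_stack([np.ones(n * T), X]))   # single overall constant
ssr_lsdv = ssr(y, np.hstack([X, D]))                        # n group constants

F = group_effects_F(1 - ssr_lsdv / tss, 1 - ssr_pooled / tss, n, T, K)
F_from_ssr = ((ssr_pooled - ssr_lsdv) / (n - 1)) / (ssr_lsdv / (n * T - n - K))
print(F, F_from_ssr)
```

The statistic would then be compared with a critical value from the F table with n - 1 and nT - n - K degrees of freedom.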
    13.3.2 THE WITHIN- AND BETWEEN-GROUPS ESTIMATORS

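    We can formulate a pooled regression model in three ways. First, the original formulation is

    $$y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \alpha + \varepsilon_{it}. \tag{13-10a}$$

    In terms of deviations from the group means,

    $$y_{it} - \bar{y}_{i.} = (\mathbf{x}_{it} - \bar{\mathbf{x}}_{i.})'\boldsymbol{\beta} + \varepsilon_{it} - \bar{\varepsilon}_{i.}, \tag{13-10b}$$

    while in terms of the group means,

    $$\bar{y}_{i.} = \bar{\mathbf{x}}_{i.}'\boldsymbol{\beta} + \alpha + \bar{\varepsilon}_{i.}. \tag{13-10c}$$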

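    All three are classical regression models, and in principle, all three could be estimated, at least consistently if not efficiently, by ordinary least squares. [Note that (13-10c) involves only n observations, the group means.] Consider then the matrices of sums of squares and cross products that would be used in each case, where we focus only on estimation of $\boldsymbol{\beta}$. In (13-10a), the moments would accumulate variation about the overall means, $\bar{\bar{y}}$ and $\bar{\bar{\mathbf{x}}}$, and we would use the total sums of squares and cross products,

    $$\mathbf{S}_{xx}^{total} = \sum_{i=1}^{n}\sum_{t=1}^{T}(\mathbf{x}_{it} - \bar{\bar{\mathbf{x}}})(\mathbf{x}_{it} - \bar{\bar{\mathbf{x}}})' \quad \text{and} \quad \mathbf{S}_{xy}^{total} = \sum_{i=1}^{n}\sum_{t=1}^{T}(\mathbf{x}_{it} - \bar{\bar{\mathbf{x}}})(y_{it} - \bar{\bar{y}}).$$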

    For (13-10b), since the data are in deviations already, the means of $(y_{it} - \bar{y}_{i.})$ and $(\mathbf{x}_{it} - \bar{\mathbf{x}}_{i.})$ are zero. The moment matrices are within-groups (i.e., variation around group means)
    [8] For a discussion of the differences, see Suits (1984).


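    sums of squares and cross products,

    $$\mathbf{S}_{xx}^{within} = \sum_{i=1}^{n}\sum_{t=1}^{T}(\mathbf{x}_{it} - \bar{\mathbf{x}}_{i.})(\mathbf{x}_{it} - \bar{\mathbf{x}}_{i.})' \quad \text{and} \quad \mathbf{S}_{xy}^{within} = \sum_{i=1}^{n}\sum_{t=1}^{T}(\mathbf{x}_{it} - \bar{\mathbf{x}}_{i.})(y_{it} - \bar{y}_{i.}).$$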

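    Finally, for (13-10c), the mean of group means is the overall mean. The moment matrices are the between-groups sums of squares and cross products, that is, the variation of the group means around the overall means:

    $$\mathbf{S}_{xx}^{between} = \sum_{i=1}^{n} T(\bar{\mathbf{x}}_{i.} - \bar{\bar{\mathbf{x}}})(\bar{\mathbf{x}}_{i.} - \bar{\bar{\mathbf{x}}})' \quad \text{and} \quad \mathbf{S}_{xy}^{between} = \sum_{i=1}^{n} T(\bar{\mathbf{x}}_{i.} - \bar{\bar{\mathbf{x}}})(\bar{y}_{i.} - \bar{\bar{y}}).$$

    It is easy to verify that

    $$\mathbf{S}_{xx}^{total} = \mathbf{S}_{xx}^{within} + \mathbf{S}_{xx}^{between} \quad \text{and} \quad \mathbf{S}_{xy}^{total} = \mathbf{S}_{xy}^{within} + \mathbf{S}_{xy}^{between}.$$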

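    Therefore, there are three possible least squares estimators of $\boldsymbol{\beta}$ corresponding to the decomposition. The least squares estimator is

    $$\mathbf{b}^{total} = [\mathbf{S}_{xx}^{total}]^{-1}\mathbf{S}_{xy}^{total} = [\mathbf{S}_{xx}^{within} + \mathbf{S}_{xx}^{between}]^{-1}[\mathbf{S}_{xy}^{within} + \mathbf{S}_{xy}^{between}]. \tag{13-11}$$

    The within-groups estimator is

    $$\mathbf{b}^{within} = [\mathbf{S}_{xx}^{within}]^{-1}\mathbf{S}_{xy}^{within}. \tag{13-12}$$

    This is the LSDV estimator computed earlier. [See (13-4).] An alternative estimator would be the between-groups estimator,

    $$\mathbf{b}^{between} = [\mathbf{S}_{xx}^{between}]^{-1}\mathbf{S}_{xy}^{between} \tag{13-13}$$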

    (sometimes called the group means estimator). This least squares estimator of (13-10c) is based on the n sets of groups means. (Note that we are assuming that n is at least as large as K.) From the preceding expressions (and familiar previous results),

    $$\mathbf{S}_{xy}^{within} = \mathbf{S}_{xx}^{within}\mathbf{b}^{within} \quad \text{and} \quad \mathbf{S}_{xy}^{between} = \mathbf{S}_{xx}^{between}\mathbf{b}^{between}.$$

    Inserting these in (13-11), we see that the least squares estimator is a matrix weighted average of the within- and between-groups estimators:

    $$\mathbf{b}^{total} = \mathbf{F}^{within}\mathbf{b}^{within} + \mathbf{F}^{between}\mathbf{b}^{between}, \tag{13-14}$$

    where

    $$\mathbf{F}^{within} = [\mathbf{S}_{xx}^{within} + \mathbf{S}_{xx}^{between}]^{-1}\mathbf{S}_{xx}^{within} = \mathbf{I} - \mathbf{F}^{between}.$$

    The form of this result resembles the Bayesian estimator in the classical model discussed in Section 16.2. The resemblance is more than passing; it can be shown [see, e.g., Judge (1985)] that

    $$\mathbf{F}^{within} = \{[\text{Asy. Var}(\mathbf{b}^{within})]^{-1} + [\text{Asy. Var}(\mathbf{b}^{between})]^{-1}\}^{-1}[\text{Asy. Var}(\mathbf{b}^{within})]^{-1},$$

    which is essentially the same mixing result we have for the Bayesian estimator. In the weighted average, the estimator with the smaller variance receives the greater weight.
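The decomposition of the total moments and the matrix weighted average in (13-14) can be verified directly. The sketch below uses a small simulated balanced panel with rows sorted by group; the variable names and all numerical values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, K = 8, 5, 2
g = np.repeat(np.arange(n), T)                       # group index, rows sorted by group
X = rng.normal(size=(n * T, K)) + rng.normal(size=(n, K))[g]
y = X @ np.array([0.7, -1.2]) + rng.normal(size=n)[g] + rng.normal(size=n * T)

def group_means(A):
    # Mean over the T rows of each group; relies on the sorted, balanced layout.
    return A.reshape(n, T, -1).mean(axis=1)

Xb, yb = group_means(X), group_means(y)[:, 0]        # group means
Xw, yw = X - Xb[g], y - yb[g]                        # deviations from group means
Xt, yt = X - X.mean(axis=0), y - y.mean()            # deviations from overall means
Xbt, ybt = Xb - X.mean(axis=0), yb - y.mean()        # group means about overall means

S_xx_tot, S_xy_tot = Xt.T @ Xt, Xt.T @ yt
S_xx_w, S_xy_w = Xw.T @ Xw, Xw.T @ yw
S_xx_b, S_xy_b = T * (Xbt.T @ Xbt), T * (Xbt.T @ ybt)

b_tot = np.linalg.solve(S_xx_tot, S_xy_tot)          # (13-11)
b_w = np.linalg.solve(S_xx_w, S_xy_w)                # (13-12)
b_b = np.linalg.solve(S_xx_b, S_xy_b)                # (13-13)

F_w = np.linalg.solve(S_xx_w + S_xx_b, S_xx_w)       # F_within
b_mix = F_w @ b_w + (np.eye(K) - F_w) @ b_b          # (13-14)
print(np.allclose(b_tot, b_mix))
```

Note that the weights are matrices, so an individual element of the total estimator need not lie between the corresponding within and between elements.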

    13.3.3 FIXED TIME AND GROUP EFFECTS

    The least squares dummy variable approach can be extended to include a time-specific effect as well. One way to formulate the extended model is simply to add the time effect, as in

    $$y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \alpha_i + \gamma_t + \varepsilon_{it}. \tag{13-15}$$

    This model is obtained from the preceding one by the inclusion of an additional T - 1 dummy variables. (One of the time effects must be dropped to avoid perfect collinearity: the group effects and time effects both sum to one.) If the number of variables is too large to handle by ordinary regression, then this model can also be estimated by using the partitioned regression.[9] There is an asymmetry in this formulation, however, since each of the group effects is a group specific intercept, whereas the time effects are contrasts, that is, comparisons to a base period (the one that is excluded). A symmetric form of the model is

    $$y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \mu + \alpha_i + \gamma_t + \varepsilon_{it}, \tag{13-15'}$$

    where a full n and T effects are included, but the restrictions

    $$\sum_i \alpha_i = \sum_t \gamma_t = 0$$

    are imposed. Least squares estimates of the slopes in this model are obtained by regression of

    $$y_{*it} = y_{it} - \bar{y}_{i.} - \bar{y}_{.t} + \bar{\bar{y}} \quad \text{on} \quad \mathbf{x}_{*it} = \mathbf{x}_{it} - \bar{\mathbf{x}}_{i.} - \bar{\mathbf{x}}_{.t} + \bar{\bar{\mathbf{x}}}, \tag{13-16}$$

    where the period-specific and overall means are

    $$\bar{y}_{.t} = \frac{1}{n}\sum_{i=1}^{n} y_{it} \quad \text{and} \quad \bar{\bar{y}} = \frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T} y_{it},$$

    and likewise for $\bar{\mathbf{x}}_{.t}$ and $\bar{\bar{\mathbf{x}}}$. The overall constant and the dummy variable coefficients can then be recovered from the normal equations as

    $$\begin{aligned} \hat{\mu} &= m = \bar{\bar{y}} - \bar{\bar{\mathbf{x}}}'\mathbf{b}, \\ \hat{\alpha}_i &= a_i = (\bar{y}_{i.} - \bar{\bar{y}}) - (\bar{\mathbf{x}}_{i.} - \bar{\bar{\mathbf{x}}})'\mathbf{b}, \\ \hat{\gamma}_t &= c_t = (\bar{y}_{.t} - \bar{\bar{y}}) - (\bar{\mathbf{x}}_{.t} - \bar{\bar{\mathbf{x}}})'\mathbf{b}. \end{aligned} \tag{13-17}$$

    [9] The matrix algebra and the theoretical development of two-way effects in panel data models are complex. See, for example, Baltagi (1995). Fortunately, the practical application is much simpler. The number of periods analyzed in most panel data sets is rarely more than a handful. Since modern computer programs, even those written strictly for microcomputers, uniformly allow dozens (or even hundreds) of regressors, almost any application involving a second fixed effect can be handled just by literally including the second effect as a set of actual dummy variables.


    The estimated asymptotic covariance matrix for $\mathbf{b}$ is computed using the sums of squares and cross products of $\mathbf{x}_{*it}$ computed in (13-16) and

    $$s^2 = \frac{\sum_{i=1}^{n}\sum_{t=1}^{T}(y_{it} - \mathbf{x}_{it}'\mathbf{b} - m - a_i - c_t)^2}{nT - (n-1) - (T-1) - K - 1}.$$

    If one of n or T is small and the other is large, then it may be simpler just to treat the smaller set as an ordinary set of variables and apply the previous results to the one-way fixed effects model defined by the larger set. Although more general, this model is infrequently used in practice. There are two reasons. First, the cost in terms of degrees of freedom is often not justified. Second, in those instances in which a model of the timewise evolution of the disturbance is desired, a more general model than this simple dummy variable formulation is usually used.
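For a balanced panel, the double-deviation regression in (13-16) should reproduce the slopes from a regression that literally includes the group and period dummy variables. The sketch below checks this on simulated data; the helper `two_way_demean` and all numbers are illustrative assumptions, and the rows are assumed to be sorted by unit and then by period.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T, K = 6, 15, 2
i_idx = np.repeat(np.arange(n), T)       # unit index
t_idx = np.tile(np.arange(T), n)         # period index
X = rng.normal(size=(n * T, K))
y = (X @ np.array([1.5, -0.8]) + rng.normal(size=n)[i_idx]
     + rng.normal(size=T)[t_idx] + 0.2 * rng.normal(size=n * T))

def two_way_demean(A):
    """A_it - Abar_i. - Abar_.t + overall mean, for a balanced panel sorted by (i, t)."""
    A3 = A.reshape(n, T, -1)
    out = (A3 - A3.mean(axis=1, keepdims=True)
              - A3.mean(axis=0, keepdims=True)
              + A3.mean(axis=(0, 1), keepdims=True))
    return out.reshape(A.shape)

Xs, ys = two_way_demean(X), two_way_demean(y)
b = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)

# Same slopes from explicit dummies: n group dummies plus T - 1 period dummies.
Di = (i_idx[:, None] == np.arange(n)).astype(float)
Dt = (t_idx[:, None] == np.arange(1, T)).astype(float)
b_dummy = np.linalg.lstsq(np.hstack([X, Di, Dt]), y, rcond=None)[0][:K]

# Residual variance with the two-way degrees of freedom correction.
e = ys - Xs @ b
s2 = e @ e / (n * T - (n - 1) - (T - 1) - K - 1)
print(np.allclose(b, b_dummy), s2)
```

The simple double-deviation formula relies on the panel being balanced; with unequal group sizes, the explicit dummy variable approach is the safer route.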
    Example 13.2  Fixed Effects Regressions

    Table 13.1 contains the estimated cost equations with individual firm effects, specific period effects, and both firm and period effects. For comparison, the least squares and group means results are given also. The F statistic for testing the joint significance of the firm effects is

    $$F[5, 81] = \frac{(0.997434 - 0.98829)/5}{(1 - 0.997431)/81} = 57.614.$$

    The critical value from the F table is 2.327, so the evidence is strongly in favor of a firm specific effect in the data. The same computation for the time effects, in the absence of the firm effects, produces an F[14, 72] statistic of 1.170, which is considerably less than the 95 percent critical value of 1.832. Thus, on this basis, there does not appear to be a significant cost difference across the different periods that is not accounted for by the fuel price variable, output, and load factors. There is a distinctive pattern to the time effects, which we will examine more closely later. In the presence of the firm effects, the F[14, 67] ratio for the joint significance of the period effects is 3.149, which is larger than the table value of 1.842.

    TABLE 13.1  Cost Equations with Fixed Firm and Period Effects

    Parameter Estimates
    Specification           β1          β2          β3          β4          R²        s²
    No effects              9.517       0.88274     0.45398     -1.6275     0.98829   0.015528
                            (0.22924)   (0.013255)  (0.020304)  (0.34530)
    Group means             85.809      0.78246     -5.5240     -1.7510     0.99364   0.015838
                            (56.483)    (0.10877)   (4.47879)   (2.74319)
    Firm effects                        0.91928     0.41749     -1.07040    0.99743   0.003625
                                        (0.029890)  (0.015199)  (0.20169)
      a1 ... a6:    9.706    9.665    9.497    9.891    9.730    9.793
    Time effects                        0.86773     -0.48448    -1.95440    0.99046   0.016705
                                        (0.015408)  (0.36411)   (0.44238)
      c1 ... c8:    20.496   20.578   20.656   20.741   21.200   21.411   21.503   21.654
      c9 ... c15:   21.829   22.114   22.465   22.651   22.616   22.552   22.537
    Firm and time effects   12.667      0.81725     0.16861     -0.88281    0.99845   0.002727
                            (2.0811)    (0.031851)  (0.16348)   (0.26174)
      a1 ... a6:    0.12833  0.06549  -0.18947  0.13425  -0.09265  -0.04596
      c1 ... c8:    -0.37402 -0.31932 -0.27669 -0.22304 -0.15393 -0.10809 -0.07686 -0.02073
      c9 ... c15:   0.04722  0.09173  0.20731  0.28547  0.30138  0.30047  0.31911

    13.3.4 UNBALANCED PANELS AND FIXED EFFECTS

    Missing data are very common in panel data sets. For this reason, or perhaps just because of the way the data were recorded, panels in which the group sizes differ across groups are not unusual. These panels are called unbalanced panels. The preceding analysis assumed equal group sizes and relied on the assumption at several points. A modification to allow unequal group sizes is quite simple. First, the full sample size is $\sum_{i=1}^{n} T_i$ instead of nT, which calls for minor modifications in the computations of $s^2$, Var[$\mathbf{b}$], Var[$a_i$], and the F statistic. Second, group means must be based on $T_i$, which varies across groups. The overall means for the regressors are

    $$\bar{\bar{\mathbf{x}}} = \frac{\sum_{i=1}^{n}\sum_{t=1}^{T_i}\mathbf{x}_{it}}{\sum_{i=1}^{n}T_i} = \frac{\sum_{i=1}^{n}T_i\bar{\mathbf{x}}_{i.}}{\sum_{i=1}^{n}T_i} = \sum_{i=1}^{n} f_i\bar{\mathbf{x}}_{i.},$$

    where $f_i = T_i/(\sum_{i=1}^{n}T_i)$. If the group sizes are equal, then $f_i = 1/n$. The within-groups moment matrix shown in (13-4), $\mathbf{S}_{xx}^{within} = \mathbf{X}'\mathbf{M}_D\mathbf{X}$, is

    $$\sum_{i=1}^{n}\mathbf{X}_i'\mathbf{M}_i^0\mathbf{X}_i = \sum_{i=1}^{n}\left(\sum_{t=1}^{T_i}(\mathbf{x}_{it} - \bar{\mathbf{x}}_{i.})(\mathbf{x}_{it} - \bar{\mathbf{x}}_{i.})'\right).$$

    The other moments, $\mathbf{S}_{xy}^{within}$ and $S_{yy}^{within}$, are computed likewise. No other changes are necessary for the one factor LSDV estimator. The two-way model can be handled likewise, although with unequal group sizes in both directions, the algebra becomes fairly cumbersome. Once again, however, the practice is much simpler than the theory. The easiest approach for unbalanced panels is just to create the full set of T dummy variables using as T the union of the dates represented in the full data set. One (presumably the last) is dropped, so we revert back to (13-15). Then, within each group, any of the T periods represented is accounted for by using one of the dummy variables. Least squares using the LSDV approach for the group effects will then automatically take care of the messy accounting details.
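The modifications for unequal group sizes amount to computing each group mean over its own $T_i$ observations. A minimal sketch with hypothetical group sizes and simulated data follows; it checks that the overall mean equals the $f_i$-weighted average of the group means and that the within estimator still matches the dummy variable regression.

```python
import numpy as np

rng = np.random.default_rng(3)
T_i = np.array([3, 5, 2, 7])                  # hypothetical unequal group sizes
n, N, K = len(T_i), T_i.sum(), 2
g = np.repeat(np.arange(n), T_i)              # group index for each of the N rows
X = rng.normal(size=(N, K))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)[g] + 0.1 * rng.normal(size=N)

# Group means based on T_i observations each.
Xbar_i = np.vstack([np.bincount(g, weights=X[:, k]) for k in range(K)]).T / T_i[:, None]
ybar_i = np.bincount(g, weights=y) / T_i

# Overall mean as the weighted average of group means, f_i = T_i / sum(T_i).
f = T_i / T_i.sum()
xbar = f @ Xbar_i

# Within-groups moments and the one-factor LSDV slopes.
Xd, yd = X - Xbar_i[g], y - ybar_i[g]
S_xx_within = Xd.T @ Xd
b = np.linalg.solve(S_xx_within, Xd.T @ yd)
s2 = np.sum((yd - Xd @ b) ** 2) / (N - n - K)   # sum(T_i) - n - K replaces nT - n - K

# Check against explicit group dummies.
D = (g[:, None] == np.arange(n)).astype(float)
b_dummy = np.linalg.lstsq(np.hstack([X, D]), y, rcond=None)[0][:K]
print(np.allclose(xbar, X.mean(axis=0)), np.allclose(b, b_dummy))
```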

    13.4 RANDOM EFFECTS

    The fixed effects model allows the unobserved individual effects to be correlated with the included variables. We then modeled the differences between units strictly as parametric shifts of the regression function. This model might be viewed as applying only to the cross-sectional units in the study, not to additional ones outside the sample. For example, an intercountry comparison may well include the full set of countries for which it is reasonable to assume that the model is constant. If the individual effects are strictly uncorrelated with the regressors, then it might be appropriate to model the individual specific constant terms as randomly distributed across cross-sectional units. This view would be appropriate if we believed that sampled cross-sectional units were drawn from a large population. It would certainly be the case for the longitudinal data sets listed


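    in the introduction to this chapter.[10] The payoff to this form is that it greatly reduces the number of parameters to be estimated. The cost is the possibility of inconsistent estimates, should the assumption turn out to be inappropriate. Consider, then, a reformulation of the model

    $$y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + (\alpha + u_i) + \varepsilon_{it}, \tag{13-18}$$

    where there are K regressors including a constant and now the single constant term is the mean of the unobserved heterogeneity, $E[\mathbf{z}_i'\boldsymbol{\alpha}]$. The component $u_i$ is the random heterogeneity specific to the ith observation and is constant through time; recall from Section 13.2, $u_i = \{\mathbf{z}_i'\boldsymbol{\alpha} - E[\mathbf{z}_i'\boldsymbol{\alpha}]\}$. For example, in an analysis of families, we can view $u_i$ as the collection of factors, $\mathbf{z}_i'\boldsymbol{\alpha}$, not in the regression that are specific to that family. We assume further that

    $$\begin{aligned} E[\varepsilon_{it} \mid \mathbf{X}] &= E[u_i \mid \mathbf{X}] = 0, \\ E[\varepsilon_{it}^2 \mid \mathbf{X}] &= \sigma_\varepsilon^2, \\ E[u_i^2 \mid \mathbf{X}] &= \sigma_u^2, \\ E[\varepsilon_{it}u_j \mid \mathbf{X}] &= 0 \quad \text{for all } i, t, \text{ and } j, \\ E[\varepsilon_{it}\varepsilon_{js} \mid \mathbf{X}] &= 0 \quad \text{if } t \neq s \text{ or } i \neq j, \\ E[u_iu_j \mid \mathbf{X}] &= 0 \quad \text{if } i \neq j. \end{aligned} \tag{13-19}$$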

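    As before, it is useful to view the formulation of the model in blocks of T observations for group i, $\mathbf{y}_i$, $\mathbf{X}_i$, $u_i\mathbf{i}$, and $\boldsymbol{\varepsilon}_i$. For these T observations, let

    $$\eta_{it} = \varepsilon_{it} + u_i$$

    and

    $$\boldsymbol{\eta}_i = [\eta_{i1}, \eta_{i2}, \ldots, \eta_{iT}]'.$$

    In view of this form of $\eta_{it}$, we have what is often called an "error components model." For this model,

    $$\begin{aligned} E[\eta_{it}^2 \mid \mathbf{X}] &= \sigma_\varepsilon^2 + \sigma_u^2, \\ E[\eta_{it}\eta_{is} \mid \mathbf{X}] &= \sigma_u^2, \quad t \neq s, \\ E[\eta_{it}\eta_{js} \mid \mathbf{X}] &= 0 \quad \text{for all } t \text{ and } s \text{ if } i \neq j. \end{aligned}$$

    For the T observations for unit i, let $\boldsymbol{\Sigma} = E[\boldsymbol{\eta}_i\boldsymbol{\eta}_i' \mid \mathbf{X}]$. Then

    $$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_\varepsilon^2 + \sigma_u^2 & \sigma_u^2 & \cdots & \sigma_u^2 \\ \sigma_u^2 & \sigma_\varepsilon^2 + \sigma_u^2 & \cdots & \sigma_u^2 \\ \vdots & & \ddots & \vdots \\ \sigma_u^2 & \sigma_u^2 & \cdots & \sigma_\varepsilon^2 + \sigma_u^2 \end{bmatrix} = \sigma_\varepsilon^2\mathbf{I}_T + \sigma_u^2\mathbf{i}_T\mathbf{i}_T', \tag{13-20}$$

    [10] This distinction is not hard and fast; it is purely heuristic. We shall return to this issue later. See Mundlak (1978) for methodological discussion of the distinction between fixed and random effects.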


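    where $\mathbf{i}_T$ is a $T \times 1$ column vector of 1s. Since observations i and j are independent, the disturbance covariance matrix for the full nT observations is

    $$\boldsymbol{\Omega} = \begin{bmatrix} \boldsymbol{\Sigma} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \boldsymbol{\Sigma} & \cdots & \mathbf{0} \\ \vdots & & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \boldsymbol{\Sigma} \end{bmatrix} = \mathbf{I}_n \otimes \boldsymbol{\Sigma}. \tag{13-21}$$

    13.4.1 GENERALIZED LEAST SQUARES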

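    The generalized least squares estimator of the slope parameters is

    $$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{y} = \left(\sum_{i=1}^{n}\mathbf{X}_i'\boldsymbol{\Sigma}^{-1}\mathbf{X}_i\right)^{-1}\left(\sum_{i=1}^{n}\mathbf{X}_i'\boldsymbol{\Sigma}^{-1}\mathbf{y}_i\right).$$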

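    To compute this estimator as we did in Chapter 10 by transforming the data and using ordinary least squares with the transformed data, we will require $\boldsymbol{\Omega}^{-1/2} = [\mathbf{I}_n \otimes \boldsymbol{\Sigma}]^{-1/2}$. We need only find $\boldsymbol{\Sigma}^{-1/2}$, which is

    $$\boldsymbol{\Sigma}^{-1/2} = \frac{1}{\sigma_\varepsilon}\left[\mathbf{I} - \frac{\theta}{T}\mathbf{i}_T\mathbf{i}_T'\right],$$

    where

    $$\theta = 1 - \frac{\sigma_\varepsilon}{\sqrt{\sigma_\varepsilon^2 + T\sigma_u^2}}.$$

    The transformation of $\mathbf{y}_i$ and $\mathbf{X}_i$ for GLS is therefore

    $$\boldsymbol{\Sigma}^{-1/2}\mathbf{y}_i = \frac{1}{\sigma_\varepsilon}\begin{bmatrix} y_{i1} - \theta\bar{y}_{i.} \\ y_{i2} - \theta\bar{y}_{i.} \\ \vdots \\ y_{iT} - \theta\bar{y}_{i.} \end{bmatrix}, \tag{13-22}$$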

    and likewise for the rows of $\mathbf{X}_i$.[11] For the data set as a whole, then, generalized least squares is computed by the regression of these partial deviations of $y_{it}$ on the same transformations of $\mathbf{x}_{it}$. Note the similarity of this procedure to the computation in the LSDV model, which uses $\theta = 1$. (One could interpret $\theta$ as the effect that would remain if $\sigma_\varepsilon$ were zero, because the only effect would then be $u_i$. In this case, the fixed and random effects models would be indistinguishable, so this result makes sense.)

    It can be shown that the GLS estimator is, like the OLS estimator, a matrix weighted average of the within- and between-units estimators:

    $$\hat{\boldsymbol{\beta}} = \hat{\mathbf{F}}^{within}\mathbf{b}^{within} + (\mathbf{I} - \hat{\mathbf{F}}^{within})\mathbf{b}^{between},\text{[12]} \tag{13-23}$$

    [11] This transformation is a special case of the more general treatment in Nerlove (1971b).

    [12] An alternative form of this expression, in which the weighing matrices are proportional to the covariance matrices of the two estimators, is given by Judge et al. (1985).
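The partial-deviations computation in (13-22) can be compared with direct GLS using the full covariance matrix. The sketch below assumes the variance components are known and uses simulated data; the names `partial_demean`, `sig_e`, `sig_u` and all numerical values are illustrative assumptions. The common factor $1/\sigma_\varepsilon$ is omitted from the transformation because it does not affect the least squares coefficients.

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 10, 4
sig_e, sig_u = 1.0, 0.8                      # variance components, assumed known here
g = np.repeat(np.arange(n), T)
X = np.column_stack([np.ones(n * T), rng.normal(size=(n * T, 2))])   # includes a constant
y = (X @ np.array([0.5, 1.0, -1.0]) + sig_u * rng.normal(size=n)[g]
     + sig_e * rng.normal(size=n * T))

theta = 1.0 - sig_e / np.sqrt(sig_e**2 + T * sig_u**2)

def partial_demean(A):
    """A_it - theta * Abar_i. for a balanced panel sorted by group."""
    A3 = A.reshape(n, T, -1)
    return (A3 - theta * A3.mean(axis=1, keepdims=True)).reshape(A.shape)

# GLS as ordinary least squares on the partially demeaned data.
Xs, ys = partial_demean(X), partial_demean(y)
b_gls = np.linalg.lstsq(Xs, ys, rcond=None)[0]

# Direct GLS with Omega = I_n (x) Sigma, feasible only because n*T is tiny here.
Sigma = sig_e**2 * np.eye(T) + sig_u**2 * np.ones((T, T))
Omega_inv = np.kron(np.eye(n), np.linalg.inv(Sigma))
b_direct = np.linalg.solve(X.T @ Omega_inv @ X, X.T @ Omega_inv @ y)
print(np.allclose(b_gls, b_direct))
```

Setting `theta` to 1 in this sketch would remove the group means entirely, which is the LSDV transformation; setting it to 0 leaves the data untouched, which is pooled OLS.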


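    where now,

    $$\hat{\mathbf{F}}^{within} = [\mathbf{S}_{xx}^{within} + \lambda\mathbf{S}_{xx}^{between}]^{-1}\mathbf{S}_{xx}^{within},$$

    $$\lambda = \frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T\sigma_u^2} = (1 - \theta)^2.$$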

    To the extent that $\lambda$ differs from one, we see that the inefficiency of least squares will follow from an inefficient weighting of the two estimators. Compared with generalized least squares, ordinary least squares places too much weight on the between-units variation. It includes it all in the variation in X, rather than apportioning some of it to random variation across groups attributable to the variation in $u_i$ across units.

    There are some polar cases to consider. If $\lambda$ equals 1, then generalized least squares is identical to ordinary least squares. This situation would occur if $\sigma_u^2$ were zero, in which case a classical regression model would apply. If $\lambda$ equals zero, then the estimator is the dummy variable estimator we used in the fixed effects setting. There are two possibilities. If $\sigma_\varepsilon^2$ were zero, then all variation across units would be due to the different $u_i$s, which, because they are constant across time, would be equivalent to the dummy variables we used in the fixed effects model. The question of whether they were fixed or random would then become moot. They are the only source of variation across units once the regression is accounted for. The other case is $T \to \infty$. We can view it this way: If $T \to \infty$, then the unobserved $u_i$ becomes observable. Take the T observations for the ith unit. Our estimator of $[\alpha, \beta]$ is consistent in the dimensions T or n. Therefore, $y_{it} - x_{it}'\beta - \alpha = u_i + \varepsilon_{it}$ becomes observable. The individual means will provide $\bar{y}_{i.} - \bar{x}_{i.}'\beta - \alpha = u_i + \bar{\varepsilon}_{i.}$. But $\bar{\varepsilon}_{i.}$ converges to zero, which reveals $u_i$ to us. Therefore, if T goes to infinity, $u_i$ becomes the $\alpha_i d_i$ we used earlier.

    Unbalanced panels add a layer of difficulty in the random effects model. The first problem can be seen in (13-21). The matrix $\Omega$ is no longer $I \otimes \Sigma$ because the diagonal blocks in $\Omega$ are of different sizes. There is also groupwise heteroscedasticity, because the ith diagonal block in $\Omega^{-1/2}$ is

    $$\Omega_i^{-1/2} = I_{T_i} - \frac{\theta_i}{T_i}\, i_{T_i} i_{T_i}', \qquad \theta_i = 1 - \frac{\sigma_\varepsilon}{\sqrt{\sigma_\varepsilon^2 + T_i \sigma_u^2}}.$$

    In principle, estimation is still straightforward, since the source of the groupwise heteroscedasticity is only the unequal group sizes. Thus, for GLS, or FGLS with estimated variance components, it is necessary only to use the group-specific $\theta_i$ in the transformation in (13-22).
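    The group-specific weighting for unbalanced panels can be sketched numerically. The snippet below is only an illustration, assuming the variance components are known and that the transformation takes the partial-demeaning form $y_{it} - \theta_i \bar{y}_{i.}$; the function names `theta_i` and `partial_demean` and the toy numbers are hypothetical, not part of the text.

```python
import math

def theta_i(sigma_eps2, sigma_u2, T_i):
    """Group-specific weight: 1 - sigma_eps / sqrt(sigma_eps^2 + T_i * sigma_u^2)."""
    return 1.0 - math.sqrt(sigma_eps2 / (sigma_eps2 + T_i * sigma_u2))

def partial_demean(groups, sigma_eps2, sigma_u2):
    """Transform each group's series to y_it - theta_i * ybar_i.

    `groups` is a list of lists; group sizes may differ (unbalanced panel).
    """
    out = []
    for y in groups:
        th = theta_i(sigma_eps2, sigma_u2, len(y))
        ybar = sum(y) / len(y)
        out.append([v - th * ybar for v in y])
    return out

# Toy unbalanced panel (hypothetical numbers): one group with 3 periods, one with 8.
panel = [[1.0, 2.0, 3.0], [4.0] * 8]
transformed = partial_demean(panel, sigma_eps2=1.0, sigma_u2=1.0)

# Larger groups get a weight closer to 1, that is, closer to full demeaning.
print(theta_i(1.0, 1.0, 3), theta_i(1.0, 1.0, 8))
```

    With $\sigma_u^2 = 0$ the weight is zero for every group and the data are left untransformed, which is the pooled least squares case described above.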
    13.4.2 FEASIBLE GENERALIZED LEAST SQUARES WHEN $\Sigma$ IS UNKNOWN

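    If the variance components are known, generalized least squares can be computed as shown earlier. Of course, this is unlikely, so as usual, we must first estimate the disturbance variances and then use an FGLS procedure. A heuristic approach to estimation of the variance components is as follows:

    $$y_{it} = x_{it}'\beta + \alpha + \varepsilon_{it} + u_i \tag{13-24}$$

    and

    $$\bar{y}_{i.} = \bar{x}_{i.}'\beta + \alpha + \bar{\varepsilon}_{i.} + u_i.$$

    Therefore, taking deviations from the group means removes the heterogeneity:

    $$y_{it} - \bar{y}_{i.} = [x_{it} - \bar{x}_{i.}]'\beta + [\varepsilon_{it} - \bar{\varepsilon}_{i.}]. \tag{13-25}$$

    Since

    $$E\left[\sum_{t=1}^{T} (\varepsilon_{it} - \bar{\varepsilon}_{i.})^2\right] = (T - 1)\sigma_\varepsilon^2,$$

    if $\beta$ were observed, then an unbiased estimator of $\sigma_\varepsilon^2$ based on T observations in group i would be

    $$\hat{\sigma}_\varepsilon^2(i) = \frac{\sum_{t=1}^{T} (\varepsilon_{it} - \bar{\varepsilon}_{i.})^2}{T - 1}. \tag{13-26}$$

    Since $\beta$ must be estimated ((13-25) implies that the LSDV estimator is consistent, indeed, unbiased in general), we make the degrees of freedom correction and use the LSDV residuals in

    $$s_e^2(i) = \frac{\sum_{t=1}^{T} (e_{it} - \bar{e}_{i.})^2}{T - K - 1}. \tag{13-27}$$

    We have n such estimators, so we average them to obtain

    $$\bar{s}_e^2 = \frac{1}{n}\sum_{i=1}^{n} s_e^2(i) = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{\sum_{t=1}^{T} (e_{it} - \bar{e}_{i.})^2}{T - K - 1}\right] = \frac{\sum_{i=1}^{n}\sum_{t=1}^{T} (e_{it} - \bar{e}_{i.})^2}{nT - nK - n}. \tag{13-28}$$

    The degrees of freedom correction in $\bar{s}_e^2$ is excessive because it assumes that $\alpha$ and $\beta$ are reestimated for each i. The estimated parameters are the n means $\bar{y}_{i.}$ and the K slopes. Therefore, we propose the unbiased estimator [13]

    $$\hat{\sigma}_\varepsilon^2 = s_{LSDV}^2 = \frac{\sum_{i=1}^{n}\sum_{t=1}^{T} (e_{it} - \bar{e}_{i.})^2}{nT - n - K}. \tag{13-29}$$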

    This is the variance estimator in the LSDV model in (13-8), appropriately corrected for degrees of freedom.

    It remains to estimate $\sigma_u^2$. Return to the original model specification in (13-24). In spite of the correlation across observations, this is a classical regression model in which the ordinary least squares slopes and variance estimators are both consistent and, in most cases, unbiased. Therefore, using the ordinary least squares residuals from the model with only a single overall constant, we have

    $$\operatorname{plim} s_{Pooled}^2 = \operatorname{plim} \frac{e'e}{nT - K - 1} = \sigma_\varepsilon^2 + \sigma_u^2. \tag{13-30}$$

    [13] A formal proof of this proposition may be found in Maddala (1971) or in Judge et al. (1985, p. 551).

    This provides the two estimators needed for the variance components; the second would be $\hat{\sigma}_u^2 = s_{Pooled}^2 - s_{LSDV}^2$. A possible complication is that this second estimator could be negative. But, recall that for feasible generalized least squares, we do not need an unbiased estimator of the variance, only a consistent one. As such, we may drop the degrees of freedom corrections in (13-29) and (13-30). If so, then the two variance estimators must be nonnegative, since the sum of squares in the LSDV model cannot be larger than that in the simple regression with only one constant term. Alternative estimators have been proposed, all based on this principle of using two different sums of squared residuals. [14]

    There is a remaining complication. If there are any regressors that do not vary within the groups, the LSDV estimator cannot be computed. For example, in a model of family income or labor supply, one of the regressors might be a dummy variable for location, family structure, or living arrangement. Any of these could be perfectly collinear with the fixed effect for that family, which would prevent computation of the LSDV estimator. In this case, it is still possible to estimate the random effects variance components. Let $[b, a]$ be any consistent estimator of $[\beta, \alpha]$, such as the ordinary least squares estimator. Then, (13-30) provides a consistent estimator of $m_{ee} = \sigma_\varepsilon^2 + \sigma_u^2$. The mean squared residuals using a regression based only on the n group means provides a consistent estimator of $m_{**} = \sigma_u^2 + (\sigma_\varepsilon^2 / T)$, so we can use

    $$\hat{\sigma}_\varepsilon^2 = \frac{T}{T - 1}\,(m_{ee} - m_{**}),$$

    $$\hat{\sigma}_u^2 = \frac{T}{T - 1}\, m_{**} - \frac{1}{T - 1}\, m_{ee} = \omega\, m_{**} + (1 - \omega)\, m_{ee},$$

    where $\omega = T/(T - 1) > 1$. As before, this estimator can produce a negative estimate of $\sigma_u^2$ that, once again, calls the specification of the model into question. (Note, finally, that the residuals in (13-29) and (13-30) could be based on the same coefficient vector.)
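    The two-sums-of-squares idea can be illustrated with simulated data. The following is a minimal sketch, assuming a balanced panel stacked unit by unit and NumPy as the numerical library; the function name `variance_components`, the simulated design, and the parameter values are all assumptions made for illustration.

```python
import numpy as np

def variance_components(y, X, n, T):
    """Estimate (sigma_eps^2, sigma_u^2) from a balanced panel stacked unit by unit.

    y: (n*T,) outcomes; X: (n*T, K) regressors without a constant column.
    Also returns the pooled and LSDV sums of squared residuals.
    """
    K = X.shape[1]
    # Pooled OLS with a single overall constant: s2_pooled estimates sigma_eps^2 + sigma_u^2.
    Xc = np.column_stack([np.ones(n * T), X])
    e_pooled = y - Xc @ np.linalg.lstsq(Xc, y, rcond=None)[0]
    ssr_pooled = float(e_pooled @ e_pooled)
    s2_pooled = ssr_pooled / (n * T - K - 1)
    # LSDV via deviations from group means: s2_lsdv estimates sigma_eps^2.
    yd = y - np.repeat(y.reshape(n, T).mean(axis=1), T)
    Xd = X - np.repeat(X.reshape(n, T, K).mean(axis=1), T, axis=0)
    e_lsdv = yd - Xd @ np.linalg.lstsq(Xd, yd, rcond=None)[0]
    ssr_lsdv = float(e_lsdv @ e_lsdv)
    s2_lsdv = ssr_lsdv / (n * T - n - K)
    return s2_lsdv, s2_pooled - s2_lsdv, ssr_pooled, ssr_lsdv

# Simulated random effects panel (hypothetical design): sigma_eps^2 = 1, sigma_u^2 = 0.25.
rng = np.random.default_rng(0)
n, T = 500, 6
X = rng.normal(size=(n * T, 2))
u = np.repeat(rng.normal(scale=0.5, size=n), T)
y = 1.0 + X @ np.array([0.5, -0.3]) + u + rng.normal(size=n * T)

s2_eps, s2_u, ssr_pooled, ssr_lsdv = variance_components(y, X, n, T)
print(s2_eps, s2_u)
```

    The difference `s2_pooled - s2_lsdv` is not forced to be positive in a given sample, which is the complication noted in the text.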

    13.4.3 TESTING FOR RANDOM EFFECTS

    Breusch and Pagan (1980) have devised a Lagrange multiplier test for the random effects model based on the OLS residuals. [15] For

    $$H_0: \sigma_u^2 = 0 \quad (\text{or } \operatorname{Corr}[\eta_{it}, \eta_{is}] = 0),$$

    $$H_1: \sigma_u^2 \neq 0,$$

    the test statistic is

    $$LM = \frac{nT}{2(T - 1)}\left[\frac{\sum_{i=1}^{n}\left[\sum_{t=1}^{T} e_{it}\right]^2}{\sum_{i=1}^{n}\sum_{t=1}^{T} e_{it}^2} - 1\right]^2 = \frac{nT}{2(T - 1)}\left[\frac{\sum_{i=1}^{n} (T\bar{e}_{i.})^2}{\sum_{i=1}^{n}\sum_{t=1}^{T} e_{it}^2} - 1\right]^2. \tag{13-31}$$

    Under the null hypothesis, LM is distributed as chi-squared with one degree of freedom.

    [14] See, for example, Wallace and Hussain (1969), Maddala (1971), Fuller and Battese (1974), and Amemiya (1971).

    [15] We have focused thus far strictly on generalized least squares and moments-based consistent estimation of the variance components. The LM test is based on maximum likelihood estimation, instead. See Maddala (1971) and Balestra and Nerlove (1966, 2003) for this approach to estimation.
    Example 13.3 Testing for Random Effects

    The least squares estimates for the cost equation were given in Example 13.1. The firm-specific means of the least squares residuals are

    $$\bar{e} = [0.068869, -0.013878, -0.19422, 0.15273, -0.021583, 0.0080906]'.$$

    The total sum of squared residuals for the least squares regression is $e'e = 1.33544$, so

    $$LM = \frac{nT}{2(T - 1)}\left[\frac{T^2\, \bar{e}'\bar{e}}{e'e} - 1\right]^2 = 334.85.$$

    Based on the least squares residuals, we obtain a Lagrange multiplier test statistic of 334.85, which far exceeds the 95 percent critical value for chi-squared with one degree of freedom, 3.84. At this point, we conclude that the classical regression model with a single constant term is inappropriate for these data. The result of the test is to reject the null hypothesis in favor of the random effects model. But, it is best to reserve judgment on that, because there is another competing specification that might induce these same results, the fixed effects model. We will examine this possibility in the subsequent examples.
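    The arithmetic of this example can be checked directly. The sketch below evaluates the LM statistic from the six firm-specific residual means and the total sum of squared residuals, with n = 6 and T = 15 as in the example; the signs of the residual means are chosen so that they sum to zero, as least squares residuals from a regression with a constant must. The function name `breusch_pagan_lm` is hypothetical.

```python
def breusch_pagan_lm(group_mean_resid, sse, T):
    """LM statistic for a balanced panel from group means of the OLS residuals.

    Uses sum_i (sum_t e_it)^2 = T^2 * sum_i ebar_i^2.
    """
    n = len(group_mean_resid)
    num = T ** 2 * sum(m * m for m in group_mean_resid)
    return n * T / (2.0 * (T - 1)) * (num / sse - 1.0) ** 2

ebar = [0.068869, -0.013878, -0.19422, 0.15273, -0.021583, 0.0080906]
lm = breusch_pagan_lm(ebar, sse=1.33544, T=15)
print(round(lm, 2))  # about 334.85, far above the 3.84 critical value
```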

    With the variance estimators in hand, FGLS can be used to estimate the parameters of the model. All our earlier results for FGLS estimators apply here. It would also be possible to obtain the maximum likelihood estimator. [16] The likelihood function is complicated, but as we have seen repeatedly, the MLE of $\beta$ will be GLS based on the maximum likelihood estimators of the variance components. It can be shown that the MLEs of $\sigma_\varepsilon^2$ and $\sigma_u^2$ are the unbiased estimators shown earlier, without the degrees of freedom corrections. [17] This model satisfies the requirements for the Oberhofer-Kmenta (1974) algorithm (see Section 11.7.2), so we could also use the iterated FGLS procedure to obtain the MLEs if desired. The initial consistent estimators can be based on least squares residuals. Still other estimators have been proposed. None will have better asymptotic properties than the MLE or FGLS estimators, but they may outperform them in a finite sample. [18]

    [16] See Hsiao (1986) and Nerlove (2003).

    [17] See Berzeg (1979).

    [18] See Maddala and Mount (1973).

    Example 13.4 Random Effects Models

    To compute the FGLS estimator, we require estimates of the variance components. The unbiased estimator of $\sigma_\varepsilon^2$ is the residual variance estimator in the within-units (LSDV) regression. Thus,

    $$\hat{\sigma}_\varepsilon^2 = \frac{0.2926222}{90 - 9} = 0.0036126.$$

    Using the least squares residuals from the pooled regression, we have

    $$\widehat{\sigma_\varepsilon^2 + \sigma_u^2} = \frac{1.335442}{90 - 4} = 0.015528,$$

    so

    $$\hat{\sigma}_u^2 = 0.015528 - 0.0036126 = 0.0119158.$$

    For purposes of FGLS,

    $$\hat{\theta} = 1 - \left[\frac{0.0036126}{0.0036126 + 15(0.0119158)}\right]^{1/2} = 0.8592.$$

    The FGLS estimates for this random effects model are shown in Table 13.2, with the fixed effects estimates. The estimated variance of the firm-specific component, $\hat{\sigma}_u^2$, is a bit more than three times the estimated variance of $\varepsilon_{it}$. Thus, by these estimates, roughly 77 percent of the disturbance variation is attributable to variation across firms, with the remainder attributable to variation within firms.
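    The variance component arithmetic in this example can be recomputed from the two residual sums of squares and the degrees of freedom quoted above (90 - 9 for the LSDV regression and 90 - 4 for the pooled regression). The following is a small check, assuming n = 6 firms, T = 15 periods, and K = 3 slopes, and using $\theta = 1 - \sigma_\varepsilon / \sqrt{\sigma_\varepsilon^2 + T\sigma_u^2}$; the variable names are hypothetical.

```python
import math

n, T, K = 6, 15, 3
sse_lsdv = 0.2926222     # sum of squared residuals, within-units (LSDV) regression
sse_pooled = 1.335442    # sum of squared residuals, pooled regression with one constant

s2_eps = sse_lsdv / (n * T - n - K)        # divides by 90 - 9
s2_pooled = sse_pooled / (n * T - K - 1)   # divides by 90 - 4
s2_u = s2_pooled - s2_eps                  # second variance component
theta = 1.0 - math.sqrt(s2_eps / (s2_eps + T * s2_u))
share_u = s2_u / (s2_u + s2_eps)           # fraction of disturbance variance due to u_i

print(round(s2_eps, 7), round(s2_u, 7), round(theta, 4), round(share_u, 3))
```

    The value obtained for $\hat{\sigma}_u^2$ agrees with the entry reported for the random effects model in Table 13.2.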

    None of the desirable properties of the estimators in the random effects model rely on T going to infinity. [19] Indeed, T is likely to be quite small. The maximum likelihood estimator of $\sigma_\varepsilon^2$ is exactly equal to an average of n estimators, each based on the T observations for unit i. (See (13-28).) Each component in this average is, in principle, consistent. That is, its variance is of order 1/T or smaller. Since T is small, this variance may be relatively large. But, each term provides some information about the parameter. The average over the n cross-sectional units has a variance of order 1/(nT), which will go to zero if n increases, even if we regard T as fixed. The conclusion to draw is that nothing in this treatment relies on T growing large. Although it can be shown that some consistency results will follow for T increasing, the typical panel data set is based on data sets for which it does not make sense to assume that T increases without bound or, in some cases, at all. [20] As a general proposition, it is necessary to take some care in devising estimators whose properties hinge on whether T is large or not. The widely used conventional ones we have discussed here do not, but we have not exhausted the possibilities.

    The LSDV model does rely on T increasing for consistency. To see this, we use the partitioned regression. The slopes are

    $$b = [X'M_D X]^{-1}[X'M_D y].$$

    Since X is $nT \times K$, as long as the inverted moment matrix converges to a zero matrix, b is consistent as long as either n or T increases without bound. But the dummy variable coefficients are

    $$a_i = \bar{y}_{i.} - \bar{x}_{i.}'b = \frac{1}{T}\sum_{t=1}^{T} (y_{it} - x_{it}'b).$$

    We have already seen that b is consistent. Suppose, for the present, that $\bar{x}_{i.} = 0$. Then $\operatorname{Var}[a_i] = \operatorname{Var}[y_{it}]/T$. Therefore, unless $T \to \infty$, the estimators of the unit-specific effects are not consistent. (They are, however, best linear unbiased.) This inconsistency is worth bearing in mind when analyzing data sets for which T is fixed and there is no intention to replicate the study and no logical argument that would justify the claim that it could have been replicated in principle.

    The random effects model was developed by Balestra and Nerlove (1966). Their formulation included a time-specific component, $\kappa_t$, as well as the individual effect:

    $$y_{it} = \alpha + \beta' x_{it} + \varepsilon_{it} + u_i + \kappa_t.$$

    The extended formulation is rather complicated analytically. In Balestra and Nerlove's study, it was made even more so by the presence of a lagged dependent variable that causes all the problems discussed earlier in our discussion of autocorrelation. A full set of results for this extended model, including a method for handling the lagged dependent variable, has been developed. [21] We will turn to this in Section 13.7.

    [19] See Nickell (1981).

    [20] In this connection, Chamberlain (1984) provided some innovative treatments of panel data that, in fact, take T as given in the model and that base consistency results solely on n increasing. Some additional results for dynamic models are given by Bhargava and Sargan (1983).
    13.4.4 HAUSMAN'S SPECIFICATION TEST FOR THE RANDOM EFFECTS MODEL

    At various points, we have made the distinction between fixed and random effects models. An inevitable question is, Which should be used? From a purely practical standpoint, the dummy variable approach is costly in terms of degrees of freedom lost. On the other hand, the fixed effects approach has one considerable virtue. There is little justification for treating the individual effects as uncorrelated with the other regressors, as is assumed in the random effects model. The random effects treatment, therefore, may suffer from the inconsistency due to this correlation between the included variables and the random effect. [22]

    The specification test devised by Hausman (1978) [23] is used to test for orthogonality of the random effects and the regressors. The test is based on the idea that under the hypothesis of no correlation, both OLS in the LSDV model and GLS are consistent, but OLS is inefficient, [24] whereas under the alternative, OLS is consistent, but GLS is not. Therefore, under the null hypothesis, the two estimates should not differ systematically, and a test can be based on the difference. The other essential ingredient for the test is the covariance matrix of the difference vector, $[b - \hat{\beta}]$:

    $$\operatorname{Var}[b - \hat{\beta}] = \operatorname{Var}[b] + \operatorname{Var}[\hat{\beta}] - \operatorname{Cov}[b, \hat{\beta}] - \operatorname{Cov}[b, \hat{\beta}]'. \tag{13-32}$$

    Hausman's essential result is that the covariance of an efficient estimator with its difference from an inefficient estimator is zero, which implies that

    $$\operatorname{Cov}[(b - \hat{\beta}), \hat{\beta}] = \operatorname{Cov}[b, \hat{\beta}] - \operatorname{Var}[\hat{\beta}] = 0$$

    or that

    $$\operatorname{Cov}[b, \hat{\beta}] = \operatorname{Var}[\hat{\beta}].$$

    Inserting this result in (13-32) produces the required covariance matrix for the test,

    $$\operatorname{Var}[b - \hat{\beta}] = \operatorname{Var}[b] - \operatorname{Var}[\hat{\beta}] = \Psi. \tag{13-33}$$

    [21] See Balestra and Nerlove (1966), Fomby, Hill, and Johnson (1984), Judge et al. (1985), Hsiao (1986), Anderson and Hsiao (1982), Nerlove (1971a, 2003), and Baltagi (1995).

    [22] See Hausman and Taylor (1981) and Chamberlain (1978).

    [23] Related results are given by Baltagi (1986).

    [24] Referring to the GLS matrix weighted average given earlier, we see that the efficient weight uses $\theta$, whereas OLS sets $\theta = 1$.


    The chi-squared test is based on the Wald criterion:

    $$W = \chi^2[K - 1] = [b - \hat{\beta}]'\,\hat{\Psi}^{-1}\,[b - \hat{\beta}]. \tag{13-34}$$

    For $\hat{\Psi}$, we use the estimated covariance matrices of the slope estimator in the LSDV model and the estimated covariance matrix in the random effects model, excluding the constant term. Under the null hypothesis, W has a limiting chi-squared distribution with K - 1 degrees of freedom.
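    The Wald criterion can be sketched in a few lines. The following is a minimal illustration, assuming NumPy and that the inputs are the slope estimates and their estimated covariance matrices with the constant term already excluded; the function name `hausman_statistic` and the toy inputs are hypothetical and are not taken from any example in the text.

```python
import numpy as np

def hausman_statistic(b_fe, b_re, V_fe, V_re):
    """Wald statistic d' Psi^{-1} d with d = b_fe - b_re and Psi = V_fe - V_re."""
    d = np.asarray(b_fe, dtype=float) - np.asarray(b_re, dtype=float)
    psi = np.asarray(V_fe, dtype=float) - np.asarray(V_re, dtype=float)
    # Solve Psi x = d rather than forming the inverse explicitly.
    return float(d @ np.linalg.solve(psi, d))

# Toy inputs (made up for illustration): two slopes, diagonal covariance matrices.
b_fe = [1.0, 2.0]
b_re = [0.9, 2.1]
V_fe = np.diag([0.02, 0.03])
V_re = np.diag([0.01, 0.01])

W = hausman_statistic(b_fe, b_re, V_fe, V_re)
print(W)  # compare with the chi-squared critical value for 2 degrees of freedom (5.99 at 95 percent)
```

    In a finite sample, the estimated difference `V_fe - V_re` need not be positive definite, so a computed statistic should be interpreted with that caveat in mind.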
    Example 13.5 Hausman Test

    The Hausman test for the fixed and random effects regressions is based on the parts of the coefficient vectors and the asymptotic covariance matrices that correspond to the slopes in the models, that is, ignoring the constant term(s). The coefficient estimates are given in Table 13.2. The two estimated asymptotic covariance matrices are

    $$\text{Est.Var}[b_{FE}] = \begin{bmatrix} 0.0008934 & 0.0003178 & 0.001884 \\ 0.0003178 & 0.0002310 & 0.0007686 \\ 0.001884 & 0.0007686 & 0.04068 \end{bmatrix}$$

    TABLE 13.2  Random and Fixed Effects Estimates (parameter estimates, with estimated standard errors in parentheses)

    Specification                          β1          β2           β3           β4          R²        s²
    No effects                             9.517       0.88274      0.45398      -1.6275     0.98829   0.015528
                                           (0.22924)   (0.013255)   (0.020304)   (0.34530)
    Firm effects
      Fixed effects                                    0.91930      0.41749      -1.0704     0.99743   0.0036125
                                                       (0.029890)   (0.015199)   (0.20169)
        White(1)                                       (0.019105)   (0.013533)   (0.21662)
        White(2)                                       (0.027977)   (0.013802)   (0.20372)
      Fixed effects with autocorrelation, ρ̂ = 0.5162
                                                       0.92975      0.38567      -1.22074              0.0019179
                                                       (0.033927)   (0.0167409)  (0.20174)
                                                       s²/(1 - ρ̂²) = 0.002807
      Random effects                       9.6106      0.90412      0.42390      -1.0646
                                           (0.20277)   (0.02462)    (0.01375)    (0.1993)
                                           σ̂_u² = 0.0119158, σ̂_ε² = 0.00361262
      Random effects with autocorrelation, ρ̂ = 0.5162
                                           10.139      0.91269      0.39123      -1.2074
                                           (0.2587)    (0.027783)   (0.016294)   (0.19852)
                                           σ̂_u² = 0.0268079, σ̂_ε² = 0.0037341
    Firm and time effects
      Fixed effects                        12.667      0.81725      0.16861      -0.88281    0.99845   0.0026727
                                           (2.0811)    (0.031851)   (0.16348)    (0.26174)
      Random effects                       9.799       0.84328      0.38760      -0.92943
                                           (0.87910)   (0.025839)   (0.06845)    (0.25721)
                                           σ̂_u² = 0.0142291, σ̂_ε² = 0.0026395, σ̂_v² = 0.0551958


    and

    $$\text{Est.Var}[b_{RE}] = \begin{bmatrix} 0.0006059 & 0.0002089 & 0.001450 \\ 0.0002089 & 0.00018897 & 0.002141 \\ 0.001450 & 0.002141 & 0.03973 \end{bmatrix}.$$

    The test statistic is 4.16. The critical value from the chi-squared table with three degrees of freedom is 7.814, which is far larger than the test value. The hypothesis that the individual effects are uncorrelated with the other regressors in the model cannot be rejected. Based on the LM test, which is decisive that there are individual effects, and the Hausman test, which suggests that these effects are uncorrelated with the other variables in the model, we would conclude that of the two alternatives we have considered, the random effects model is the better choice.

    13.5 INSTRUMENTAL VARIABLES ESTIMATION OF THE RANDOM EFFECTS MODEL

    Recall the original specification of the linear model for panel data in (13-1),

    $$y_{it} = x_{it}'\beta + z_i'\alpha + \varepsilon_{it}. \tag{13-35}$$

    The random effects model is based on the assumption that the unobserved person-specific effects, $z_i$, are uncorrelated with the included variables, $x_{it}$. This assumption is a major shortcoming of the model. However, the random effects treatment does allow the model to contain observed time-invariant characteristics, such as demographic characteristics, while the fixed effects model does not; if present, they are simply absorbed into the fixed effects. Hausman and Taylor's (1981) estimator for the random effects model suggests a way to overcome the first of these while accommodating the second.

    Their model is of the form

    $$y_{it} = x_{1it}'\beta_1 + x_{2it}'\beta_2 + z_{1i}'\alpha_1 + z_{2i}'\alpha_2 + \varepsilon_{it} + u_i,$$

    where $\beta = (\beta_1', \beta_2')'$ and $\alpha = (\alpha_1', \alpha_2')'$. In this formulation, all individual effects denoted $z_i$ are observed. As before, unobserved individual effects that are contained in $z_i'\alpha$ in (13-35) are contained in the person-specific random term, $u_i$. Hausman and Taylor define four sets of observed variables in the model:

    $x_{1it}$ is $K_1$ variables that are time varying and uncorrelated with $u_i$,
    $z_{1i}$ is $L_1$ variables that are time invariant and uncorrelated with $u_i$,
    $x_{2it}$ is $K_2$ variables that are time varying and are correlated with $u_i$,
    $z_{2i}$ is $L_2$ variables that are time invariant and are correlated with $u_i$.

    The assumptions about the random terms in the model are

    $$E[u_i] = E[u_i \mid x_{1it}, z_{1i}] = 0 \quad \text{though} \quad E[u_i \mid x_{2it}, z_{2i}] \neq 0,$$

    $$\operatorname{Var}[u_i \mid x_{1it}, z_{1i}, x_{2it}, z_{2i}] = \sigma_u^2,$$

    $$\operatorname{Cov}[\varepsilon_{it}, u_i \mid x_{1it}, z_{1i}, x_{2it}, z_{2i}] = 0,$$

    $$\operatorname{Var}[\varepsilon_{it} + u_i \mid x_{1it}, z_{1i}, x_{2it}, z_{2i}] = \sigma^2 = \sigma_\varepsilon^2 + \sigma_u^2,$$

    $$\operatorname{Corr}[\varepsilon_{it} + u_i, \varepsilon_{is} + u_i \mid x_{1it}, z_{1i}, x_{2it}, z_{2i}] = \rho = \sigma_u^2 / \sigma^2.$$


    Note the crucial assumption that one can distinguish sets of variables $x_1$ and $z_1$ that are uncorrelated with $u_i$ from $x_2$ and $z_2$ which are not. The likely presence of $x_2$ and $z_2$ is what complicates specification and estimation of the random effects model in the first place.

    By construction, any OLS or GLS estimators of this model are inconsistent when the model contains variables that are correlated with the random effects. Hausman and Taylor have proposed an instrumental variables estimator that uses only the information within the model (i.e., as already stated). The strategy for estimation is based on the following logic: First, by taking deviations from group means, we find that

    $$y_{it} - \bar{y}_{i.} = (x_{1it} - \bar{x}_{1i})'\beta_1 + (x_{2it} - \bar{x}_{2i})'\beta_2 + \varepsilon_{it} - \bar{\varepsilon}_{i.}, \tag{13-36}$$

    which implies that $\beta$ can be consistently estimated by least squares, in spite of the correlation between $x_2$ and u. This is the familiar fixed effects, least squares dummy variable estimator; the transformation to deviations from group means removes from the model the part of the disturbance that is correlated with $x_{2it}$. Now, in the original model, Hausman and Taylor show that the group mean deviations can be used as $(K_1 + K_2)$ instrumental variables for estimation of $(\beta, \alpha)$. That is the implication of (13-36). Since $z_1$ is uncorrelated with the disturbances, it can likewise serve as a set of $L_1$ instrumental variables. That leaves a necessity for $L_2$ instrumental variables. The authors show that the group means for $x_1$ can serve as these remaining instruments, and the model will be identified so long as $K_1$ is greater than or equal to $L_2$. For identification purposes, then, $K_1$ must be at least as large as $L_2$. As usual, feasible GLS is better than OLS, and available. Likewise, FGLS is an improvement over simple instrumental variable estimation of the model, which is consistent but inefficient.

    The authors propose the following set of steps for consistent and efficient estimation:

    Step 1. Obtain the LSDV (fixed effects) estimator of $\beta = (\beta_1', \beta_2')'$ based on $x_1$ and $x_2$. The residual variance estimator from this step is a consistent estimator of $\sigma_\varepsilon^2$.

    Step 2. Form the within-groups residuals, $e_{it}$, from the LSDV regression at step 1. Stack the group means of these residuals in a full sample length data vector. Thus, $e_{it}^* = \bar{e}_{i.}$, $t = 1, \ldots, T$, $i = 1, \ldots, n$. These group means are used as the dependent variable in an instrumental variable regression on $z_1$ and $z_2$ with instrumental variables $z_1$ and $x_1$. (Note the identification requirement that $K_1$, the number of variables in $x_1$, be at least as large as $L_2$, the number of variables in $z_2$.) The time-invariant variables are each repeated T times in the data matrices in this regression. This provides a consistent estimator of $\alpha$.

    Step 3. The residual variance in the regression in step 2 is a consistent estimator of $\sigma^{*2} = \sigma_u^2 + \sigma_\varepsilon^2 / T$. From this estimator and the estimator of $\sigma_\varepsilon^2$ in step 1, we deduce an estimator of $\sigma_u^2 = \sigma^{*2} - \sigma_\varepsilon^2 / T$. We then form the weight for feasible GLS in this model by forming the estimate of

    $$\theta = \sqrt{\frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T\sigma_u^2}}.$$

    Step 4. The final step is a weighted instrumental variable estimator. Let the full set of variables in the model be

    $$w_{it}' = (x_{1it}', x_{2it}', z_{1i}', z_{2i}').$$


    Collect these nT observations in the rows of data matrix W. The transformed variables for GLS are, as before when we first fit the random effects model,

    $$w_{it}^{*\prime} = w_{it}' - (1 - \hat{\theta})\,\bar{w}_i' \quad \text{and} \quad y_{it}^* = y_{it} - (1 - \hat{\theta})\,\bar{y}_i,$$

    where $\hat{\theta}$ denotes the sample estimate of $\theta$. The transformed data are collected in the rows of data matrix $W^*$ and in column vector $y^*$. Note that in the case of the time-invariant variables in $w_{it}$, the group mean is the original variable, and the transformation just multiplies the variable by $\hat{\theta}$. The instrumental variables are

    $$v_{it}' = [(x_{1it} - \bar{x}_{1i})', (x_{2it} - \bar{x}_{2i})', z_{1i}', \bar{x}_{1i}'].$$

    These are stacked in the rows of the $nT \times (K_1 + K_2 + L_1 + K_1)$ matrix V. Note that for the third and fourth sets of instruments, the time-invariant variables and group means are repeated for each member of the group. The instrumental variable estimator would be

    $$(\hat{\beta}', \hat{\alpha}')'_{IV} = [(W^{*\prime}V)(V'V)^{-1}(V'W^*)]^{-1}[(W^{*\prime}V)(V'V)^{-1}(V'y^*)]. \; [25] \tag{13-37}$$

    The instrumental variable estimator is consistent if the data are not weighted, that is, if W rather than $W^*$ is used in the computation. But, this is inefficient, in the same way that OLS is consistent but inefficient in estimation of the simpler random effects model.
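    The two computational pieces of step 4, the partial-demeaning transformation and the matrix formula in (13-37), can be sketched as follows. This is a minimal illustration, assuming NumPy, a balanced panel stacked unit by unit, and a given value of $\hat{\theta}$; the function names `gls_transform` and `iv_estimator` and the random toy data are hypothetical. It does not implement steps 1 to 3.

```python
import numpy as np

def gls_transform(A, n, T, theta_hat):
    """Return a_it - (1 - theta_hat) * abar_i for data stacked unit by unit (balanced panel)."""
    A = np.asarray(A, dtype=float).reshape(n * T, -1)
    means = np.repeat(A.reshape(n, T, -1).mean(axis=1), T, axis=0)
    return A - (1.0 - theta_hat) * means

def iv_estimator(W, V, y):
    """[(W'V)(V'V)^{-1}(V'W)]^{-1} [(W'V)(V'V)^{-1}(V'y)], the form used in (13-37)."""
    WV = W.T @ V
    VV_inv_VW = np.linalg.solve(V.T @ V, V.T @ W)
    VV_inv_Vy = np.linalg.solve(V.T @ V, V.T @ y)
    return np.linalg.solve(WV @ VV_inv_VW, WV @ VV_inv_Vy)

rng = np.random.default_rng(0)
n, T = 50, 4

# A time-invariant variable is simply multiplied by theta_hat by the transformation.
z = np.repeat(rng.normal(size=n), T)
z_star = gls_transform(z, n, T, theta_hat=0.3)

# Sanity check of the formula: with instruments equal to the regressors it reduces to OLS.
W = rng.normal(size=(n * T, 3))
y = W @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n * T)
delta_iv = iv_estimator(W, W, y)
delta_ols = np.linalg.lstsq(W, y, rcond=None)[0]
```

    In an actual application, `W` and `y` would be the transformed data $W^*$ and $y^*$, and `V` would hold the instrument set described above.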
    [25] Note that the FGLS random effects estimator would be $(\hat{\beta}', \hat{\alpha}')'_{RE} = [W^{*\prime}W^*]^{-1}W^{*\prime}y^*$.

    Example 13.6 The Returns to Schooling

    The economic returns to schooling have been a frequent topic of study by econometricians. The PSID and NLS data sets have provided a rich source of panel data for this effort. In wage (or log wage) equations, it is clear that the economic benefits of schooling are correlated with latent, unmeasured characteristics of the individual such as innate ability, intelligence, drive, or perseverance. As such, there is little question that simple random effects models based on panel data will suffer from the effects noted earlier. The fixed effects model is the obvious alternative, but these rich data sets contain many useful variables, such as race, union membership, and marital status, which are generally time invariant. Worse yet, the variable most of interest, years of schooling, is also time invariant. Hausman and Taylor (1981) proposed the estimator described here as a solution to these problems. The authors studied the effect of schooling on (the log of) wages using a random sample from the PSID of 750 men aged 25-55, observed in two years, 1968 and 1972. The two years were chosen so as to minimize the effect of serial correlation apart from the persistent unmeasured individual effects. The variables used in their model were as follows:

    Experience = age - years of schooling - 5,
    Years of schooling,
    Bad Health = a dummy variable indicating general health,
    Race = a dummy variable indicating nonwhite (70 of 750 observations),
    Union = a dummy variable indicating union membership,
    Unemployed = a dummy variable indicating previous year's unemployment.

    The model also included a constant term and a period indicator. (The coding of the latter is not given, but any two distinct values, including 0 for 1968 and 1 for 1972, would produce identical results. Why?)

    The primary focus of the study is the coefficient on schooling in the log wage equation. Since schooling and, probably, Experience and Unemployed are correlated with the latent effect, there is likely to be serious bias in conventional estimates of this equation. Table 13.3 reports some of their reported results. The OLS and random effects GLS results in the first two columns provide the benchmark for the rest of the study. The schooling coefficient is estimated at 0.067, a value which the authors suspected was far too small. As we saw earlier, even in the presence of correlation between measured and latent effects, in this model, the LSDV estimator provides a consistent estimator of the coefficients on the time-varying variables. Therefore, we can use it in the Hausman specification test for correlation between the included variables and the latent heterogeneity. The calculations are shown in Section 13.4.4, result (13-34). Since there are three variables remaining in the LSDV equation, the chi-squared statistic has three degrees of freedom. The reported value of 20.2 is far larger than the 95 percent critical value of 7.81, so the results suggest that the random effects model is misspecified.

    TABLE 13.3  Estimated Log Wage Equations (estimated asymptotic standard errors in parentheses; NR indicates that the coefficient estimate was not reported in the study)

    Variables                        OLS        GLS/RE     LSDV       HT/IV-GLS  HT/IV-GLS
    x1  Experience                   0.0132     0.0133     0.0241     0.0217
                                     (0.0011)   (0.0017)   (0.0042)   (0.0031)
        Bad health                   -0.0843    -0.0300    -0.0388    -0.0278    -0.0388
                                     (0.0412)   (0.0363)   (0.0460)   (0.0307)   (0.0348)
        Unemployed last year         -0.0015    -0.0402    -0.0560    -0.0559
                                     (0.0267)   (0.0207)   (0.0295)   (0.0246)
        Time                         NR         NR         NR         NR         NR
    x2  Experience                                                               0.0241
                                                                                 (0.0045)
        Unemployed                                                               -0.0560
                                                                                 (0.0279)
    z1  Race                         -0.0853    -0.0878               -0.0278    -0.0175
                                     (0.0328)   (0.0518)              (0.0752)   (0.0764)
        Union                        0.0450     0.0374                0.1227     0.2240
                                     (0.0191)   (0.0296)              (0.0473)   (0.2863)
        Schooling                    0.0669     0.0676
                                     (0.0033)   (0.0052)
        Constant                     NR         NR         NR         NR         NR
    z2  Schooling                                                     0.1246     0.2169
                                                                      (0.0434)   (0.0979)
    σ_ε                              0.321      0.192      0.160      0.190      0.629
    ρ = sqrt(σ_u²/(σ_u² + σ_ε²))                0.632                 0.661      0.817
    Spec. test [3]                              20.2                  2.24       0.00

    Hausman and Taylor proceeded to reestimate the log wage equation using their proposed estimator. The fourth and fifth sets of results in Table 13.3 present the instrumental variable estimates. The specification test given with the fourth set of results suggests that the procedure has produced the desired result. The hypothesis of the modified random effects model is now not rejected; the chi-squared value of 2.24 is much smaller than the critical value. The schooling variable is treated as endogenous (correlated with $u_i$) in both cases. The difference between the two is the treatment of Unemployed and Experience. In the preferred equation, they are included in $x_2$ rather than $x_1$. The end result of the exercise is, again, the coefficient on schooling, which has risen from 0.0669 in the worst specification (OLS) to 0.2169 in the last one, a difference of over 200 percent. As the authors note, at the same time, the measured effect of race nearly vanishes.


    13.6 GMM ESTIMATION OF DYNAMIC PANEL DATA MODELS

    Panel data are well suited for examining dynamic effects, as in the first-order model,

    $$y_{it} = x_{it}'\beta + \gamma y_{i,t-1} + \alpha_i + \varepsilon_{it} = w_{it}'\delta + \alpha_i + \varepsilon_{it},$$

    where the set of right-hand-side variables, $w_{it}$, now includes the lagged dependent variable, $y_{i,t-1}$. Adding dynamics to a model in this fashion is a major change in the interpretation of the equation. Without the lagged variable, the independent variables represent the full set of information that produce observed outcome $y_{it}$. With the lagged variable, we now have in the equation the entire history of the right-hand-side variables, so that any measured influence is conditioned on this history; in this case, any impact of $x_{it}$ represents the effect of new information. Substantial complications arise in estimation of such a model. In both the fixed and random effects settings, the difficulty is that the lagged dependent variable is correlated with the disturbance, even if it is assumed that $\varepsilon_{it}$ is not itself autocorrelated. For the moment, consider the fixed effects model as an ordinary regression with a lagged dependent variable. We considered this case in Section 5.3.2 as a regression with a stochastic regressor that is dependent across observations. In that dynamic regression model, the estimator based on T observations is biased in finite samples, but it is consistent in T. That conclusion was the main result of Section 5.3.2. The finite sample bias is of order 1/T. The same result applies here, but the difference is that whereas before we obtained our large sample results by allowing T to grow large, in this setting, T is assumed to be small and fixed, and large-sample results are obtained with respect to n growing large, not T. The fixed effects estimator of $\delta = [\beta, \gamma]$ can be viewed as an average of n such estimators. Assume for now that $T \geq K + 1$ where K is the number of variables in $x_{it}$. Then, from (13-4),

    $$\hat{\delta} = \left[\frac{1}{n}\sum_{i=1}^{n} W_i'M^0W_i\right]^{-1}\left[\frac{1}{n}\sum_{i=1}^{n} W_i'M^0y_i\right] = \left[\frac{1}{n}\sum_{i=1}^{n} W_i'M^0W_i\right]^{-1}\left[\frac{1}{n}\sum_{i=1}^{n} W_i'M^0W_i d_i\right] = \sum_{i=1}^{n} F_i d_i,$$

    where the rows of the $T \times (K + 1)$ matrix $W_i$ are $w_{it}'$ and $M^0$ is the $T \times T$ matrix that creates deviations from group means (see (13-5)). Each group-specific estimator, $d_i$, is inconsistent, as it is biased in finite samples and its variance does not go to zero as n increases. This matrix weighted average of n inconsistent estimators will also be inconsistent. (This analysis is only heuristic. If $T < K + 1$, then the individual coefficient vectors cannot be computed. [26])

    [26] Further discussion is given by Nickell (1981), Ridder and Wansbeek (1990), and Kiviet (1995).
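    The fixed-T inconsistency of the fixed effects estimator in a dynamic model can be seen in a small simulation. The sketch below is an illustration under assumed conditions, not a result from the text: a pure autoregression with no other regressors, standard normal disturbances and effects, a true coefficient of 0.5 on the lagged dependent variable, and NumPy as the numerical library. The function names are hypothetical.

```python
import numpy as np

def within_estimator_gamma(y):
    """LSDV (within) estimate of gamma in y_it = gamma * y_i,t-1 + alpha_i + eps_it.

    y has shape (n, T + 1); column 0 holds the initial observation.
    """
    y_lag, y_cur = y[:, :-1], y[:, 1:]
    # Deviations from each unit's own time means sweep out alpha_i.
    y_lag_d = y_lag - y_lag.mean(axis=1, keepdims=True)
    y_cur_d = y_cur - y_cur.mean(axis=1, keepdims=True)
    return float((y_lag_d * y_cur_d).sum() / (y_lag_d ** 2).sum())

def simulate_panel(n, T, gamma, rng, burn_in=50):
    """Simulate n units for T + 1 retained periods after a burn-in."""
    alpha = rng.normal(size=n)
    y = alpha / (1.0 - gamma)  # start each unit at its long-run mean
    cols = []
    for t in range(burn_in + T + 1):
        y = gamma * y + alpha + rng.normal(size=n)
        if t >= burn_in:
            cols.append(y)
    return np.column_stack(cols)

rng = np.random.default_rng(0)
gamma_true = 0.5
# Many units, few periods: the estimate stays well below 0.5 however large n is.
g_short = within_estimator_gamma(simulate_panel(5000, 5, gamma_true, rng))
# Long panel: the bias, of order 1/T, becomes small.
g_long = within_estimator_gamma(simulate_panel(500, 200, gamma_true, rng))
print(g_short, g_long)
```

    Increasing `n` in the short panel tightens the estimate around its biased limit rather than around the true value, which is the sense in which the average of n inconsistent estimators remains inconsistent.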


    The problem is more transparent in the random effects model In the model yit yi t 1 xi t ui it the lagged dependent variable is correlated with the compound disturbance in the model since the same ui enters the equation for every observation in group i Neither of these results renders the model inestimable but they do make necessary some technique other than our familiar LSDV or FGLS estimators The general approach which has been developed in several stages in the literature 27 relies on instrumental variables estimators and most recently by Arellano and Bond 1991 and Arellano and Bover 1995 on a GMM estimator For example in either the xed or random effects cases the heterogeneity can be swept from the model by taking rst differences which produces yit yi t 1 yi t 1 yi t 2 xit xi t 1 it i t 1 This model is still complicated by correlation between the lagged dependent variable and the disturbance and by its rst order moving average disturbance But without the group effects there is a simple instrumental variables estimator available Assuming that the time series is long enough one could use the lagged differences yi t 2 yi t 3 or the lagged levels yi t 2 and yi t 3 as one or two instrumental variables for yi t 1 yi t 2 The other variables can serve as their own instruments By this construction then the treatment of this model is a standard application of the instrumental variables technique that we developed in Section 5 4 28 This illustrates the avor of an instrumental variable approach to estimation But as Arellano et al and Ahn and Schmidt 1995 have shown there is still more information in the sample which can be brought to bear on estimation in the context of a GMM estimator which we now consider We extend the Hausman and Taylor HT formulation of the random effects model to include the lagged dependent variable yit yi t 1 x1it 1 x2it 2 z1i 1 z2i 2 it ui wit it ui wit it where wit yi t 1 x1it x2it z1i z2i is now a 1 K1 K2 L1 L2 1 vector The terms in the 
equation are the same as in the Hausman and Taylor model Instrumental variables estimation of the model without the lagged dependent variable is discussed in the previous section on the HT estimator Moreover by just including yi t 1 in x2it we see that the HT approach extends to this setting as well essentially without modi cation Arellano et al suggest a GMM estimator and show that ef ciency gains are available by using a larger set of moment
27 The model was first proposed in this form by Balestra and Nerlove (1966). See, for example, Anderson and Hsiao (1981, 1982), Bhargava and Sargan (1983), Arellano (1989), Arellano and Bond (1991), Arellano and Bover (1995), Ahn and Schmidt (1995), and Nerlove (2003).
28 There is a question as to whether one should use differences or levels as instruments. Arellano (1989) gives evidence that the latter is preferable.


conditions. In the previous treatment, we used a GMM estimator constructed as follows: The set of moment conditions we used to formulate the instrumental variables were

$$E\left[\begin{pmatrix} x_{1it} \\ x_{2it} \\ z_{1i} \\ \bar{x}_{1i} \end{pmatrix}(\eta_{it} - \bar{\eta}_i)\right] = E\left[\begin{pmatrix} x_{1it} \\ x_{2it} \\ z_{1i} \\ \bar{x}_{1i} \end{pmatrix}(\varepsilon_{it} - \bar{\varepsilon}_i)\right] = 0.$$

This moment condition is used to produce the instrumental variable estimator. We could ignore the nonscalar variance of $\eta_{it}$ and use simple instrumental variables at this point. However, by accounting for the random effects formulation and using the counterpart to feasible GLS, we obtain the more efficient estimator in (13-37). As usual, this can be done in two steps. The inefficient estimator is computed in order to obtain the residuals needed to estimate the variance components. This is Hausman and Taylor's steps 1 and 2. Steps 3 and 4 are the GMM estimator based on these estimated variance components.

Arellano et al. suggest that the preceding does not exploit all the information in the sample. In simple terms, within the $T$ observations in group $i$, we have not used the fact that

$$E\left[\begin{pmatrix} x_{1it} \\ x_{2it} \\ z_{1i} \\ \bar{x}_{1i} \end{pmatrix}(\eta_{is} - \bar{\eta}_i)\right] = 0 \quad \text{for some } s \neq t.$$

Thus, for example, not only are disturbances at time $t$ uncorrelated with these variables at time $t$; arguably, they are uncorrelated with the same variables at time $t-1$, $t-2$, possibly $t+1$, and so on. In principle, the number of valid instruments is potentially enormous. Suppose, for example, that the set of instruments listed above is strictly exogenous with respect to $\eta_{it}$ in every period, including current, lagged, and future. Then there are a total of $[T(K_1 + K_2) + L_1 + K_1]$ moment conditions for every observation on this basis alone. Consider, for example, a panel with two periods. We would have, for the two periods,

$$E\left[\begin{pmatrix} x_{1i1} \\ x_{2i1} \\ x_{1i2} \\ x_{2i2} \\ z_{1i} \\ \bar{x}_{1i} \end{pmatrix}(\eta_{i1} - \bar{\eta}_i)\right] = E\left[\begin{pmatrix} x_{1i1} \\ x_{2i1} \\ x_{1i2} \\ x_{2i2} \\ z_{1i} \\ \bar{x}_{1i} \end{pmatrix}(\eta_{i2} - \bar{\eta}_i)\right] = 0. \tag{13-38}$$

How much useful information is brought to bear on estimation of the parameters is uncertain, as it depends on the correlation of the instruments with the included exogenous variables in the equation. The farther apart in time these sets of variables become, the less information is likely to be present. (The literature on this subject contains reference to "strong" versus "weak" instrumental variables.$^{29}$) In order to proceed, as noted, we can include the lagged dependent variable in $x_{2i}$. This set of instrumental variables can be used to construct the estimator, actually whether the lagged variable is present or not. We note, at this point, that on this basis, Hausman and Taylor's estimator did not
29 See West (2001).


actually use all the information available in the sample. We now have the elements of the Arellano et al. estimator in hand; what remains is essentially the (unfortunately, fairly involved) algebra, which we now develop.

Let

$$W_i = \begin{bmatrix} w_{i1}' \\ w_{i2}' \\ \vdots \\ w_{iT_i}' \end{bmatrix} = \text{the full set of rhs data for group } i, \quad \text{and} \quad y_i = \begin{bmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{iT} \end{bmatrix}.$$

Note that $W_i$ is assumed to be a $T \times (1 + K_1 + K_2 + L_1 + L_2)$ matrix. Since there is a lagged dependent variable in the model, it must be assumed that there are actually $T + 1$ observations available on $y_{it}$. To avoid a cumbersome, cluttered notation, we will leave this distinction embedded in the notation for the moment. Later, when necessary, we will make it explicit. It will reappear in the formulation of the instrumental variables. A total of $T$ observations will be available for constructing the IV estimators. We now form a matrix of instrumental variables. Different approaches to this have been considered by Hausman and Taylor (1981), Arellano et al. (1991, 1995, 1999), Ahn and Schmidt (1995), and Amemiya and MaCurdy (1986), among others. We will form a matrix $V_i$ consisting of $T_i - 1$ rows constructed the same way for $T_i - 1$ observations and a final row that will be different, as discussed below. [This is to exploit a useful algebraic result discussed by Arellano and Bover (1995).] The matrix will be of the form

$$V_i = \begin{bmatrix} v_{i1}' & 0' & \cdots & 0' \\ 0' & v_{i2}' & \cdots & 0' \\ \vdots & \vdots & \ddots & \vdots \\ 0' & 0' & \cdots & a_i' \end{bmatrix}. \tag{13-39}$$

The instrumental variable sets contained in $v_{it}'$ which have been suggested might include the following from within the model:

    $x_{it}$ and $x_{i,t-1}$ (i.e., current and one lag of all the time varying variables)
    $x_{i1}, \ldots, x_{iT}$ (i.e., all current, past, and future values of all the time varying variables)
    $x_{i1}, \ldots, x_{it}$ (i.e., all current and past values of all the time varying variables)

The time invariant variables that are uncorrelated with $u_i$, that is, $z_{1i}$, are appended at the end of the nonzero part of each of the first $T - 1$ rows. It may seem that including $x_2$ in the instruments would be invalid. However, we will be converting the disturbances to deviations from group means, which are free of the latent effects; that is, this set of moment conditions will ultimately be converted to what appears in (13-38). While the variables are correlated with $u_i$ by construction, they are not correlated with $\varepsilon_{it} - \bar{\varepsilon}_i$. The final row of $V_i$ is important to the construction. Two possibilities have been suggested:

    $a_i' = [z_{1i}' \;\; \bar{x}_{1i}']$ (produces the Hausman and Taylor estimator)
    $a_i' = [z_{1i}' \;\; x_{1i,1}', x_{1i,2}', \ldots, x_{1i,T}']$ (produces Amemiya and MaCurdy's estimator)


Note that the variables in $a_i$ are the exogenous time invariant variables, $z_{1i}$, and the exogenous time varying variables, either condensed into the single group mean or in the raw form, with the full set of $T$ observations.

To construct the estimator, we will require a transformation matrix, $H$, constructed as follows. Let $M^{01}$ denote the first $T - 1$ rows of $M^0$, the matrix that creates deviations from group means. Then

$$H = \begin{bmatrix} M^{01} \\ \frac{1}{T} i_T' \end{bmatrix}.$$

Thus, $H$ replaces the last row of $M^0$ with a row of $1/T$. The effect is as follows: if $q$ is $T$ observations on a variable, then $Hq$ produces $q^*$ in which the first $T - 1$ observations are converted to deviations from group means and the last observation is the group mean. In particular, let the $T \times 1$ column vector of disturbances be

$$\eta_i = [\eta_{i1}, \eta_{i2}, \ldots, \eta_{iT}]' = [(\varepsilon_{i1} + u_i), (\varepsilon_{i2} + u_i), \ldots, (\varepsilon_{iT} + u_i)]';$$

then

$$H\eta_i = \begin{bmatrix} \eta_{i1} - \bar{\eta}_i \\ \vdots \\ \eta_{i,T-1} - \bar{\eta}_i \\ \bar{\eta}_i \end{bmatrix}.$$

We can now construct the moment conditions. With all this machinery in place, we have the result that appears in (13-40), that is,

$$E[V_i'H\eta_i] = E[g_i] = 0.$$

It is useful to expand this for a particular case. Suppose $T = 3$ and we use as instruments the current values in Period 1, the current and previous values in Period 2, and the Hausman and Taylor form for the invariant variables. Then the preceding is

$$E\left[\begin{pmatrix}
x_{1i1} & 0 & 0 \\
x_{2i1} & 0 & 0 \\
z_{1i} & 0 & 0 \\
0 & x_{1i1} & 0 \\
0 & x_{2i1} & 0 \\
0 & x_{1i2} & 0 \\
0 & x_{2i2} & 0 \\
0 & z_{1i} & 0 \\
0 & 0 & z_{1i} \\
0 & 0 & \bar{x}_{1i}
\end{pmatrix}
\begin{pmatrix} \eta_{i1} - \bar{\eta}_i \\ \eta_{i2} - \bar{\eta}_i \\ \bar{\eta}_i \end{pmatrix}\right] = 0. \tag{13-40}$$
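The effect of the transformation matrix $H$ described above can be checked numerically. The following is a minimal sketch in plain Python; the function names `build_h` and `apply_h` and the data vector `q` are illustrative assumptions, not part of the text.

```python
def build_h(T):
    """Return the T x T matrix H: the first T-1 rows of M0 = I - (1/T) i i',
    followed by a final row in which every element is 1/T."""
    h = []
    for r in range(T - 1):
        # Row r of M0: 1 - 1/T on the diagonal, -1/T elsewhere.
        h.append([(1.0 if c == r else 0.0) - 1.0 / T for c in range(T)])
    # The final row produces the group mean.
    h.append([1.0 / T] * T)
    return h


def apply_h(h, q):
    """Multiply H by the column vector q."""
    return [sum(h_rc * q_c for h_rc, q_c in zip(row, q)) for row in h]


q = [1.0, 2.0, 3.0, 6.0]            # hypothetical observations for one group, T = 4
hq = apply_h(build_h(len(q)), q)    # three deviations from the mean, then the mean
print(hq)
```

The first $T - 1$ elements of the result are deviations from the group mean, and the last element is the group mean itself, which is why the final row of $V_i$ can carry the time invariant instruments.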


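This is the same as (13-38).$^{30}$ The empirical moment condition that follows from this is

$$\text{plim}\,\frac{1}{n}\sum_{i=1}^{n} V_i'H\eta_i = \text{plim}\,\frac{1}{n}\sum_{i=1}^{n} V_i'H \begin{pmatrix}
y_{i1} - \gamma y_{i0} - x_{1i1}'\beta_1 - x_{2i1}'\beta_2 - z_{1i}'\alpha_1 - z_{2i}'\alpha_2 \\
y_{i2} - \gamma y_{i1} - x_{1i2}'\beta_1 - x_{2i2}'\beta_2 - z_{1i}'\alpha_1 - z_{2i}'\alpha_2 \\
\vdots \\
y_{iT} - \gamma y_{i,T-1} - x_{1iT}'\beta_1 - x_{2iT}'\beta_2 - z_{1i}'\alpha_1 - z_{2i}'\alpha_2
\end{pmatrix} = 0.$$

Write this as

$$\text{plim}\,\frac{1}{n}\sum_{i=1}^{n} m_i = \text{plim}\,\bar{m} = 0.$$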

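The GMM estimator $\hat{\delta}$ is then obtained by minimizing

$$q = \bar{m}'A\bar{m}$$

with an appropriate choice of the weighting matrix, $A$. The optimal weighting matrix will be the inverse of the asymptotic covariance matrix of $\sqrt{n}\,\bar{m}$. With a consistent estimator of $\delta$ in hand, this can be estimated empirically using

$$\text{Est.Asy.Var}[\sqrt{n}\,\bar{m}] = \frac{1}{n}\sum_{i=1}^{n} \hat{m}_i\hat{m}_i' = \frac{1}{n}\sum_{i=1}^{n} V_i'H\hat{\eta}_i\hat{\eta}_i'H'V_i.$$

This is a robust estimator that allows an unrestricted $T \times T$ covariance matrix for the $T$ disturbances, $\varepsilon_{it} + u_i$. But we have assumed that this covariance matrix is the $\Sigma$ defined in (13-20) for the random effects model. To use this information we would, instead, use the residuals in

$$\hat{\eta}_i = y_i - W_i\hat{\delta}$$

to estimate $\sigma_u^2$ and $\sigma_\varepsilon^2$ and then $\hat{\Sigma}$, which produces

$$\text{Est.Asy.Var}[\sqrt{n}\,\bar{m}] = \frac{1}{n}\sum_{i=1}^{n} V_i'H\hat{\Sigma}H'V_i.$$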

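We now have the full set of results needed to compute the GMM estimator. The solution to the optimization problem of minimizing $q$ with respect to the parameter vector $\delta$ is

$$\hat{\delta}_{GMM} = \left[\left(\sum_{i=1}^{n} W_i'H'V_i\right)\left(\sum_{i=1}^{n} V_i'H\hat{\Sigma}H'V_i\right)^{-1}\left(\sum_{i=1}^{n} V_i'HW_i\right)\right]^{-1} \left[\left(\sum_{i=1}^{n} W_i'H'V_i\right)\left(\sum_{i=1}^{n} V_i'H\hat{\Sigma}H'V_i\right)^{-1}\left(\sum_{i=1}^{n} V_i'Hy_i\right)\right]. \tag{13-41}$$

The estimator of the asymptotic covariance matrix for $\hat{\delta}$ is the inverse matrix in brackets.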
30 In some treatments [e.g., Blundell and Bond (1998)], an additional condition is assumed for the initial value, $y_{i0}$, namely $E[y_{i0} \mid \text{exogenous data}] = \mu_0$. This would add a row at the top of the matrix in (13-38) containing $[(y_{i0} - \mu_0), 0, 0]$.


The remaining loose end is how to obtain the consistent estimator of $\delta$ needed to compute $\hat{\Sigma}$. Recall that the GMM estimator is consistent with any positive definite weighting matrix, $A$, in our expression above. Therefore, for an initial estimator, we could set $A = I$ and use the simple instrumental variables estimator,

$$\hat{\delta}_{IV} = \left[\left(\sum_{i=1}^{n} W_i'H'V_i\right)\left(\sum_{i=1}^{n} V_i'HW_i\right)\right]^{-1}\left(\sum_{i=1}^{n} W_i'H'V_i\right)\left(\sum_{i=1}^{n} V_i'Hy_i\right).$$

It is more common to proceed directly to the "two stage least squares" estimator (see Chapter 15), which uses

$$A = \left(\frac{1}{n}\sum_{i=1}^{n} V_i'HH'V_i\right)^{-1}.$$

The estimator is, then, the one given earlier in (13-41) with $\hat{\Sigma}$ replaced by $I_T$. Either estimator is a function of the sample data only and provides the initial estimator we need.

Ahn and Schmidt (among others) observed that the IV estimator proposed here, as extensive as it is, still neglects quite a lot of information and is therefore (relatively) inefficient. For example, in the first differenced model,

$$E[y_{is}(\varepsilon_{it} - \varepsilon_{i,t-1})] = 0, \quad s = 0, \ldots, t-2, \quad t = 2, \ldots, T.$$

That is, the level of $y_{is}$ is uncorrelated with the differences of disturbances that are at least two periods subsequent.$^{31}$ (The differencing transformation, like the transformation to deviations from group means, removes the individual effect.) The corresponding moment equations that can enter the construction of a GMM estimator are

$$\frac{1}{n}\sum_{i=1}^{n} y_{is}\left[(y_{it} - y_{i,t-1}) - \gamma(y_{i,t-1} - y_{i,t-2}) - (x_{it} - x_{i,t-1})'\beta\right] = 0, \quad s = 0, \ldots, t-2, \quad t = 2, \ldots, T.$$

Altogether, Ahn and Schmidt identify $T(T-1)/2 + T - 2$ such equations that involve mixtures of the levels and differences of the variables. The main conclusion that they demonstrate is that in the dynamic model, there is a large amount of information to be gleaned not only from the familiar relationships among the levels of the variables but also from the implied relationships between the levels and the first differences. The issue of correlation between the transformed $y_{it}$ and the deviations of $\varepsilon_{it}$ is discussed in the papers cited. (As Ahn and Schmidt show, there are potentially huge numbers of additional orthogonality conditions in this model owing to the relationship between first differences and second moments. We do not consider those. The matrix $V_i$ could be huge. Consider a model with 10 time varying right-hand side variables and suppose $T_i$ is 15. Then there are 15 rows and roughly $15 \times (10 \times 15)$ or 2,250 columns. The Ahn and Schmidt estimator, which involves potentially thousands of instruments in a model containing only a handful of parameters, may become a bit impractical at this point. The common approach is to use only a small subset of the available instrumental
31 This is the approach suggested by Holtz-Eakin (1988) and Holtz-Eakin, Newey, and Rosen (1988).


variables.) The order of the computation grows as the number of parameters times the square of $T$.

The number of orthogonality conditions (instrumental variables) used to estimate the parameters of the model is determined by the number of variables in $v_{it}$ and $a_i$ in (13-39). In most cases, the model is vastly overidentified: there are far more orthogonality conditions than parameters. As usual in GMM estimation, a test of the overidentifying restrictions can be based on $q$, the estimation criterion. At its minimum, the limiting distribution of $q$ is chi-squared with degrees of freedom equal to the number of instrumental variables in total minus $(1 + K_1 + K_2 + L_1 + L_2)$.$^{32}$
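The counting arguments in this passage are simple arithmetic and can be sketched as follows. The function names are illustrative, and the split of the regressors into $K_1$, $K_2$, $L_1$, and $L_2$ in the last call is a hypothetical choice made only to show the degrees of freedom calculation.

```python
def approx_instrument_columns(T, k_time_varying):
    """Rough column count of V_i when each of the T rows uses all T values
    of every time varying regressor: T blocks of (k * T) columns."""
    return T * (k_time_varying * T)


def ahn_schmidt_conditions(T):
    """Number of level/difference moment equations cited in the text:
    T(T-1)/2 + (T-2)."""
    return T * (T - 1) // 2 + (T - 2)


def overid_degrees_of_freedom(n_instruments, K1, K2, L1, L2):
    """Degrees of freedom for the test of overidentifying restrictions:
    total instruments minus the (1 + K1 + K2 + L1 + L2) parameters."""
    return n_instruments - (1 + K1 + K2 + L1 + L2)


columns = approx_instrument_columns(15, 10)    # the 2,250 quoted in the text
extra = ahn_schmidt_conditions(15)
dof = overid_degrees_of_freedom(columns, K1=6, K2=4, L1=2, L2=1)  # hypothetical split
print(columns, extra, dof)
```

Even in this modest configuration the number of overidentifying restrictions dwarfs the number of parameters, which is the practical motivation for using only a subset of the available instruments.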
Example 13.7 Local Government Expenditure

Dahlberg and Johansson (2000) estimated a model for the local government expenditure of several hundred municipalities in Sweden observed over the nine-year period $t$ = 1979 to 1987. The equation of interest is

$$S_{i,t} = \alpha_t + \sum_{j=1}^{m} \beta_j S_{i,t-j} + \sum_{j=1}^{m} \gamma_j R_{i,t-j} + \sum_{j=1}^{m} \delta_j G_{i,t-j} + f_i + \varepsilon_{i,t}.$$

(We have changed their notation slightly to make it more convenient.) $S_{i,t}$, $R_{i,t}$, and $G_{i,t}$ are municipal spending, receipts (taxes and fees), and central government grants, respectively. Analogous equations are specified for the current values of $R_{i,t}$ and $G_{i,t}$. The appropriate lag length, $m$, is one of the features of interest to be determined by the empirical study. Note that the model contains a municipality specific effect, $f_i$, which is not specified as being either "fixed" or "random." In order to eliminate the individual effect, the model is converted to first differences. The resulting equation has dependent variable $\Delta S_{i,t} = S_{i,t} - S_{i,t-1}$ and a moving average disturbance, $\Delta\varepsilon_{i,t} = \varepsilon_{i,t} - \varepsilon_{i,t-1}$. Estimation is done using the methods developed by Ahn and Schmidt (1995), Arellano and Bover (1995), and Holtz-Eakin, Newey, and Rosen (1988), as described previously. Issues of interest are the lag length, the parameter estimates, and Granger causality tests, which we will revisit (again using this application) in Chapter 19. We will examine this application in detail and obtain some estimates in the continuation of this example in Section 18.5 (GMM Estimation).
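The differencing step in the example can be illustrated on simulated data. The sketch below is not the Dahlberg and Johansson model or data; it assumes a bare first-order dynamic model, $y_{it} = \gamma y_{i,t-1} + f_i + \varepsilon_{it}$, with made-up parameter values, and compares least squares on the differenced equation with the simple instrumental variables estimator that uses the lagged level $y_{i,t-2}$ as the instrument for $\Delta y_{i,t-1}$.

```python
import random


def simulate_panel(n, T, gamma, seed=12345, burn_in=20):
    """Simulate y_it = gamma * y_i,t-1 + f_i + e_it; keep T + 1 values per unit."""
    rng = random.Random(seed)
    panel = []
    for _ in range(n):
        f_i = rng.gauss(0.0, 1.0)              # unit specific effect
        y = 0.0
        series = []
        for t in range(burn_in + T + 1):
            y = gamma * y + f_i + rng.gauss(0.0, 1.0)
            if t >= burn_in:                   # discard the start-up periods
                series.append(y)
        panel.append(series)
    return panel


def difference_estimates(panel):
    """Return (IV, OLS) estimates of gamma from the first differenced equation."""
    iv_num = iv_den = ols_num = ols_den = 0.0
    for y in panel:
        for t in range(2, len(y)):
            dy = y[t] - y[t - 1]               # differencing removes f_i
            dy_lag = y[t - 1] - y[t - 2]       # correlated with e_it - e_i,t-1
            z = y[t - 2]                       # lagged level used as instrument
            iv_num += z * dy
            iv_den += z * dy_lag
            ols_num += dy_lag * dy
            ols_den += dy_lag * dy_lag
    return iv_num / iv_den, ols_num / ols_den


panel = simulate_panel(n=20000, T=5, gamma=0.5)
gamma_iv, gamma_ols = difference_estimates(panel)
print(gamma_iv, gamma_ols)
```

In this design least squares on the differenced data is badly biased because of the moving average disturbance, while the instrumental variables estimate should lie close to the value of $\gamma$ used in the simulation.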

13.7 NONSPHERICAL DISTURBANCES AND ROBUST COVARIANCE ESTIMATION

Since the models considered here are extensions of the classical regression model, we can treat heteroscedasticity in the same way that we did in Chapter 11. That is, we can compute the ordinary or feasible generalized least squares estimators and obtain an appropriate robust covariance matrix estimator, or we can impose some structure on the disturbance variances and use generalized least squares. In the panel data settings, there is greater flexibility for the second of these without making strong assumptions about the nature of the heteroscedasticity. We will discuss this model under the heading of "covariance structures" in Section 13.9. In this section, we will consider robust estimation of the asymptotic covariance matrix for least squares.

13.7.1 ROBUST ESTIMATION OF THE FIXED EFFECTS MODEL

In the fixed effects model, the full regressor matrix is $Z = [X, D]$. The White heteroscedasticity consistent covariance matrix for OLS (that is, for the fixed effects
32 This is true generally in GMM estimation. It was proposed for the dynamic panel data model by Bhargava and Sargan (1983).


estimator) is the lower right block of the partitioned matrix

$$\text{Est.Asy.Var}[b, a] = (Z'Z)^{-1}Z'E^2Z(Z'Z)^{-1},$$

where $E$ is a diagonal matrix of least squares (fixed effects estimator) residuals. This computation promises to be formidable, but fortunately, it works out very simply. The White estimator for the slopes is obtained just by using the data in group mean deviation form [see (13-4) and (13-8)] in the familiar computation of $S_0$ [see (11-7) to (11-9)]. Also, the disturbance variance estimator in (13-8) is the counterpart to the one in (11-3), which we showed, after the appropriate scaling of $\Omega$, was a consistent estimator of $\bar{\sigma}^2 = \text{plim}\,[1/(nT)]\sum_{i=1}^{n}\sum_{t=1}^{T}\sigma_{it}^2$. The implication is that we may still use (13-8) to estimate the variances of the fixed effects.

A somewhat less general but useful simplification of this result can be obtained if we assume that the disturbance variance is constant within the $i$th group. If $E[\varepsilon_{it}^2] = \sigma_i^2$, then, with a panel of data, $\sigma_i^2$ is estimable by $e_i'e_i/T$ using the least squares residuals. (This heteroscedastic regression model was considered at various points in Section 11.7.2.) The center matrix in Est.Asy.Var$[b, a]$ may be replaced with $\sum_i (e_i'e_i/T)Z_i'Z_i$. Whether this estimator is preferable is unclear. If the groupwise model is correct, then it and the White estimator will estimate the same matrix. On the other hand, if the disturbance variances do vary within the groups, then this revised computation may be inappropriate.

Arellano (1987) has taken this analysis a step further. If one takes the $i$th group as a whole, then we can treat the observations in

$$y_i = X_i\beta + \alpha_i i_T + \varepsilon_i$$

as a generalized regression model with disturbance covariance matrix $\Omega_i$. We saw in Section 11.4 that a model this general, with no structure on $\Omega$, offered little hope for estimation, robust or otherwise. But the problem is more manageable with a panel data set. As before, let $X_{i*}$ denote the data in group mean deviation form. The counterpart to $X'\Omega X$ here is

$$X_*'\Omega X_* = \sum_{i=1}^{n} X_{i*}'\Omega_i X_{i*}.$$

By the same reasoning that we used to construct the White estimator in Chapter 12, we can consider estimating $\Omega_i$ with the sample of one, $e_ie_i'$. As before, it is not consistent estimation of the individual $\Omega_i$s that is at issue, but estimation of the sum. If $n$ is large enough, then we could argue that

$$\text{plim}\,\frac{1}{nT}X_*'\Omega X_* = \text{plim}\,\frac{1}{nT}\sum_{i=1}^{n} X_{i*}'\Omega_i X_{i*} = \text{plim}\,\frac{1}{n}\sum_{i=1}^{n}\frac{1}{T}X_{i*}'e_ie_i'X_{i*} = \text{plim}\,\frac{1}{n}\sum_{i=1}^{n}\left(\frac{1}{T}\sum_{t=1}^{T}\sum_{s=1}^{T} e_{it}e_{is}x_{*it}x_{*is}'\right).$$




The result is a combination of the White and Newey-West estimators. But the weights in the latter are 1 rather than $[1 - l/(L + 1)]$ because there is no correlation across the groups, so the sum is actually just an average of finite matrices.
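For a single regressor, the difference between the White and Arellano center terms is easy to see in code. This is a sketch under strong simplifying assumptions (one time varying regressor, a balanced panel, and made-up data); the function names are illustrative. White keeps only the $t = s$ products within a group, while the Arellano form keeps every $(t, s)$ pair, which for a scalar regressor equals the squared within-group sum of $x_{*it}e_{it}$.

```python
def deviations(values):
    """Convert one group's observations to deviations from the group mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def within_fit(x_groups, y_groups):
    """Fixed effects (within) slope for one regressor, plus x* and residuals by group."""
    x_dev = [deviations(x) for x in x_groups]
    y_dev = [deviations(y) for y in y_groups]
    sxx = sum(v * v for g in x_dev for v in g)
    sxy = sum(a * b for gx, gy in zip(x_dev, y_dev) for a, b in zip(gx, gy))
    b = sxy / sxx
    resid = [[yv - b * xv for xv, yv in zip(gx, gy)] for gx, gy in zip(x_dev, y_dev)]
    return b, x_dev, resid, sxx


def white_center(x_dev, resid):
    """Only the t = s terms: sum_i sum_t e_it^2 x*_it^2."""
    return sum((e * x) ** 2 for gx, ge in zip(x_dev, resid) for x, e in zip(gx, ge))


def arellano_center(x_dev, resid):
    """All (t, s) pairs within each group: sum_i sum_t sum_s e_it e_is x*_it x*_is."""
    total = 0.0
    for gx, ge in zip(x_dev, resid):
        for x_t, e_t in zip(gx, ge):
            for x_s, e_s in zip(gx, ge):
                total += e_t * e_s * x_t * x_s
    return total


# Hypothetical data: n = 3 groups, T = 3 periods.
x_groups = [[1.0, 2.0, 4.0], [2.0, 3.0, 7.0], [0.0, 1.0, 5.0]]
y_groups = [[1.2, 2.9, 4.1], [3.5, 3.9, 9.0], [0.4, 2.2, 5.1]]
b, x_dev, resid, sxx = within_fit(x_groups, y_groups)
var_white = white_center(x_dev, resid) / sxx ** 2        # sandwich with (x*'x*)^-1
var_arellano = arellano_center(x_dev, resid) / sxx ** 2
print(b, var_white, var_arellano)
```

With more than one regressor the same sums become $K \times K$ matrices, but the structure of the two center terms is unchanged.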
13.7.2 HETEROSCEDASTICITY IN THE RANDOM EFFECTS MODEL

Since the random effects model is a generalized regression model with a known structure, OLS with a robust estimator of the asymptotic covariance matrix is not the best use of the data. The GLS estimator is efficient whereas the OLS estimator is not. If a perfectly general covariance structure is assumed, then one might simply use Arellano's estimator described in the preceding section with a single overall constant term rather than a set of fixed effects. But, within the setting of the random effects model, $\eta_{it} = \varepsilon_{it} + u_i$, allowing the disturbance variance to vary across groups would seem to be a useful extension.

A series of papers, notably Mazodier and Trognon (1978), Baltagi and Griffin (1988), and the recent monograph by Baltagi (1995, pp. 77-79) suggest how one might allow the group-specific component $u_i$ to be heteroscedastic. But, empirically, there is an insurmountable problem with this approach. In the final analysis, all estimators of the variance components must be based on sums of squared residuals, and, in particular, an estimator of $\sigma_{ui}^2$ would be estimated using a set of residuals from the distribution of $u_i$. However, the data contain only a single observation on $u_i$ repeated in each observation in group $i$. So, the estimators presented, for example, in Baltagi (1995), use, in effect, one residual in each case to estimate $\sigma_{ui}^2$. What appears to be a mean squared residual is only $(1/T)\sum_{t=1}^{T}\hat{u}_i^2 = \hat{u}_i^2$. The properties of this estimator are ambiguous, but efficiency seems unlikely. The estimators do not converge to any population figure as the sample size, even $T$, increases. Heteroscedasticity in the unique component, $\varepsilon_{it}$, represents a more tractable modeling possibility.

In Section 13.4.1, we introduced heteroscedasticity into estimation of the random effects model by allowing the group sizes to vary. But the estimator there (and its feasible counterpart in the next section) would be the same if, instead of $\theta_i = 1 - \sigma_\varepsilon/(T_i\sigma_u^2 + \sigma_\varepsilon^2)^{1/2}$, we were faced with

$$\theta_i = 1 - \frac{\sigma_{\varepsilon i}}{\sqrt{\sigma_{\varepsilon i}^2 + T_i\sigma_u^2}}.$$



Therefore, for computing the appropriate feasible generalized least squares estimator, once again we need only devise consistent estimators for the variance components and then apply the GLS transformation shown above. One possible way to proceed is as follows: Since pooled OLS is still consistent, OLS provides a usable set of residuals. Using the OLS residuals for the specific groups, we would have, for each group,

$$\widehat{\sigma_{\varepsilon i}^2 + u_i^2} = \frac{e_i'e_i}{T}.$$

The residuals from the dummy variable model are purged of the individual specific effect, $u_i$, so $\sigma_{\varepsilon i}^2$ may be consistently (in $T$) estimated with

$$\hat{\sigma}_{\varepsilon i}^2 = \frac{e_i^{lsdv\prime}e_i^{lsdv}}{T},$$

where $e_{it}^{lsdv} = y_{it} - x_{it}'b_{lsdv} - a_i$. Combining terms, then,

$$\hat{\sigma}_u^2 = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{e_i^{ols\prime}e_i^{ols}}{T} - \frac{e_i^{lsdv\prime}e_i^{lsdv}}{T}\right] = \frac{1}{n}\sum_{i=1}^{n}\hat{u}_i^2.$$

We can now compute the FGLS estimator as before.
Example 13.8 Heteroscedasticity Consistent Estimation

The fixed effects estimates for the cost equation are shown in Table 13.2 on page 302. The row of standard errors labeled White (1) are the estimates based on the usual calculation. For two of the three coefficients, these are actually substantially smaller than the least squares results. The estimates labeled White (2) are based on the groupwise heteroscedasticity model suggested earlier. These estimates are essentially the same as White (1). As noted, it is unclear whether this computation is preferable. Of course, if it were known that the groupwise model were correct, then the least squares computation itself would be inefficient and, in any event, a two-step FGLS estimator would be better.

The estimators of $\sigma_{\varepsilon i}^2 + u_i^2$ based on the least squares residuals are 0.16188, 0.44740, 0.26639, 0.90698, 0.23199, and 0.39764. The six individual estimates of $\sigma_{\varepsilon i}^2$ based on the LSDV residuals are 0.0015352, 0.52883, 0.20233, 0.62511, 0.25054, and 0.32482, respectively. Two of the six implied estimates (the second and fifth) of $u_i^2$ are negative based on these results, which suggests that a groupwise heteroscedastic random effects model is not an appropriate specification for these data.
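The subtraction behind the "implied estimates" in this example can be reproduced directly from the twelve figures reported above. The variable names below are illustrative; the numbers are those quoted in the example.

```python
# Estimates of sigma_ei^2 + u_i^2 from the pooled least squares residuals.
ols_based = [0.16188, 0.44740, 0.26639, 0.90698, 0.23199, 0.39764]
# Estimates of sigma_ei^2 from the LSDV residuals.
lsdv_based = [0.0015352, 0.52883, 0.20233, 0.62511, 0.25054, 0.32482]

# Implied estimate of u_i^2 for each firm: the difference of the two.
implied_u2 = [a - b for a, b in zip(ols_based, lsdv_based)]

# Firms (numbered from 1) whose implied estimate is negative.
negative_firms = [i + 1 for i, v in enumerate(implied_u2) if v < 0]

# Averaging the six differences gives the combined estimator of sigma_u^2.
sigma_u2_hat = sum(implied_u2) / len(implied_u2)

print(implied_u2)
print(negative_firms)
print(sigma_u2_hat)
```

The negative entries are the second and fifth, as stated in the example; the average is positive only because the other four differences outweigh them.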
13.7.3 AUTOCORRELATION IN PANEL DATA MODELS

Autocorrelation in the fixed effects model is a minor extension of the model of the preceding chapter. With the LSDV estimator in hand, estimates of the parameters of a disturbance process and transformations of the data to allow FGLS estimation proceed exactly as before. The extension one might consider is to allow the autocorrelation coefficient(s) to vary across groups. But even if so, treating each group of observations as a sample in itself provides the appropriate framework for estimation.

In the random effects model, as before, there are additional complications. The regression model is

$$y_{it} = x_{it}'\beta + \alpha + \varepsilon_{it} + u_i.$$

If $\varepsilon_{it}$ is produced by an AR(1) process, $\varepsilon_{it} = \rho\varepsilon_{i,t-1} + v_{it}$, then the familiar partial differencing procedure we used before would produce$^{33}$

$$\begin{aligned}
y_{it} - \rho y_{i,t-1} &= \alpha(1 - \rho) + (x_{it} - \rho x_{i,t-1})'\beta + \varepsilon_{it} - \rho\varepsilon_{i,t-1} + u_i(1 - \rho) \\
&= \alpha(1 - \rho) + (x_{it} - \rho x_{i,t-1})'\beta + v_{it} + u_i(1 - \rho) \\
&= \alpha(1 - \rho) + (x_{it} - \rho x_{i,t-1})'\beta + v_{it} + w_i.
\end{aligned} \tag{13-42}$$

Therefore, if an estimator of $\rho$ were in hand, then one could at least treat partially differenced observations two through $T$ in each group as the same random effects model that we just examined. Variance estimators would have to be adjusted by a factor of $(1 - \rho)^2$. Two issues remain: (1) how is the estimate of $\rho$ obtained and (2) how does one treat the first observation? For the first of these, the first autocorrelation coefficient of

33 See Lillard and Willis (1978).


the LSDV residuals (so as to purge the residuals of the individual specific effects, $u_i$) is a simple expedient. This estimator will be consistent in $nT$. It is in $T$ alone, but, of course, $T$ is likely to be small. The second question is more difficult. Estimation is simple if the first observation is simply dropped. If the panel contains many groups (large $n$), then omitting the first observation is not likely to cause the inefficiency that it would in a single time series. One can apply the Prais-Winsten transformation to the first observation in each group instead [multiply by $(1 - \rho^2)^{1/2}$], but then an additional complication arises at the second (FGLS) step when the observations are transformed a second time. On balance, the Cochrane-Orcutt estimator is probably a reasonable middle ground. Baltagi (1995, p. 83) discusses the procedure. He also discusses estimation in higher order AR and MA processes.

In the same manner as in the previous section, we could allow the autocorrelation to differ across groups. An estimate of each $\rho_i$ is computable using the group mean deviation data. This estimator is consistent in $T$, which is problematic in this setting. In the earlier case, we overcame this difficulty by averaging over $n$ such "weak" estimates and achieving consistency in the dimension of $n$ instead. We lose that advantage when we allow $\rho$ to vary over the groups. This result is the same one that arose in our treatment of heteroscedasticity.

For the airlines data in our examples, the estimated autocorrelation is 0.5086, which is fairly large. Estimates of the fixed and random effects models using the Cochrane-Orcutt procedure for correcting the autocorrelation are given in Table 13.2. Despite the large value of $r$, the resulting changes in the parameter estimates and standard errors are quite modest.
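The two mechanical steps discussed here, pooling the residual autocorrelation over groups and partially differencing with the first observation dropped, can be sketched as follows. The function names and the residual values are hypothetical, and the denominator used for the autocorrelation (lagged squared residuals only) is one common variant assumed for illustration rather than the text's stated formula.

```python
def pooled_autocorrelation(residuals_by_group):
    """First-order autocorrelation of residuals, pooling lag products over all
    groups. Products never cross a group boundary."""
    num = 0.0
    den = 0.0
    for e in residuals_by_group:
        for t in range(1, len(e)):
            num += e[t] * e[t - 1]
            den += e[t - 1] ** 2
    return num / den


def partial_difference(series, r):
    """Cochrane-Orcutt style transformation for one group: drop the first
    observation and return series[t] - r * series[t-1] for the rest."""
    return [series[t] - r * series[t - 1] for t in range(1, len(series))]


# Hypothetical LSDV residuals for two groups, each decaying by half per period.
residuals = [[2.0, 1.0, 0.5], [-4.0, -2.0, -1.0]]
r = pooled_autocorrelation(residuals)

# Apply the transformation group by group to a hypothetical data series.
y_groups = [[1.0, 2.0, 3.0], [4.0, 4.0, 5.0]]
y_star = [partial_difference(y, r) for y in y_groups]
print(r, y_star)
```

Each group loses one observation, so a panel with $T$ periods contributes $T - 1$ transformed observations per group to the second-step estimation.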

13.8 RANDOM COEFFICIENTS MODELS

Thus far, the model $y_i = X_i\beta + \varepsilon_i$ has been analyzed within the familiar frameworks of heteroscedasticity and autocorrelation. Although the models in Sections 13.3 and 13.4 allow considerable flexibility, they do entail the not entirely plausible assumption that there is no parameter variation across firms (i.e., across the cross-sectional units). A fully general approach would combine all the machinery of the previous sections with a model that allows $\beta$ to vary across firms.

Parameter heterogeneity across individuals or groups can be modeled as stochastic variation.$^{34}$ Suppose that we write

$$y_i = X_i\beta_i + \varepsilon_i, \tag{13-43}$$

where

$$\beta_i = \beta + u_i, \tag{13-44}$$

34 The most widely cited studies are Hildreth and Houck (1968), Swamy (1970, 1971, 1974), Hsiao (1975), and Chow (1984). See also Breusch and Pagan (1979). Some recent discussions are Swamy and Tavlas (1995, 2001) and Hsiao (1986). The model bears some resemblance to the Bayesian approach of Section 16.2.2, but the similarity is only superficial. We maintain our classical approach to estimation.


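and

$$E[u_i \mid X_i] = 0, \qquad E[u_iu_i' \mid X_i] = \Gamma. \tag{13-45}$$

(Note that if only the constant term in $\beta$ is random in this fashion and the other parameters are fixed as before, then this reproduces the random effects model we studied in Section 13.4.) Assume for now that there is no autocorrelation or cross-sectional correlation. Thus, the $\beta_i$ that applies to a particular cross-sectional unit is the outcome of a random process with mean vector $\beta$ and covariance matrix $\Gamma$.$^{35}$ By inserting (13-44) in (13-43) and expanding the result, we find that $\Omega$ is a block diagonal matrix with

$$\Omega_{ii} = E[(y_i - X_i\beta)(y_i - X_i\beta)' \mid X_i] = \sigma_i^2I_T + X_i\Gamma X_i'.$$

We can write the GLS estimator as

$$\hat{\beta} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y = \sum_{i=1}^{n} W_ib_i, \tag{13-46}$$

where

$$W_i = \left[\sum_{j=1}^{n}\left(\Gamma + \sigma_j^2(X_j'X_j)^{-1}\right)^{-1}\right]^{-1}\left(\Gamma + \sigma_i^2(X_i'X_i)^{-1}\right)^{-1}.$$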
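The weighting in (13-46) is easiest to see when there is a single regressor and no constant, so that $\Gamma$, $X_i'X_i$, and $W_i$ are all scalars. The sketch below makes that simplifying assumption, treats $\Gamma$ as known (the value 0.25 is made up), and replaces each $\sigma_i^2$ with the group's own $s_i^2$; the function names and data are illustrative.

```python
def group_ols_scalar(x, y):
    """Least squares slope b_i (no constant) and s_i^2 / (x_i'x_i) for one group."""
    sxx = sum(v * v for v in x)
    b = sum(a * c for a, c in zip(x, y)) / sxx
    resid = [c - b * a for a, c in zip(x, y)]
    s2 = sum(e * e for e in resid) / (len(x) - 1)    # one estimated slope
    return b, s2 / sxx


def random_coefficient_gls(x_groups, y_groups, gamma):
    """Scalar version of (13-46): a weighted average of the b_i with weights
    proportional to 1 / (gamma + s_i^2 / x_i'x_i), normalized to sum to one."""
    stats = [group_ols_scalar(x, y) for x, y in zip(x_groups, y_groups)]
    precisions = [1.0 / (gamma + v) for _, v in stats]
    total = sum(precisions)
    weights = [p / total for p in precisions]
    beta_hat = sum(w * b for w, (b, _) in zip(weights, stats))
    return beta_hat, weights, [b for b, _ in stats]


# Hypothetical data for three groups of unequal length.
x_groups = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0, 4.0], [2.0, 3.0]]
y_groups = [[1.1, 1.9, 3.2], [2.2, 3.9, 6.1, 8.0], [1.0, 1.4]]
beta_hat, weights, b_values = random_coefficient_gls(x_groups, y_groups, gamma=0.25)
print(b_values, weights, beta_hat)
```

Groups whose own slope is estimated precisely receive more weight, but a large $\Gamma$ pushes the weights toward equality, since the variation in the $b_i$ is then dominated by genuine parameter heterogeneity rather than sampling error.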



Empirical implementation of this model requires an estimator of $\Gamma$. One approach [see, e.g., Swamy (1971)] is to use the empirical variance of the set of $n$ least squares estimates, $b_i$, minus the average value of $s_i^2(X_i'X_i)^{-1}$. This matrix may not be positive definite, however, in which case [as Baltagi (1995) suggests], one might drop the second term. The more difficult obstacle is that panels are often short and there may be too few observations to compute $b_i$. More recent applications of random parameter variation have taken a completely different approach based on simulation estimation. [See Section 17.8, McFadden and Train (2000), and Greene (2001).]

Recent research in a number of fields has extended the random parameters model to a "multilevel" model or "hierarchical regression" model by allowing the means of the coefficients to vary with measured covariates. In this formulation, (13-44) becomes

$$\beta_i = \beta + \Delta z_i + u_i.$$

This model retains the earlier stochastic specification but adds the measurement equation to the generation of the random parameters. In principle, this is actually only a minor extension of the model used thus far, as the regression equation would now become

$$y_i = X_i\beta + X_i\Delta z_i + (\varepsilon_i + X_iu_i),$$

which can still be fit by least squares. However, as noted, current applications have found this formulation to be useful in many settings that go beyond the linear model. We will examine an application of this approach in a nonlinear model in Section 17.8.

35 Swamy and Tavlas (2001) label this the "first generation RCM." We'll examine the "second generation" extension at the end of this section.


13.9 COVARIANCE STRUCTURES FOR POOLED TIME-SERIES CROSS-SECTIONAL DATA

Many studies have analyzed data observed across countries or firms in which the number of cross-sectional units is relatively small and the number of time periods is (potentially) relatively large. The current literature in political science contains many applications of this sort. For example, in a cross-country comparison of economic performance over time, Alvarez, Garrett, and Lange (1991) estimated a model of the form

$$\text{performance}_{it} = f(\text{labor organization}_{it}, \text{political organization}_{it}) + \varepsilon_{it}. \tag{13-47}$$

The data set analyzed in Examples 13.1 to 13.5 is an example, in which the costs of six large firms are observed for the same 15 years. The modeling context considered here differs somewhat from the longitudinal data sets considered in the preceding sections. In the typical application to be considered here, it is reasonable to specify a common conditional mean function across the groups, with heterogeneity taking the form of different variances rather than shifts in the means. Another substantive difference from the longitudinal data sets is that the observational units are often large enough (e.g., countries) that correlation across units becomes a natural part of the specification, whereas in a "panel," it is always assumed away.

In the models we shall examine in this section, the data set consists of $n$ cross-sectional units, denoted $i = 1, \ldots, n$, observed at each of $T$ time periods, $t = 1, \ldots, T$. We have a total of $nT$ observations. In contrast to the preceding sections, most of the asymptotic results we obtain here are with respect to $T \to \infty$. We will assume that $n$ is fixed.

The framework for this analysis is the generalized regression model:

$$y_{it} = x_{it}'\beta + \varepsilon_{it}. \tag{13-48}$$

An essential feature of (13-48) is that we have assumed that $\beta_1 = \beta_2 = \cdots = \beta_n$. It is useful to stack the $n$ time series,

$$y_i = X_i\beta + \varepsilon_i, \quad i = 1, \ldots, n,$$

so that

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}\beta + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}. \tag{13-49}$$

Each submatrix or subvector has $T$ observations. We also specify

$$E[\varepsilon_i \mid X] = 0$$

and

$$E[\varepsilon_i\varepsilon_j' \mid X] = \sigma_{ij}\Omega_{ij}$$

so that a generalized regression model applies to each block of $T$ observations. One new element introduced here is the cross-sectional covariance across the groups. Collecting


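the terms above, we have the full specification,

$$E[\varepsilon \mid X] = 0$$

and

$$E[\varepsilon\varepsilon' \mid X] = \Omega = \begin{bmatrix}
\sigma_{11}\Omega_{11} & \sigma_{12}\Omega_{12} & \cdots & \sigma_{1n}\Omega_{1n} \\
\sigma_{21}\Omega_{21} & \sigma_{22}\Omega_{22} & \cdots & \sigma_{2n}\Omega_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{n1}\Omega_{n1} & \sigma_{n2}\Omega_{n2} & \cdots & \sigma_{nn}\Omega_{nn}
\end{bmatrix}.$$

A variety of models are obtained by varying the structure of $\Omega$.

13.9.1 GENERALIZED LEAST SQUARES ESTIMATION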

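    As we observed in our first encounter with the generalized regression model, the fully general covariance matrix in (13-49), which, as stated, contains nT(nT + 1)/2 parameters, is certainly inestimable. But several restricted forms provide sufficient generality for empirical use. To begin, we assume that there is no correlation across periods, which implies that $\boldsymbol{\Omega}_{ij} = \mathbf{I}$, so that

    $$\boldsymbol{\Omega} = \begin{bmatrix} \sigma_{11}\mathbf{I} & \sigma_{12}\mathbf{I} & \cdots & \sigma_{1n}\mathbf{I} \\ \sigma_{21}\mathbf{I} & \sigma_{22}\mathbf{I} & \cdots & \sigma_{2n}\mathbf{I} \\ \vdots & \vdots & & \vdots \\ \sigma_{n1}\mathbf{I} & \sigma_{n2}\mathbf{I} & \cdots & \sigma_{nn}\mathbf{I} \end{bmatrix}. \tag{13-50}$$

    The generalized least squares estimator of $\boldsymbol{\beta}$ based on a known $\boldsymbol{\Omega}$ would be

    $$\hat{\boldsymbol{\beta}} = [\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X}]^{-1}[\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{y}].$$

    The matrix $\boldsymbol{\Omega}$ can be written as

    $$\boldsymbol{\Omega} = \boldsymbol{\Sigma} \otimes \mathbf{I}, \tag{13-51}$$

    where $\boldsymbol{\Sigma}$ is the n × n matrix $[\sigma_{ij}]$ (note the contrast to (13-21), where $\boldsymbol{\Omega} = \mathbf{I}_n \otimes \boldsymbol{\Sigma}$). Then

    $$\boldsymbol{\Omega}^{-1} = \boldsymbol{\Sigma}^{-1} \otimes \mathbf{I} = \begin{bmatrix} \sigma^{11}\mathbf{I} & \sigma^{12}\mathbf{I} & \cdots & \sigma^{1n}\mathbf{I} \\ \sigma^{21}\mathbf{I} & \sigma^{22}\mathbf{I} & \cdots & \sigma^{2n}\mathbf{I} \\ \vdots & \vdots & & \vdots \\ \sigma^{n1}\mathbf{I} & \sigma^{n2}\mathbf{I} & \cdots & \sigma^{nn}\mathbf{I} \end{bmatrix}, \tag{13-52}$$

    where $\sigma^{ij}$ denotes the ijth element of $\boldsymbol{\Sigma}^{-1}$. This provides a specific form for the estimator,

    $$\hat{\boldsymbol{\beta}} = \left[\sum_{i=1}^{n}\sum_{j=1}^{n} \sigma^{ij}\mathbf{X}_i'\mathbf{X}_j\right]^{-1} \left[\sum_{i=1}^{n}\sum_{j=1}^{n} \sigma^{ij}\mathbf{X}_i'\mathbf{y}_j\right]. \tag{13-53}$$

    The asymptotic covariance matrix of the GLS estimator is the inverse matrix in brackets.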
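
The equivalence between the stacked form of the GLS estimator and the double sums in (13-53) can be checked numerically. The following is a minimal sketch on simulated data; the function names, the dimensions, and the value chosen for Sigma are illustrative assumptions, not part of the text.

```python
import numpy as np

def gls_kronecker(X_blocks, y_blocks, Sigma):
    """GLS from the stacked data, with Omega = Sigma kron I_T as in (13-51)."""
    T = X_blocks[0].shape[0]
    X = np.vstack(X_blocks)            # units stacked one after another, as in (13-49)
    y = np.concatenate(y_blocks)
    Omega_inv = np.kron(np.linalg.inv(Sigma), np.eye(T))   # (13-52)
    A = X.T @ Omega_inv @ X
    c = X.T @ Omega_inv @ y
    return np.linalg.solve(A, c), np.linalg.inv(A)

def gls_double_sum(X_blocks, y_blocks, Sigma):
    """GLS from (13-53): only K x K and K x 1 moment matrices are formed."""
    n, K = len(X_blocks), X_blocks[0].shape[1]
    S_inv = np.linalg.inv(Sigma)
    A, c = np.zeros((K, K)), np.zeros(K)
    for i in range(n):
        for j in range(n):
            A += S_inv[i, j] * (X_blocks[i].T @ X_blocks[j])
            c += S_inv[i, j] * (X_blocks[i].T @ y_blocks[j])
    return np.linalg.solve(A, c), np.linalg.inv(A)

# Simulated panel (illustrative values only): n = 3 units, T = 12 periods, K = 2.
rng = np.random.default_rng(0)
n, T = 3, 12
Sigma = np.array([[2.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])
beta = np.array([1.0, 2.0])
X_blocks = [np.column_stack([np.ones(T), rng.normal(size=T)]) for _ in range(n)]
eps = rng.multivariate_normal(np.zeros(n), Sigma, size=T)   # row t holds the n disturbances at time t
y_blocks = [X_blocks[i] @ beta + eps[:, i] for i in range(n)]

b_kron, V_kron = gls_kronecker(X_blocks, y_blocks, Sigma)
b_sum, V_sum = gls_double_sum(X_blocks, y_blocks, Sigma)
```

The double-sum version never builds the nT × nT matrix, which is the practical reason for writing the estimator as in (13-53).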

    13.9.2 FEASIBLE GLS ESTIMATION

    As always in the generalized linear regression model, the slope coefficients, $\boldsymbol{\beta}$, can be consistently, if not efficiently, estimated by ordinary least squares. A consistent estimator of $\sigma_{ij}$ can be based on the sample analog to the result

    $$E[\varepsilon_{it}\varepsilon_{jt}] = E\left[\frac{\boldsymbol{\varepsilon}_i'\boldsymbol{\varepsilon}_j}{T}\right] = \sigma_{ij}.$$

    Using the least squares residuals, we have

    $$\hat{\sigma}_{ij} = \frac{\mathbf{e}_i'\mathbf{e}_j}{T}. \tag{13-54}$$

    Some treatments use T − K instead of T in the denominator of $\hat{\sigma}_{ij}$.$^{36}$ There is no problem created by doing so, but the resulting estimator is not unbiased regardless. Note that this estimator is consistent in T. Increasing T increases the information in the sample, while increasing n increases the number of variance and covariance parameters to be estimated. To compute the FGLS estimators for this model, we require the full set of sample moments, $\mathbf{y}_i'\mathbf{y}_j$, $\mathbf{X}_i'\mathbf{X}_j$, and $\mathbf{X}_i'\mathbf{y}_j$ for all pairs of cross-sectional units. With $\hat{\sigma}_{ij}$ in hand, FGLS may be computed using

    $$\hat{\hat{\boldsymbol{\beta}}} = [\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{X}]^{-1}[\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{y}], \tag{13-55}$$

    where X and y are the stacked data matrices in (13-49); this is done in practice using (13-53) and (13-54), which involve only K × K and K × 1 matrices. The estimated asymptotic covariance matrix for the FGLS estimator is the inverse matrix in brackets in (13-55).

    There is an important consideration to note in feasible GLS estimation of this model. The computation requires inversion of the matrix $\hat{\boldsymbol{\Sigma}}$, where the ijth element is given by (13-54). This matrix is n × n. It is computed from the least squares residuals using

    $$\hat{\boldsymbol{\Sigma}} = \frac{1}{T}\sum_{t=1}^{T} \mathbf{e}_t\mathbf{e}_t' = \frac{1}{T}\mathbf{E}'\mathbf{E},$$

    where $\mathbf{e}_t'$ is a 1 × n vector containing all n residuals for the n groups at time t, placed as the tth row of the T × n matrix of residuals, E. The rank of this matrix cannot be larger than T. Note what happens if n > T. In this case, the n × n matrix has rank T, which is less than n, so it must be singular, and the FGLS estimator cannot be computed. For example, a study of 20 countries each observed for 10 years would be such a case. This result is a deficiency of the data set, not the model. The population matrix, $\boldsymbol{\Sigma}$, is positive definite. But, if there are not enough observations, then the data set is too short to obtain a positive definite estimate of the matrix. The heteroscedasticity model described in the next section can always be computed, however.

    36 See, for example, Kmenta (1986, p. 620). Elsewhere, for example, in Fomby, Hill, and Johnson (1984, p. 327), T is used instead.
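
The rank problem is easy to reproduce. The sketch below uses simulated draws as stand-ins for least squares residuals (an assumption made only for illustration; real residuals are also orthogonal to the regressors, which can lower the rank further) and compares a 20-unit panel observed for 10 periods with one observed for 40 periods.

```python
import numpy as np

def residual_covariance(E):
    """Sigma_hat = E'E / T for a T x n residual matrix E, as in (13-54)."""
    T = E.shape[0]
    return E.T @ E / T

rng = np.random.default_rng(1)
E_short = rng.normal(size=(10, 20))   # T = 10 periods, n = 20 units: n > T
E_long = rng.normal(size=(40, 20))    # T = 40 periods, n = 20 units: n < T

S_short = residual_covariance(E_short)
S_long = residual_covariance(E_long)

rank_short = np.linalg.matrix_rank(S_short)   # cannot exceed T = 10, so S_short is singular
rank_long = np.linalg.matrix_rank(S_long)     # full rank n = 20, so S_long can be inverted
```

In the short panel the 20 × 20 matrix has rank 10, so any attempt to invert it for FGLS fails or returns numerical noise.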

    13.9.3 HETEROSCEDASTICITY AND THE CLASSICAL MODEL

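    Two special cases of this model are of interest. The groupwise heteroscedastic model of Section 11.7.2 results if the off-diagonal terms in $\boldsymbol{\Sigma}$ all equal zero. Then, the GLS estimator, as we saw earlier, is

    $$\hat{\boldsymbol{\beta}} = [\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X}]^{-1}[\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{y}] = \left[\sum_{i=1}^{n}\frac{1}{\sigma_i^2}\mathbf{X}_i'\mathbf{X}_i\right]^{-1}\left[\sum_{i=1}^{n}\frac{1}{\sigma_i^2}\mathbf{X}_i'\mathbf{y}_i\right].$$

    Of course, the disturbance variances, $\sigma_i^2$, are unknown, so the two-step FGLS method noted earlier, now based only on the diagonal elements of $\boldsymbol{\Sigma}$, would be used. The second special case is the classical regression model, which adds the further restriction $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_n^2$. We would now stack the data in the pooled regression model in

    $$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}.$$

    For this simple model, the GLS estimator reduces to pooled ordinary least squares.

    Beck and Katz (1995) suggested that the standard errors for the OLS estimates in this model should be corrected for the possible misspecification that would arise if $\sigma_{ij}\boldsymbol{\Omega}_{ij}$ were correctly specified by (13-49) instead of $\sigma^2\mathbf{I}$, as now assumed. The appropriate asymptotic covariance matrix for OLS in the general case is, as always,

    $$\text{Asy. Var}[\mathbf{b}] = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}.$$

    For the special case of $\boldsymbol{\Omega}_{ij} = \sigma_{ij}\mathbf{I}$,

    $$\text{Asy. Var}[\mathbf{b}] = \left(\sum_{i=1}^{n}\mathbf{X}_i'\mathbf{X}_i\right)^{-1}\left(\sum_{i=1}^{n}\sum_{j=1}^{n}\sigma_{ij}\mathbf{X}_i'\mathbf{X}_j\right)\left(\sum_{i=1}^{n}\mathbf{X}_i'\mathbf{X}_i\right)^{-1}. \tag{13-56}$$

    This estimator is straightforward to compute with estimates of $\sigma_{ij}$ in hand. Since the OLS estimator is consistent, (13-54) may be used to estimate $\sigma_{ij}$.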
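
A minimal sketch of the correction in (13-56), with $\sigma_{ij}$ estimated from the pooled OLS residuals as in (13-54). The function name and the simulated data are assumptions made for illustration only.

```python
import numpy as np

def pooled_ols_pcse(X_blocks, y_blocks):
    """Pooled OLS with the covariance matrix in (13-56), using (13-54) for sigma_ij."""
    n, T = len(X_blocks), X_blocks[0].shape[0]
    X = np.vstack(X_blocks)
    y = np.concatenate(y_blocks)
    XtX_inv = np.linalg.inv(X.T @ X)          # X'X equals the sum of X_i'X_i over units
    b = XtX_inv @ (X.T @ y)
    # T x n matrix of residuals: column i holds the residuals of unit i
    E = np.column_stack([y_blocks[i] - X_blocks[i] @ b for i in range(n)])
    Sigma_hat = E.T @ E / T
    middle = sum(Sigma_hat[i, j] * (X_blocks[i].T @ X_blocks[j])
                 for i in range(n) for j in range(n))
    return b, XtX_inv @ middle @ XtX_inv, Sigma_hat

# Simulated panel with cross-sectionally correlated disturbances (illustrative values).
rng = np.random.default_rng(2)
n, T = 4, 30
L = rng.normal(size=(n, n))
Sigma_true = L @ L.T + np.eye(n)
eps = rng.multivariate_normal(np.zeros(n), Sigma_true, size=T)
X_blocks = [np.column_stack([np.ones(T), rng.normal(size=T)]) for _ in range(n)]
y_blocks = [X_blocks[i] @ np.array([1.0, 0.5]) + eps[:, i] for i in range(n)]

b, V_pcse, Sigma_hat = pooled_ols_pcse(X_blocks, y_blocks)
std_errors = np.sqrt(np.diag(V_pcse))
```

The coefficient vector is ordinary pooled OLS; only the covariance matrix changes.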

    13.9.4 SPECIFICATION TESTS

    We are interested in testing down from the general model to the simpler forms if possible. Since the model specified thus far is distribution free, the standard approaches, such as likelihood ratio tests, are not available. We propose the following procedure. Under the null hypothesis of a common variance, $\sigma^2$ (i.e., the classical model), the Wald statistic for testing the null hypothesis against the alternative of the groupwise heteroscedasticity model would be

    $$W = \sum_{i=1}^{n} \frac{(\hat{\sigma}_i^2 - \sigma^2)^2}{\text{Var}[\hat{\sigma}_i^2]}.$$

    If the null hypothesis is correct, $W \xrightarrow{d} \chi^2[n]$. By hypothesis, $\text{plim}\,\hat{\sigma}^2 = \sigma^2$, where $\hat{\sigma}^2$ is the disturbance variance estimator from the pooled OLS regression. We must now consider $\text{Var}[\hat{\sigma}_i^2]$. Since

    $$\hat{\sigma}_i^2 = \frac{1}{T}\sum_{t=1}^{T} e_{it}^2$$

    is a mean of T observations, we may estimate $\text{Var}[\hat{\sigma}_i^2]$ with

    $$f_{ii} = \frac{1}{T}\,\frac{1}{T-1}\sum_{t=1}^{T}\left(e_{it}^2 - \hat{\sigma}_i^2\right)^2.^{37} \tag{13-57}$$

    The modified Wald statistic is then

    $$W' = \sum_{i=1}^{n} \frac{(\hat{\sigma}_i^2 - \hat{\sigma}^2)^2}{f_{ii}}.$$
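
The modified Wald statistic needs only the T × n matrix of residuals. The following sketch assumes the same set of residuals is used for both the group-specific and the pooled variance estimates; the function name and the simulated data are illustrative.

```python
import numpy as np

def modified_wald(E):
    """Modified Wald statistic W' built from (13-57).

    E is a T x n matrix of residuals; column i holds the residuals of group i.
    """
    T, n = E.shape
    sq = E ** 2
    s2_i = sq.mean(axis=0)                      # group-specific variances, (1/T) sum_t e_it^2
    s2 = sq.mean()                              # pooled variance, e'e / (nT)
    f_ii = ((sq - s2_i) ** 2).sum(axis=0) / (T * (T - 1))   # (13-57)
    W = np.sum((s2_i - s2) ** 2 / f_ii)
    return W, s2_i, f_ii

# Simulated residuals with very different group variances (illustrative only).
rng = np.random.default_rng(5)
E_sim = rng.normal(size=(20, 5)) * np.array([1.0, 0.3, 2.0, 0.3, 2.0])
W_sim, s2_sim, f_sim = modified_wald(E_sim)
```

Under the null hypothesis the statistic would be referred to the chi-squared distribution with n degrees of freedom, as stated above.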



    A Lagrange multiplier statistic is also simple to compute and asymptotically equivalent to a likelihood ratio test; we consider these below. But these assume normality, which we have not yet invoked. To this point, our specification is distribution free. White's general test$^{38}$ is an alternative. To use White's test, we would regress the squared OLS residuals on the P unique variables in x and the squares and cross products, including a constant. The chi-squared statistic, which has P − 1 degrees of freedom, is $(nT)R^2$.

    For the full model with nonzero off-diagonal elements in $\boldsymbol{\Sigma}$, the preceding approach must be modified. One might consider simply adding the corresponding terms for the off-diagonal elements, with a common $\sigma_{ij} = 0$, but this neglects the fact that under this broader alternative hypothesis, the original n variance estimators are no longer uncorrelated, even asymptotically, so the limiting distribution of the Wald statistic is no longer chi-squared. Alternative approaches that have been suggested [see, e.g., Johnson and Wichern (1999, p. 424)] are based on the following general strategy: Under the alternative hypothesis of an unrestricted $\boldsymbol{\Sigma}$, the sample estimate of $\boldsymbol{\Sigma}$ will be $\hat{\boldsymbol{\Sigma}} = [\hat{\sigma}_{ij}]$ as defined in (13-54). Under any restrictive null hypothesis, the estimator of $\boldsymbol{\Sigma}$ will be $\hat{\boldsymbol{\Sigma}}_0$, a matrix that by construction will be larger than $\hat{\boldsymbol{\Sigma}}$ in the matrix sense defined in Appendix A. Statistics based on the "excess variation," such as $T(\hat{\boldsymbol{\Sigma}}_0 - \hat{\boldsymbol{\Sigma}})$, are suggested for the testing procedure. One of these is the likelihood ratio test that we will consider in Section 13.9.6.
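
White's test, as described at the start of this discussion, amounts to one auxiliary regression. The sketch below assumes the regressors are continuous, so that levels, squares, and cross products are all distinct columns; with dummy variables, duplicate columns would have to be dropped first. The function name and the simulated data are illustrative.

```python
import numpy as np

def white_test(e, Z):
    """White's general test: N * R^2 from regressing e^2 on a constant,
    the columns of Z, and their squares and cross products.

    e is the vector of OLS residuals (length N = nT); Z holds the
    non-constant regressors. Returns the statistic and its degrees of freedom.
    """
    N, m = Z.shape
    cols = [np.ones(N)]
    cols += [Z[:, k] for k in range(m)]
    cols += [Z[:, k] * Z[:, l] for k in range(m) for l in range(k, m)]
    W = np.column_stack(cols)                 # P columns in total
    u = e ** 2
    coef = np.linalg.lstsq(W, u, rcond=None)[0]
    fitted = W @ coef
    r2 = 1.0 - np.sum((u - fitted) ** 2) / np.sum((u - u.mean()) ** 2)
    return N * r2, W.shape[1] - 1             # statistic, P - 1 degrees of freedom

# Simulated residuals whose variance depends on the first regressor (illustrative).
rng = np.random.default_rng(3)
N = 100
Z = rng.normal(size=(N, 2))
e = rng.normal(size=N) * np.exp(0.5 * Z[:, 0])
stat, df = white_test(e, Z)
```

With two regressors the auxiliary regression has P = 6 columns, so the statistic has five degrees of freedom.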

    13.9.5 AUTOCORRELATION

    The preceding discussion dealt with heteroscedasticity and cross-sectional correlation. Through a simple modification of the procedures, it is possible to relax the assumption of nonautocorrelation as well. It is simplest to begin with the assumption that

    $$\text{Corr}[\varepsilon_{it}, \varepsilon_{js}] = 0, \quad \text{if } i \neq j.$$

    That is, the disturbances between cross-sectional units are uncorrelated. Now, we can take the approach of Chapter 12 to allow for autocorrelation within the cross-sectional units. That is,

    $$\varepsilon_{it} = \rho_i\varepsilon_{i,t-1} + u_{it}, \qquad \text{Var}[\varepsilon_{it}] = \sigma_i^2 = \frac{\sigma_{ui}^2}{1 - \rho_i^2}. \tag{13-58}$$

    37 Note that would apply strictly if we had observed the true disturbances, $\varepsilon_{it}$. We are using the residuals as estimates of their population counterparts. Since the coefficient vector is consistent, this procedure will obtain the desired results.

    38 See Section 11.4.1.

    For FGLS estimation of the model, suppose that $r_i$ is a consistent estimator of $\rho_i$. Then, if we take each time series $[\mathbf{y}_i, \mathbf{X}_i]$ separately, we can transform the data using the Prais–Winsten transformation:

    $$\mathbf{y}_{*i} = \begin{bmatrix} \sqrt{1 - r_i^2}\,y_{i1} \\ y_{i2} - r_i y_{i1} \\ y_{i3} - r_i y_{i2} \\ \vdots \\ y_{iT} - r_i y_{i,T-1} \end{bmatrix}, \qquad \mathbf{X}_{*i} = \begin{bmatrix} \sqrt{1 - r_i^2}\,\mathbf{x}_{i1} \\ \mathbf{x}_{i2} - r_i\mathbf{x}_{i1} \\ \mathbf{x}_{i3} - r_i\mathbf{x}_{i2} \\ \vdots \\ \mathbf{x}_{iT} - r_i\mathbf{x}_{i,T-1} \end{bmatrix}. \tag{13-59}$$

    In terms of the transformed data $\mathbf{y}_{*i}$ and $\mathbf{X}_{*i}$, the model is now only heteroscedastic; the transformation has removed the autocorrelation. As such, the groupwise heteroscedastic model applies to the transformed data. We may now use weighted least squares, as described earlier. This requires a second least squares estimate. The first, OLS regression produces initial estimates of $\rho_i$. The transformed data are then used in a second least squares regression to obtain consistent estimators,

    $$\hat{\sigma}_{ui}^2 = \frac{\mathbf{e}_{*i}'\mathbf{e}_{*i}}{T} = \frac{(\mathbf{y}_{*i} - \mathbf{X}_{*i}\hat{\boldsymbol{\beta}})'(\mathbf{y}_{*i} - \mathbf{X}_{*i}\hat{\boldsymbol{\beta}})}{T}. \tag{13-60}$$

    Note that both the initial OLS and the second round FGLS estimators of $\boldsymbol{\beta}$ are consistent, so either could be used in (13-60). We have used $\hat{\boldsymbol{\beta}}$ to denote the coefficient vector used, whichever one is chosen. With these results in hand, we may proceed to the calculation of the groupwise heteroscedastic regression in Section 13.9.3. At the end of the calculation, the moment matrix used in the last regression gives the correct asymptotic covariance matrix for the estimator, now $\hat{\hat{\boldsymbol{\beta}}}$. If desired, then a consistent estimator of $\sigma_{\varepsilon i}^2$ is

    $$\hat{\sigma}_{\varepsilon i}^2 = \frac{\hat{\sigma}_{ui}^2}{1 - r_i^2}. \tag{13-61}$$

    The remaining question is how to obtain the initial estimates $r_i$. There are two possible structures to consider. If each group is assumed to have its own autocorrelation coefficient, then the choices are the same ones examined in Chapter 12; the natural choice would be

    $$r_i = \frac{\sum_{t=2}^{T} e_{it}e_{i,t-1}}{\sum_{t=1}^{T} e_{it}^2}.$$
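
The group-specific estimator $r_i$ and the transformation in (13-59) take only a few lines each. This is a minimal sketch for a single unit; the function names are illustrative assumptions.

```python
import numpy as np

def ar1_coefficient(e):
    """r_i = sum_{t=2}^T e_t e_{t-1} / sum_{t=1}^T e_t^2 for one unit's residuals."""
    return (e[1:] @ e[:-1]) / (e @ e)

def prais_winsten(y, X, r):
    """Prais-Winsten transformation of one unit's data, as in (13-59).

    The first observation is scaled by sqrt(1 - r^2); the remaining
    observations are quasi-differenced with coefficient r.
    """
    y_star = np.empty_like(y, dtype=float)
    X_star = np.empty_like(X, dtype=float)
    scale = np.sqrt(1.0 - r ** 2)
    y_star[0] = scale * y[0]
    X_star[0] = scale * X[0]
    y_star[1:] = y[1:] - r * y[:-1]
    X_star[1:] = X[1:] - r * X[:-1]
    return y_star, X_star

# Small worked case: three observations, a constant and one regressor, r = 0.6.
y = np.array([1.0, 2.0, 3.0])
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_star, X_star = prais_winsten(y, X, 0.6)
```

Applying the same function to an identity matrix returns the transformation matrix itself, which gives a direct check that the transformed disturbances of an AR(1) process are serially uncorrelated.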

    If the disturbances have a common stochastic process with the same $\rho_i$, then several estimators of the common $\rho$ are available. One which is analogous to that used in the single equation case is

    $$r = \frac{\sum_{i=1}^{n}\sum_{t=2}^{T} e_{it}e_{i,t-1}}{\sum_{i=1}^{n}\sum_{t=1}^{T} e_{it}^2}. \tag{13-62}$$

    Another consistent estimator would be the sample average of the group-specific estimated autocorrelation coefficients.

    Finally, one may wish to allow for cross-sectional correlation across units. The preceding has a natural generalization. If we assume that

    $$\text{Cov}[u_{it}, u_{jt}] = \sigma_{uij},$$

    then we obtain the original model in (13-49) in which the off-diagonal blocks of $\boldsymbol{\Omega}$ are

    $$\sigma_{ij}\boldsymbol{\Omega}_{ij} = \frac{\sigma_{uij}}{1 - \rho_i\rho_j}\begin{bmatrix} 1 & \rho_j & \rho_j^2 & \cdots & \rho_j^{T-1} \\ \rho_i & 1 & \rho_j & \cdots & \rho_j^{T-2} \\ \rho_i^2 & \rho_i & 1 & \cdots & \rho_j^{T-3} \\ \vdots & \vdots & \vdots & & \vdots \\ \rho_i^{T-1} & \rho_i^{T-2} & \rho_i^{T-3} & \cdots & 1 \end{bmatrix}. \tag{13-63}$$

    Initial estimates of $\rho_i$ are required, as before. The Prais–Winsten transformation renders all the blocks in $\boldsymbol{\Omega}$ diagonal. Therefore, the model of cross-sectional correlation in Section 13.9.2 applies to the transformed data. Once again, the GLS moment matrix obtained at the last step provides the asymptotic covariance matrix for $\hat{\hat{\boldsymbol{\beta}}}$. Estimates of $\sigma_{\varepsilon ij}$ can be obtained from the least squares residual covariances obtained from the transformed data:

    $$\hat{\sigma}_{\varepsilon ij} = \frac{\hat{\sigma}_{uij}}{1 - r_i r_j}, \tag{13-64}$$

    where $\hat{\sigma}_{uij} = \mathbf{e}_{*i}'\mathbf{e}_{*j}/T$.
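
The entries of the block in (13-63) follow from the moving average form of the two AR(1) processes. The short derivation below is a sketch under the added assumptions that $|\rho_i| < 1$ for every unit and that the innovations $u_{it}$ are uncorrelated across time periods, both within and across units.

```latex
\begin{align*}
\varepsilon_{it} &= \sum_{k=0}^{\infty} \rho_i^{k}\, u_{i,t-k},
\qquad
\varepsilon_{jt} = \sum_{k=0}^{\infty} \rho_j^{k}\, u_{j,t-k}, \\
\operatorname{Cov}[\varepsilon_{it}, \varepsilon_{jt}]
  &= \sum_{k=0}^{\infty} (\rho_i \rho_j)^{k}\, \sigma_{uij}
   = \frac{\sigma_{uij}}{1 - \rho_i \rho_j}, \\
\operatorname{Cov}[\varepsilon_{it}, \varepsilon_{js}]
  &= \rho_i^{\,t-s}\, \frac{\sigma_{uij}}{1 - \rho_i \rho_j}
  \quad (t > s), \\
\operatorname{Cov}[\varepsilon_{it}, \varepsilon_{js}]
  &= \rho_j^{\,s-t}\, \frac{\sigma_{uij}}{1 - \rho_i \rho_j}
  \quad (s > t).
\end{align*}
```

The third line uses $\varepsilon_{it} = \rho_i^{t-s}\varepsilon_{is}$ plus innovations dated after s, which are uncorrelated with $\varepsilon_{js}$; the fourth is the mirror image. Powers of $\rho_i$ therefore fill the lower triangle of the block and powers of $\rho_j$ the upper triangle, and setting i = j recovers the variance in (13-58).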

    13.9.6 MAXIMUM LIKELIHOOD ESTIMATION

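    Consider the general model with groupwise heteroscedasticity and cross group correlation. The covariance matrix is the $\boldsymbol{\Omega}$ in (13-49). We now assume that the n disturbances at time t, $\boldsymbol{\varepsilon}_t$, have a multivariate normal distribution with zero mean and this n × n covariance matrix. Taking logs and summing over the T periods gives the log-likelihood for the sample,

    $$\ln L(\boldsymbol{\beta}, \boldsymbol{\Sigma} \mid \text{data}) = -\frac{nT}{2}\ln 2\pi - \frac{T}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{t=1}^{T}\boldsymbol{\varepsilon}_t'\boldsymbol{\Sigma}^{-1}\boldsymbol{\varepsilon}_t, \tag{13-65}$$

    $$\varepsilon_{it} = y_{it} - \mathbf{x}_{it}'\boldsymbol{\beta}, \quad i = 1, \ldots, n.$$

    (This log-likelihood is analyzed at length in Section 14.2.4, so we defer the more detailed analysis until then.) The result is that the maximum likelihood estimator of $\boldsymbol{\beta}$ is the generalized least squares estimator in (13-53). Since the elements of $\boldsymbol{\Sigma}$ must be estimated, the FGLS estimator in (13-54) is used, based on the MLE of $\boldsymbol{\Sigma}$. As shown in Section 14.2.4, the maximum likelihood estimator of $\boldsymbol{\Sigma}$ is

    $$\hat{\sigma}_{ij} = \frac{(\mathbf{y}_i - \mathbf{X}_i\hat{\boldsymbol{\beta}}_{ML})'(\mathbf{y}_j - \mathbf{X}_j\hat{\boldsymbol{\beta}}_{ML})}{T} = \frac{\hat{\boldsymbol{\varepsilon}}_i'\hat{\boldsymbol{\varepsilon}}_j}{T}, \tag{13-66}$$

    based on the MLE of $\boldsymbol{\beta}$. Since each MLE requires the other, how can we proceed to obtain both? The answer is provided by Oberhofer and Kmenta (1974), who show that for certain models, including this one, one can iterate back and forth between the two estimators. (This is the same estimator we used in Section 11.7.2.) Thus, the MLEs are obtained by iterating to convergence between (13-66) and

    $$\hat{\hat{\boldsymbol{\beta}}} = [\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{X}]^{-1}[\mathbf{X}'\hat{\boldsymbol{\Omega}}^{-1}\mathbf{y}].$$

    The process may begin with the (consistent) ordinary least squares estimator, then (13-66), and so on. The computations are simple, using basic matrix algebra. Hypothesis tests about $\boldsymbol{\beta}$ may be done using the familiar Wald statistic. The appropriate estimator of the asymptotic covariance matrix is the inverse matrix in brackets in (13-55).

    For testing the hypothesis that the off-diagonal elements of $\boldsymbol{\Sigma}$ are zero (that is, that there is no correlation across firms) there are three approaches. The likelihood ratio test is based on the statistic

    $$\lambda_{LR} = T\left(\ln|\hat{\boldsymbol{\Sigma}}_{\text{heteroscedastic}}| - \ln|\hat{\boldsymbol{\Sigma}}_{\text{general}}|\right) = T\left(\sum_{i=1}^{n}\ln\hat{\sigma}_i^2 - \ln|\hat{\boldsymbol{\Sigma}}|\right). \tag{13-67}$$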
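
The back-and-forth iteration between (13-66) and the FGLS step, together with the likelihood ratio statistic in (13-67), can be sketched as follows. The function names, the convergence rule, and the simulated data are assumptions made for illustration; the iteration starts from pooled OLS as the text suggests.

```python
import numpy as np

def fgls_step(X_blocks, y_blocks, Sigma):
    """One GLS pass for a given Sigma, using the double sums in (13-53)."""
    n, K = len(X_blocks), X_blocks[0].shape[1]
    S_inv = np.linalg.inv(Sigma)
    A, c = np.zeros((K, K)), np.zeros(K)
    for i in range(n):
        for j in range(n):
            A += S_inv[i, j] * (X_blocks[i].T @ X_blocks[j])
            c += S_inv[i, j] * (X_blocks[i].T @ y_blocks[j])
    return np.linalg.solve(A, c)

def sigma_step(X_blocks, y_blocks, beta, diagonal=False):
    """Residual covariance matrix (13-66) for a given beta.

    With diagonal=True the off-diagonal elements are set to zero,
    which gives the groupwise heteroscedastic model.
    """
    E = np.column_stack([y - X @ beta for X, y in zip(X_blocks, y_blocks)])
    S = E.T @ E / E.shape[0]
    return np.diag(np.diag(S)) if diagonal else S

def iterate_ml(X_blocks, y_blocks, diagonal=False, tol=1e-10, max_iter=1000):
    """Iterate between the beta step and the Sigma step until beta settles."""
    X, y = np.vstack(X_blocks), np.concatenate(y_blocks)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]          # pooled OLS start
    for _ in range(max_iter):
        Sigma = sigma_step(X_blocks, y_blocks, beta, diagonal)
        beta_new = fgls_step(X_blocks, y_blocks, Sigma)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, sigma_step(X_blocks, y_blocks, beta, diagonal)

# Simulated panel with strong cross-unit correlation (illustrative values).
rng = np.random.default_rng(42)
n, T = 3, 60
Sigma_true = np.array([[4.0, 1.6, 2.4],
                       [1.6, 1.0, 1.2],
                       [2.4, 1.2, 9.0]])
eps = rng.multivariate_normal(np.zeros(n), Sigma_true, size=T)
X_blocks = [np.column_stack([np.ones(T), rng.normal(size=T)]) for _ in range(n)]
y_blocks = [X_blocks[i] @ np.array([1.0, 0.5]) + eps[:, i] for i in range(n)]

beta_het, Sigma_het = iterate_ml(X_blocks, y_blocks, diagonal=True)
beta_gen, Sigma_gen = iterate_ml(X_blocks, y_blocks, diagonal=False)

# Likelihood ratio statistic (13-67) and its degrees of freedom n(n - 1)/2.
lr = T * (np.sum(np.log(np.diag(Sigma_het))) - np.linalg.slogdet(Sigma_gen)[1])
lr_df = n * (n - 1) // 2
```

Using the log-determinant from `np.linalg.slogdet` avoids overflow or underflow when the variances are very large or very small.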

    In (13-67), $\hat{\sigma}_i^2$ are the estimates of $\sigma_i^2$ obtained from the maximum likelihood estimates of the groupwise heteroscedastic model and $\hat{\boldsymbol{\Sigma}}$ is the maximum likelihood estimator in the unrestricted model. (Note how the excess variation produced by the restrictive model is used to construct the test.) The large-sample distribution of the statistic is chi-squared with n(n − 1)/2 degrees of freedom. The Lagrange multiplier test developed by Breusch and Pagan (1980) provides an alternative. The general form of the statistic is

    $$\lambda_{LM} = T\sum_{i=2}^{n}\sum_{j=1}^{i-1} r_{ij}^2, \tag{13-68}$$

    where $r_{ij}$ is the ijth residual correlation coefficient. If every individual had a different parameter vector, then individual specific ordinary least squares would be efficient (and ML) and we would compute $r_{ij}$ from the OLS residuals (assuming that there are sufficient observations for the computation). Here, however, we are assuming only a single parameter vector. Therefore, the appropriate basis for computing the correlations is the residuals from the iterated estimator in the groupwise heteroscedastic model, that is, the same residuals used to compute $\hat{\sigma}_i^2$. (An asymptotically valid approximation to the test can be based on the FGLS residuals instead.) Note that this is not a procedure for testing all the way down to the classical, homoscedastic regression model. That case, which involves different LM and LR statistics, is discussed next. If either the LR statistic in (13-67) or the LM statistic in (13-68) is smaller than the critical value from the table, the conclusion, based on this test, is that the appropriate model is the groupwise heteroscedastic model.

    For the groupwise heteroscedasticity model, ML estimation reduces to groupwise weighted least squares. The maximum likelihood estimator of $\boldsymbol{\beta}$ is feasible GLS. The maximum likelihood estimator of the group specific variances is given by the diagonal element in (13-66), while the cross group covariances are now zero. An additional useful result is provided by the negative of the expected second derivatives matrix of the log-likelihood in (13-65) with diagonal $\boldsymbol{\Sigma}$,

    $$-E[\mathbf{H}(\boldsymbol{\beta}, \sigma_i^2, i = 1, \ldots, n)] = \begin{bmatrix} \displaystyle\sum_{i=1}^{n}\frac{1}{\sigma_i^2}\mathbf{X}_i'\mathbf{X}_i & \mathbf{0} \\ \mathbf{0} & \text{diag}\left(\dfrac{T}{2\sigma_i^4},\ i = 1, \ldots, n\right) \end{bmatrix}.$$

    Since the expected Hessian is block diagonal, the complete set of maximum likelihood estimates can be computed by iterating back and forth between these estimators for $\sigma_i^2$ and the feasible GLS estimator of $\boldsymbol{\beta}$. (This process is also equivalent to using a set of n group dummy variables in Harvey's model of heteroscedasticity in Section 11.7.1.)

    For testing the heteroscedasticity assumption of the model, the full set of test strategies that we have used before is available. The Lagrange multiplier test is probably the most convenient test, since it does not require another regression after the pooled least squares regression. It is convenient to rewrite

    $$\frac{\partial \log L}{\partial \sigma_i^2} = \frac{T}{2\sigma_i^2}\left[\frac{\hat{\sigma}_i^2}{\sigma_i^2} - 1\right],$$

    where $\hat{\sigma}_i^2$ is the ith unit-specific estimate of $\sigma_i^2$ based on the true (but unobserved) disturbances. Under the null hypothesis of equal variances, regardless of what the common restricted estimator of $\sigma_i^2$ is, the first-order condition for equating $\partial \ln L/\partial\boldsymbol{\beta}$ to zero will be the OLS normal equations, so the restricted estimator of $\boldsymbol{\beta}$ is b using the pooled data. To obtain the restricted estimator of $\sigma_i^2$, return to the log-likelihood function. Under the null hypothesis $\sigma_i^2 = \sigma^2$, i = 1, ..., n, the first derivative of the log-likelihood function with respect to this common $\sigma^2$ is

    $$\frac{\partial \log L_R}{\partial \sigma^2} = -\frac{nT}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}\boldsymbol{\varepsilon}_i'\boldsymbol{\varepsilon}_i.$$

    Equating this derivative to zero produces the restricted maximum likelihood estimator

    $$\hat{\sigma}^2 = \frac{1}{nT}\sum_{i=1}^{n}\boldsymbol{\varepsilon}_i'\boldsymbol{\varepsilon}_i = \frac{1}{n}\sum_{i=1}^{n}\hat{\sigma}_i^2,$$

    which is the simple average of the n individual consistent estimators. Using the least squares residuals at the restricted solution, we obtain $\hat{\sigma}^2 = (1/nT)\mathbf{e}'\mathbf{e}$ and $\hat{\sigma}_i^2 = (1/T)\mathbf{e}_i'\mathbf{e}_i$. With these results in hand and using the estimate of the expected Hessian for the covariance matrix, the Lagrange multiplier statistic reduces to

    $$\lambda_{LM} = \sum_{i=1}^{n}\left[\frac{T}{2\hat{\sigma}^2}\left(\frac{\hat{\sigma}_i^2}{\hat{\sigma}^2} - 1\right)\right]^2\left(\frac{2\hat{\sigma}^4}{T}\right) = \frac{T}{2}\sum_{i=1}^{n}\left[\frac{\hat{\sigma}_i^2}{\hat{\sigma}^2} - 1\right]^2.$$

    The statistic has n − 1 degrees of freedom. (It has only n − 1 since the restriction is that the variances are all equal to each other, not a specific value, which is n − 1 restrictions.)

    With the unrestricted estimates, as an alternative test procedure, we may use the Wald statistic. If we assume normality, then the asymptotic variance of each variance estimator is $2\sigma_i^4/T$ and the variances are asymptotically uncorrelated. Therefore, the Wald statistic to test the hypothesis of a common variance $\sigma^2$, using $\hat{\sigma}_i^2$ to estimate $\sigma_i^2$, is

    $$W = \sum_{i=1}^{n}(\hat{\sigma}_i^2 - \sigma^2)^2\left(\frac{2\hat{\sigma}_i^4}{T}\right)^{-1} = \frac{T}{2}\sum_{i=1}^{n}\left[\frac{\sigma^2}{\hat{\sigma}_i^2} - 1\right]^2.$$

    Note the similarity to the Lagrange multiplier statistic. The estimator of the common variance would be the pooled estimator from the first least squares regression. Recall, we produced a general counterpart for this statistic for the case in which disturbances are not normally distributed.

    We can also carry out a likelihood ratio test using the test statistic in Section 12.3.4. The appropriate likelihood ratio statistic is

    $$\lambda_{LR} = T\left(\ln|\hat{\boldsymbol{\Sigma}}_{\text{homoscedastic}}| - \ln|\hat{\boldsymbol{\Sigma}}_{\text{heteroscedastic}}|\right) = (nT)\ln\hat{\sigma}^2 - \sum_{i=1}^{n}T\ln\hat{\sigma}_i^2,$$

    where

    $$\hat{\sigma}^2 = \frac{\mathbf{e}'\mathbf{e}}{nT} \quad \text{and} \quad \hat{\sigma}_i^2 = \frac{\hat{\boldsymbol{\varepsilon}}_i'\hat{\boldsymbol{\varepsilon}}_i}{T},$$

    with all residuals computed using the maximum likelihood estimators. This chi-squared statistic has n − 1 degrees of freedom.
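
The closed forms of the LM and Wald statistics make them simple to evaluate once the variance estimates are available. As an illustration, the sketch below plugs in the group-specific variance estimates that the application in Section 13.9.7 reports in Table 13.5 (the OLS-based row for LM, the heteroscedastic FGLS row for Wald) and the pooled OLS variance from Table 13.4, with T = 20. The variable names are illustrative assumptions.

```python
import numpy as np

T = 20                                   # yearly observations per firm
s2_pooled = 15708.84                     # pooled OLS variance, e'e / (nT)

# Group-specific variance estimates, in the order GM, CH, GE, WE, US.
s2_ols = np.array([9410.91, 755.85, 34288.49, 633.42, 33455.51])
s2_fgls = np.array([8612.14, 409.19, 36563.24, 777.97, 32902.83])

# Lagrange multiplier statistic: (T/2) * sum_i (s_i^2 / s^2 - 1)^2
lm = (T / 2) * np.sum((s2_ols / s2_pooled - 1.0) ** 2)

# Wald statistic: (T/2) * sum_i (s^2 / s_i^2 - 1)^2
wald = (T / 2) * np.sum((s2_pooled / s2_fgls - 1.0) ** 2)

df = len(s2_ols) - 1                     # n - 1 restrictions for the LM test
```

These inputs give an LM value of about 46.63 and a Wald value of about 17,676, in line with the figures quoted in the application. The two small variances (CH and WE) dominate the Wald statistic because they enter through the ratio $\sigma^2/\hat{\sigma}_i^2$.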

    13.9.7 APPLICATION TO GRUNFELD'S INVESTMENT DATA

    To illustrate the techniques developed in this section, we will use a panel of data that has for several decades provided a useful tool for examining multiple equation estimators. Appendix Table F13.1 lists part of the data used in a classic study of investment demand.$^{39}$ The data consist of time series of 20 yearly observations for five firms (of 10 in the original study) and three variables:

    $I_{it}$ = gross investment,
    $F_{it}$ = market value of the firm at the end of the previous year,
    $C_{it}$ = value of the stock of plant and equipment at the end of the previous year.

    All figures are in millions of dollars. The variables $F_{it}$ and $C_{it}$ reflect anticipated profit and the expected amount of replacement investment required.$^{40}$ The model to be estimated with these data is

    $$I_{it} = \beta_1 + \beta_2 F_{it} + \beta_3 C_{it} + \varepsilon_{it},^{41}$$

    where i indexes firms and t indexes years. Different restrictions on the parameters and the variances and covariances of the disturbances will imply different forms of the model. By pooling all 100 observations and estimating the coefficients by ordinary least squares, we obtain the first set of results in Table 13.4. To make the results comparable, all variance estimates and estimated standard errors are based on $\mathbf{e}'\mathbf{e}/(nT)$. There is no degrees of freedom correction. The second set of standard errors given are White's robust estimator [see (10-14) and (10-23)]. The third set of standard errors given above are the robust standard errors based on Beck and Katz (1995), using (13-56) and (13-54).

    39 See Grunfeld (1958) and Grunfeld and Griliches (1960). The data were also used in Boot and deWitt (1960). Although admittedly not current, these data are unusually cooperative for illustrating the different aspects of estimating systems of regression equations.

    40 In the original study, the authors used the notation $F_{t-1}$ and $C_{t-1}$. To avoid possible conflicts with the usual subscripting conventions used here, we have used the preceding notation instead.

    41 Note that we are modeling investment, a flow, as a function of two stocks. This could be a theoretical misspecification; it might be preferable to specify the model in terms of planned investment. But, 40 years after the fact, we'll take the specified model as it is.

    TABLE 13.4  Estimated Parameters and Estimated Standard Errors
    (standard errors in parentheses)

                                    β1           β2            β3
    Homoscedasticity
      Least squares                -48.0297      0.10509       0.30537      R² = 0.77886, σ̂² = 15708.84, log-likelihood = -624.9928
      OLS standard errors          (21.16)       (0.01121)     (0.04285)
      White correction             (15.017)      (0.00915)     (0.05911)
      Beck and Katz                (10.814)      (0.00832)     (0.033043)
    Heteroscedastic
      Feasible GLS                 -36.2537      0.09499       0.33781
                                   (6.1244)      (0.00741)     (0.03023)
      Maximum likelihood           -23.2582      0.09435       0.33371      Pooled σ̂² = 15,853.08, log-likelihood = -564.535
                                   (4.815)       (0.00628)     (0.2204)
    Cross-section correlation
      Feasible GLS                 -28.247       0.089101      0.33401
                                   (4.888)       (0.005072)    (0.01671)
      Maximum likelihood           -2.217        0.02361       0.17095      log-likelihood = -515.422
                                   (1.96)        (0.004291)    (0.01525)
    Autocorrelation model
      Heteroscedastic              -23.811       0.086051      0.33215
                                   (7.694)       (0.009599)    (0.03549)
      Cross-section correlation    -15.424       0.07522       0.33807
                                   (4.595)       (0.005710)    (0.01421)

    The estimates of $\sigma_i^2$ for the model of groupwise heteroscedasticity are shown in Table 13.5. The estimates suggest that the disturbance variance differs widely across firms. To investigate this proposition before fitting an extended model, we can use the tests for homoscedasticity suggested earlier. Based on the OLS results, the LM statistic equals 46.63. The critical value from the chi-squared distribution with four degrees of freedom is 9.49, so on the basis of the LM test, we reject the null hypothesis of homoscedasticity. To compute White's test statistic, we regress the squared least squares residuals on a constant, F, C, F², C², and FC. The R² in this regression is 0.36854, so the chi-squared statistic is $(nT)R^2 = 36.854$ with five degrees of freedom. The five percent critical value from the table for the chi-squared statistic with five degrees of freedom is 11.07, so the null hypothesis is rejected again.

    TABLE 13.5  Estimated Group Specific Variances

                                     σ²_GM        σ²_CH        σ²_GE        σ²_WE        σ²_US
    Based on OLS                     9,410.91     755.85       34,288.49    633.42       33,455.51
    Heteroscedastic FGLS             8,612.14     409.19       36,563.24    777.97       32,902.83
                                     (2897.08)    (136.704)    (5801.17)    (323.357)    (7000.857)
    Heteroscedastic ML               8,657.72     175.80       40,210.96    1,240.03     29,825.21
    Cross Correlation FGLS           10050.52     305.61       34556.6      833.36       34468.98
    Autocorrelation, s²_ui (u_i)     6525.7       253.104      14,620.8     232.76       8,683.9
    Autocorrelation, s²_ei (e_i)     8453.6       270.150      16,073.2     349.68       12,994.2

    The likelihood ratio statistic based on the ML results in Table 13.4 is

    $$\chi^2 = 100\ln s^2 - \sum_{i=1}^{n} 20\ln\hat{\sigma}_i^2 = 120.915.$$

    This result far exceeds the tabled critical value. The Lagrange multiplier statistic based on all variances computed using the OLS residuals is 46.629. The Wald statistic based on the FGLS estimated variances and the pooled OLS estimate (15,708.84) is 17,676.25. We observe the common occurrence of an extremely large Wald test statistic. (If the test is based on the sum of squared FGLS residuals, $\hat{\sigma}^2 = 15{,}853.08$, then W = 18,012.86, which leads to the same conclusion.) To compute the modified Wald statistic absent the assumption of normality, we require the estimates of the variances of the FGLS residual variances. The square roots of $f_{ii}$ are shown in Table 13.5 in parentheses after the FGLS residual variances. The modified Wald statistic is W' = 14,681.3, which is consistent with the other results.

    We proceed to reestimate the regression allowing for heteroscedasticity. The FGLS and maximum likelihood estimates are shown in Table 13.4. (The latter are obtained by iterated FGLS.) Returning to the least squares estimator, we should expect the OLS standard errors to be incorrect, given our findings. There are two possible corrections we can use, the White estimator and direct computation of the appropriate asymptotic covariance matrix. The Beck et al. estimator is a third candidate, but it neglects to use the known restriction that the off-diagonal elements in $\boldsymbol{\Sigma}$ are zero. The various estimates shown at the top of Table 13.5 do suggest that the OLS estimated standard errors have been distorted.

    The correlation matrix for the various sets of residuals, using the estimates in Table 13.4, is given in Table 13.6.$^{42}$ The several quite large values suggest that the more general model will be appropriate. The two test statistics for testing the null hypothesis of a diagonal $\boldsymbol{\Sigma}$, based on the log-likelihood values in Table 13.4, are

    $$\lambda_{LR} = -2(-565.535 - (-515.422)) = 100.226$$

    and, based on the MLEs for the groupwise heteroscedasticity model,

    $$\lambda_{LM} = 66.067$$

    (the MLE of $\boldsymbol{\Sigma}$ based on the coefficients from the heteroscedastic model is not shown). For 10 degrees of freedom, the critical value from the chi-squared table is 23.21, so both results lead to rejection of the null hypothesis of a diagonal $\boldsymbol{\Sigma}$. We conclude that the simple heteroscedastic model is not general enough for these data.

    42 The estimates based on the MLEs are somewhat different, but the results of all the hypothesis tests are the same.

    TABLE 13.6  Estimated Cross-Group Correlations Based on FGLS Estimates
    (Order in each cell is OLS, FGLS heteroscedastic, FGLS correlation, Autocorrelation)

            GM       CH       GE       WE       US
    GM      1
    CH      0.344    1
            0.185
            0.349
            0.225
    GE      0.182    0.283    1
            0.185    0.144
            0.248    0.158
            0.287    0.105
    WE      0.352    0.343    0.890    1
            0.469    0.186    0.881
            0.356    0.246    0.895
            0.467    0.166    0.885
    US      0.121    0.167    0.151    0.085    1
            0.016    0.222    0.122    0.119
            0.716    0.244    0.176    0.040
            0.015    0.245    0.139    0.101

    If the null hypothesis is that the disturbances are both homoscedastic and uncorrelated across groups, then these two tests are inappropriate. A likelihood ratio test can be constructed using the OLS results and the MLEs from the full model; the test statistic would be

    $$\lambda_{LR} = (nT)\ln(\mathbf{e}'\mathbf{e}/nT) - T\ln|\hat{\boldsymbol{\Sigma}}|.$$

    This statistic is just the sum of the LR statistics for the test of homoscedasticity and the statistic given above. For these data, this sum would be 120.915 + 100.226 = 221.141, which is far larger than the critical value, as might be expected.

    FGLS and maximum likelihood estimates for the model with cross-sectional correlation are given in Table 13.4. The estimated disturbance variances have changed dramatically, due in part to the quite large off-diagonal elements. It is noteworthy, however, that despite the large changes in $\hat{\boldsymbol{\Sigma}}$, with the exceptions of the MLEs in the cross-section correlation model, the parameter estimates have not changed very much. (This sample is moderately large and all estimators are consistent, so this result is to be expected.)

    We shall examine the effect of assuming that all five firms have the same slope parameters in Section 14.2.3. For now, we note that one of the effects is to inflate the disturbance correlations. When the Lagrange multiplier statistic in (13-68) is recomputed with firm-by-firm separate regressions, the statistic falls to 29.04, which is still significant, but far less than what we found earlier.

    We now allow for different AR(1) disturbance processes for each firm. The firm specific autocorrelation coefficients of the ordinary least squares residuals are

    $$\mathbf{r}' = (0.478 \quad 0.251 \quad 0.301 \quad 0.578 \quad 0.576).$$

    (An interesting problem arises at this point. If one computes these autocorrelations using the standard formula, then the results can be substantially affected because the group-specific residuals may not have mean zero. Since the population mean is zero if the model is correctly specified, then this point is only minor. As we will explore later, however, this model is not correctly specified for these data. As such, the nonzero residual mean for the group specific residual vectors matters greatly. The vector of autocorrelations computed without using deviations from means is $\mathbf{r}_0 = (0.478, 0.793, 0.905, 0.602, 0.868)$. Three of the five are very different. Which way the computations should be done now becomes a substantive question. The asymptotic theory weighs in favor of (13-62). As a practical matter, in small or moderately sized samples such as this one, as this example demonstrates, the mean deviations are preferable.)

    Table 13.4 also presents estimates for the groupwise heteroscedasticity model and for the full model with cross-sectional correlation, with the corrections for first-order autocorrelation. The lower part of the table displays the recomputed group specific variances and cross-group correlations.
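
The effect of a nonzero residual mean on the estimated autocorrelation is easy to see in a small constructed case. The sketch below compares the two ways of computing the coefficient for one group; the function names and the six-observation series are illustrative assumptions, not the Grunfeld residuals.

```python
import numpy as np

def autocorr_raw(e):
    """Autocorrelation computed from the residuals as they are (no demeaning)."""
    return (e[1:] @ e[:-1]) / (e @ e)

def autocorr_demeaned(e):
    """Autocorrelation computed from deviations from the group mean."""
    d = e - e.mean()
    return (d[1:] @ d[:-1]) / (d @ d)

# A group whose residuals alternate around a mean of 5 rather than 0, as can
# happen when a single pooled intercept is imposed on every group.
e_group = np.array([6.0, 4.0, 6.0, 4.0, 6.0, 4.0])

r_raw = autocorr_raw(e_group)            # 120/156: dominated by the nonzero mean
r_demeaned = autocorr_demeaned(e_group)  # -5/6: the alternation around the mean
```

The raw coefficient is strongly positive only because every residual shares the same offset; once the group mean is removed, the series is seen to be negatively autocorrelated.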

    13.9.8 SUMMARY

    The preceding sections have suggested a variety of different specifications of the generalized regression model. Which ones apply in a given situation depends on the setting. Homoscedasticity will depend on the nature of the data and will often be directly observable at the outset. Uncorrelatedness across the cross-sectional units is a strong assumption, particularly because the model assigns the same parameter vector to all units. Autocorrelation is a qualitatively different property. Although it does appear to arise naturally in time-series data, one would want to look carefully at the data and the model specification before assuming that it is present. The properties of all these estimators depend on an increase in T, so they are generally not well suited to the types of data sets described in Sections 13.2–13.8.

    Beck et al. (1993) suggest several problems that might arise when using this model in small samples. If T < n, then with or without a correction for autocorrelation, the matrix $\hat{\boldsymbol{\Sigma}}$ is an n × n matrix of rank T (or less) and is thus singular, which precludes FGLS estimation. A preferable approach then might be to use pooled OLS and make the appropriate correction to the asymptotic covariance matrix. But in this situation, there remains the possibility of accommodating cross unit heteroscedasticity. One could use the groupwise heteroscedasticity model. The estimators will be consistent and more efficient than OLS, although the standard errors will be inappropriate if there is cross-sectional correlation. An appropriate estimator that extends (11-17) would be

    $$\begin{aligned} \text{Est. Var}[\mathbf{b}] &= [\mathbf{X}'\hat{\mathbf{V}}^{-1}\mathbf{X}]^{-1}[\mathbf{X}'\hat{\mathbf{V}}^{-1}\hat{\boldsymbol{\Omega}}\hat{\mathbf{V}}^{-1}\mathbf{X}][\mathbf{X}'\hat{\mathbf{V}}^{-1}\mathbf{X}]^{-1} \\ &= \left[\sum_{i=1}^{n}\frac{1}{\hat{\sigma}_{ii}}\mathbf{X}_i'\mathbf{X}_i\right]^{-1}\left[\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{\hat{\sigma}_{ij}}{\hat{\sigma}_{ii}\hat{\sigma}_{jj}}\mathbf{X}_i'\mathbf{X}_j\right]\left[\sum_{i=1}^{n}\frac{1}{\hat{\sigma}_{ii}}\mathbf{X}_i'\mathbf{X}_i\right]^{-1} \\ &= \left[\sum_{i=1}^{n}\frac{1}{\hat{\sigma}_{ii}}\mathbf{X}_i'\mathbf{X}_i\right]^{-1}\left[\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{r_{ij}^2}{\hat{\sigma}_{ij}}\mathbf{X}_i'\mathbf{X}_j\right]\left[\sum_{i=1}^{n}\frac{1}{\hat{\sigma}_{ii}}\mathbf{X}_i'\mathbf{X}_i\right]^{-1}. \end{aligned}$$

    (Note that this estimator bases all estimates on the model of groupwise heteroscedasticity, but it is robust to the possibility of cross-sectional correlation.) When n is large relative to T, the number of estimated parameters in the autocorrelation model becomes very large relative to the number of observations. Beck and Katz (1995) found that as a consequence, the estimated asymptotic covariance matrix for the FGLS slopes tends to underestimate the true variability of the estimator. They suggest two compromises. First, use OLS and the appropriate covariance matrix, and second, impose the restriction of equal autocorrelation coefficients across groups.

    13.10 SUMMARY AND CONCLUSIONS

    The preceding has shown a few of the extensions of the classical model that can be obtained when panel data are available. In principle, any of the models we have examined before this chapter and all those we will consider later, including the multiple equation models, can be extended in the same way. The main advantage, as we noted at the outset, is that with panel data, one can formally model the heterogeneity across groups that is typical in microeconomic data. We will find in Chapter 14 that to some extent this model of heterogeneity can be misleading. What might have appeared at one level to be differences in the variances of the disturbances across groups may well be due to heterogeneity of a different sort, associated with the coefficient vectors. We will consider this possibility in the next chapter. We will also examine some additional models for disturbance processes that arise naturally in a multiple equations context but are actually more general cases of some of the models we looked at above, such as the model of groupwise heteroscedasticity.

    Key Terms and Concepts

    Arellano, Bond, and Bover estimator
    Between-groups estimator
    Contrasts
    Covariance structures
    Dynamic panel data model
    Feasible GLS
    Fixed effects model
    Generalized least squares
    GMM estimator
    Group means
    Group means estimator
    Groupwise heteroscedasticity
    Hausman test
    Hausman and Taylor estimator
    Heterogeneity
    Hierarchical regression
    Individual effect
    Instrumental variables estimator
    Least squares dummy variable model
    LM test
    LR test
    Longitudinal data sets
    Matrix weighted average
    Maximum likelihood
    Panel data
    Pooled regression
    Random coefficients
    Random effects model
    Robust covariance matrix
    Unbalanced panel
    Wald test
    Weighted average
    Within-groups estimator


    Exercises

    1. The following is a panel of data on investment (y) and profit (x) for n = 3 firms over T = 10 periods.

              i = 1           i = 2           i = 3
     t       y       x       y       x       y       x
     1   13.32   12.85   20.30   22.93    8.85    8.65
     2   26.30   25.69   17.47   17.96   19.60   16.55
     3    2.62    5.48    9.31    9.16    3.87    1.47
     4   14.94   13.79   18.01   18.73   24.19   24.91
     5   15.80   15.41    7.63   11.31    3.99    5.01
     6   12.20   12.59   19.84   21.15    5.73    8.34
     7   14.93   16.64   13.76   16.13   26.68   22.70
     8   29.82   26.45   10.00   11.61   11.49    8.36
     9   20.32   19.64   19.51   19.55   18.49   15.44
    10    4.77    5.43   18.32   17.06   20.84   17.87

    a. Pool the data and compute the least squares regression coefficients of the model $y_{it} = \alpha + \beta x_{it} + \varepsilon_{it}$.
    b. Estimate the fixed effects model of (13-2), and then test the hypothesis that the constant term is the same for all three firms.
    c. Estimate the random effects model of (13-18), and then carry out the Lagrange multiplier test of the hypothesis that the classical model without the common effect applies.
    d. Carry out Hausman's specification test for the random versus the fixed effect model.

    2. Suppose that the model of (13-2) is formulated with an overall constant term and n - 1 dummy variables (dropping, say, the last one). Investigate the effect that this supposition has on the set of dummy variable coefficients and on the least squares estimates of the slopes.

    3. Use the data in Section 13.9.7 (the Grunfeld data) to fit the random and fixed effect models. There are five firms and 20 years of data for each. Use the F, LM, and/or Hausman statistics to determine which model, the fixed or random effects model, is preferable for these data.

    4. Derive the log-likelihood function for the model in (13-18), assuming that $\varepsilon_{it}$ and $u_i$ are normally distributed. [Hints: Write the log-likelihood function as $\ln L = \sum_{i=1}^{n} \ln L_i$, where $\ln L_i$ is the log-likelihood function for the T observations in group i. These T observations are joint normally distributed, with covariance matrix given in (13-20). The log-likelihood is the sum of the logs of the joint normal densities of the n sets of T observations,

    $$\varepsilon_{it} + u_i = y_{it} - \alpha - \boldsymbol{\beta}'\mathbf{x}_{it}.$$

    This step will involve the inverse and determinant of $\boldsymbol{\Sigma}$, the $T \times T$ matrix in (13-20). Use (B-66) to prove that

    $$\boldsymbol{\Sigma}^{-1} = \frac{1}{\sigma_\varepsilon^2}\left[\mathbf{I} - \frac{\sigma_u^2}{\sigma_\varepsilon^2 + T\sigma_u^2}\,\mathbf{i}_T\mathbf{i}_T'\right].$$

    To find the determinant, use the product of the characteristic roots. Note first that

    $$\left|\sigma_\varepsilon^2\mathbf{I} + \sigma_u^2\mathbf{i}\mathbf{i}'\right| = (\sigma_\varepsilon^2)^T\left|\mathbf{I} + \frac{\sigma_u^2}{\sigma_\varepsilon^2}\mathbf{i}\mathbf{i}'\right|.$$

    The roots are determined by

    $$\left[\mathbf{I} + \frac{\sigma_u^2}{\sigma_\varepsilon^2}\mathbf{i}\mathbf{i}'\right]\mathbf{c} = \lambda\mathbf{c} \quad\text{or}\quad \frac{\sigma_u^2}{\sigma_\varepsilon^2}\mathbf{i}\mathbf{i}'\mathbf{c} = (\lambda - 1)\mathbf{c}.$$

    Any vector whose elements sum to zero is a solution. There are T - 1 such independent vectors, so T - 1 characteristic roots are $(\lambda - 1) = 0$ or $\lambda = 1$. Premultiply the expression by $\mathbf{i}'$ to obtain the remaining characteristic root. (Remember to add one to the result.) Now, collect terms to obtain the log-likelihood.]

    5. Unbalanced design for random effects. Suppose that the random effects model of Section 13.4 is to be estimated with a panel in which the groups have different numbers of observations. Let $T_i$ be the number of observations in group i.
    a. Show that the pooled least squares estimator in (13-11) is unbiased and consistent despite this complication.
    b. Show that the estimator in (13-29) based on the pooled least squares estimator of $\boldsymbol{\beta}$ (or, for that matter, any consistent estimator of $\boldsymbol{\beta}$) is a consistent estimator of $\sigma_\varepsilon^2$.

    6. What are the probability limits of (1/n)LM, where LM is defined in (13-31), under the null hypothesis that $\sigma_u^2 = 0$ and under the alternative that $\sigma_u^2 \neq 0$?

    7. A two-way fixed effects model. Suppose that the fixed effects model is modified to include a time-specific dummy variable as well as an individual-specific variable. Then

    $$y_{it} = \alpha_i + \gamma_t + \boldsymbol{\beta}'\mathbf{x}_{it} + \varepsilon_{it}.$$

    At every observation, the individual- and time-specific dummy variables sum to 1, so there are some redundant coefficients. The discussion in Section 13.3.3 shows that one way to remove the redundancy is to include an overall constant and drop one of the individual-specific and one of the time-specific dummy variables. The model is, thus,

    $$y_{it} = \mu + (\alpha_i - \alpha_1) + (\gamma_t - \gamma_1) + \boldsymbol{\beta}'\mathbf{x}_{it} + \varepsilon_{it}.$$

    (Note that the respective time- or individual-specific variable is zero when t or i equals one.) Ordinary least squares estimates of $\boldsymbol{\beta}$ are then obtained by regression of $y_{it} - \bar{y}_{i.} - \bar{y}_{.t} + \bar{y}$ on $\mathbf{x}_{it} - \bar{\mathbf{x}}_{i.} - \bar{\mathbf{x}}_{.t} + \bar{\mathbf{x}}$. Then $(\alpha_i - \alpha_1)$ and $(\gamma_t - \gamma_1)$ are estimated using the expressions in (13-17), while $m = \bar{y} - \mathbf{b}'\bar{\mathbf{x}}$. Using the following data, estimate the full set of coefficients for the least squares dummy variable model:
                t=1    t=2    t=3    t=4    t=5    t=6    t=7    t=8    t=9   t=10
    i=1   y    21.7   10.9   33.5   22.0   17.6   16.1   19.0   18.1   14.9   23.2
          x1   26.4   17.3   23.8   17.6   26.2   21.1   17.5   22.9   22.9   14.9
          x2   5.79   2.60   8.36   5.50   5.26   1.03   3.11   4.87   3.79   7.24
    i=2   y    21.8   21.0   33.8   18.0   12.2   30.0   21.7   24.9   21.9   23.6
          x1   19.6   22.8   27.8   14.0   11.4   16.0   28.8   16.8   11.8   18.6
          x2   3.36   1.59   6.19   3.75   1.59   9.87   1.31   5.42   6.32   5.35
    i=3   y    25.2   41.9   31.3   27.8   13.2   27.9   33.3   20.5   16.7   20.7
          x1   13.4   29.7   21.6   25.1   14.1   24.1   10.5   22.1   17.0   20.5
          x2   9.57   9.62   6.61   7.24   1.64   5.99   9.00   1.75   1.74   1.82
    i=4   y    15.3   25.9   21.9   15.5   16.7   26.1   34.8   22.6   29.0   37.1
          x1   14.2   18.0   29.9   14.1   18.4   20.1   27.6   27.4   28.5   28.6
          x2   4.09   9.56   2.18   5.43   6.33   8.27   9.16   5.24   7.92   9.63


    Test the hypotheses that (1) the "period" effects are all zero, (2) the "group" effects are all zero, and (3) both period and group effects are zero. Use an F test in each case.

    8. Two-way random effects model. We modify the random effects model by the addition of a time-specific disturbance. Thus,

    $$y_{it} = \alpha + \boldsymbol{\beta}'\mathbf{x}_{it} + \varepsilon_{it} + u_i + v_t,$$

    where

    $$E[\varepsilon_{it}] = E[u_i] = E[v_t] = 0,$$
    $$E[\varepsilon_{it}u_j] = E[\varepsilon_{it}v_s] = E[u_iv_t] = 0 \text{ for all } i, j, t, s,$$
    $$\text{Var}[\varepsilon_{it}] = \sigma_\varepsilon^2, \quad \text{Cov}[\varepsilon_{it}, \varepsilon_{js}] = 0 \text{ for all } i, j, t, s,$$
    $$\text{Var}[u_i] = \sigma_u^2, \quad \text{Cov}[u_i, u_j] = 0 \text{ for all } i \neq j,$$
    $$\text{Var}[v_t] = \sigma_v^2, \quad \text{Cov}[v_t, v_s] = 0 \text{ for all } t \neq s.$$

    Write out the full covariance matrix for a data set with n = 2 and T = 2.

    9. The model

    $$y_1 = \beta x_1 + \varepsilon_1, \qquad y_2 = \beta x_2 + \varepsilon_2$$

    satisfies the groupwise heteroscedastic regression model of Section 11.7.2. All variables have zero means. The following sample second-moment matrix is obtained from a sample of 20 observations:

            y1   y2   x1   x2
    y1      20    6    4    3
    y2       6   10    3    6
    x1       4    3    5    2
    x2       3    6    2   10

    a. Compute the two separate OLS estimates of $\beta$, their sampling variances, the estimates of $\sigma_1^2$ and $\sigma_2^2$, and the $R^2$s in the two regressions.
    b. Carry out the Lagrange multiplier test of the hypothesis that $\sigma_1^2 = \sigma_2^2$.
    c. Compute the two-step FGLS estimate of $\beta$ and an estimate of its sampling variance. Test the hypothesis that $\beta$ equals 1.
    d. Carry out the Wald test of equal disturbance variances.
    e. Compute the maximum likelihood estimates of $\beta$, $\sigma_1^2$, and $\sigma_2^2$ by iterating the FGLS estimates to convergence.
    f. Carry out a likelihood ratio test of equal disturbance variances.
    g. Compute the two-step FGLS estimate of $\beta$, assuming that the model in (14-7) applies. (That is, allow for cross-sectional correlation.) Compare your results with those of part c.

    10. Suppose that in the groupwise heteroscedasticity model of Section 11.7.2, $\mathbf{X}_i$ is the same for all i. What is the generalized least squares estimator of $\boldsymbol{\beta}$? How would you compute the estimator if it were necessary to estimate $\sigma_i^2$?

    11. Repeat Exercise 10 for the cross-sectionally correlated model of Section 13.9.1.


    12. The following table presents a hypothetical panel of data:

              i = 1           i = 2           i = 3
     t       y       x       y       x       y       x
     1   30.27   24.31   38.71   28.35   37.03   21.16
     2   35.59   28.47   29.74   27.38   43.82   26.76
     3   17.90   23.74   11.29   12.74   37.12   22.21
     4   44.90   25.44   26.17   21.08   24.34   19.02
     5   37.58   20.80    5.85   14.02   26.15   18.64
     6   23.15   10.55   29.01   20.43   26.01   18.97
     7   30.53   18.40   30.38   28.13   29.64   21.35
     8   39.90   25.40   36.03   21.78   30.25   21.34
     9   20.44   13.57   37.90   25.65   25.41   15.86
    10   36.85   25.60   33.90   11.66   26.04   13.28

    a. Estimate the groupwise heteroscedastic model of Section 11.7.2. Include an estimate of the asymptotic variance of the slope estimator. Use a two-step procedure, basing the FGLS estimator at the second step on residuals from the pooled least squares regression.
    b. Carry out the Wald, Lagrange multiplier, and likelihood ratio tests of the hypothesis that the variances are all equal. For the likelihood ratio test, use the FGLS estimates.
    c. Carry out a Lagrange multiplier test of the hypothesis that the disturbances are uncorrelated across individuals.


    14 SYSTEMS OF REGRESSION EQUATIONS

    14.1 INTRODUCTION

    There are many settings in which the models of the previous chapters apply to a group of related variables. In these contexts, it makes sense to consider the several models jointly. Some examples follow.

    1. The capital asset pricing model of finance specifies that for a given security,

    $$r_{it} - r_{ft} = \alpha_i + \beta_i(r_{mt} - r_{ft}) + \varepsilon_{it},$$

    where $r_{it}$ is the return over period t on security i, $r_{ft}$ is the return on a risk-free security, $r_{mt}$ is the market return, and $\beta_i$ is the security's beta coefficient. The disturbances are obviously correlated across securities. The knowledge that the return on security i exceeds the risk-free rate by a given amount gives some information about the excess return of security j, at least for some j's. It may be useful to estimate the equations jointly rather than ignore this connection.

    2. In the Grunfeld-Boot and de Witt investment model of Section 13.9.7, we examined a set of firms, each of which makes investment decisions based on variables that reflect anticipated profit and replacement of the capital stock. We will now specify

    $$I_{it} = \beta_{1i} + \beta_{2i}F_{it} + \beta_{3i}C_{it} + \varepsilon_{it}.$$

    Whether the parameter vector should be the same for all firms is a question that we shall study in this chapter. But the disturbances in the investment equations certainly include factors that are common to all the firms, such as the perceived general health of the economy, as well as factors that are specific to the particular firm or industry.

    3. In a model of production, the optimization conditions of economic theory imply that if a firm faces a set of factor prices $\mathbf{p}$, then its set of cost-minimizing factor demands for producing output Y will be a set of equations of the form $x_m = f_m(Y, \mathbf{p})$. The model is

    $$x_1 = f_1(Y, \mathbf{p} : \boldsymbol{\theta}) + \varepsilon_1,$$
    $$x_2 = f_2(Y, \mathbf{p} : \boldsymbol{\theta}) + \varepsilon_2,$$
    $$\vdots$$
    $$x_M = f_M(Y, \mathbf{p} : \boldsymbol{\theta}) + \varepsilon_M.$$

    Once again, the disturbances should be correlated. In addition, the same parameters of the production technology will enter all the demand equations, so the set of equations


    have cross-equation restrictions. Estimating the equations separately will waste the information that the same set of parameters appears in all the equations.

    All these examples have a common multiple equation structure, which we may write as

    $$\begin{aligned}
    \mathbf{y}_1 &= \mathbf{X}_1\boldsymbol{\beta}_1 + \boldsymbol{\varepsilon}_1,\\
    \mathbf{y}_2 &= \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}_2,\\
    &\;\;\vdots\\
    \mathbf{y}_M &= \mathbf{X}_M\boldsymbol{\beta}_M + \boldsymbol{\varepsilon}_M.
    \end{aligned} \tag{14-1}$$

    There are M equations and T observations in the sample of data used to estimate them.¹ The second and third examples embody different types of constraints across equations and different structures of the disturbances. A basic set of principles will apply to them all, however.² Section 14.2 below examines the general model in which each equation has its own fixed set of parameters, and examines efficient estimation techniques. Production and consumer demand models are a special case of the general model in which the equations of the model obey an adding-up constraint that has important implications for specification and estimation. Some general results for demand systems are considered in Section 14.3. In Section 14.4 we examine a classic application of the model in Section 14.3 that illustrates a number of the interesting features of the current genre of demand studies in the applied literature. Section 14.4 introduces estimation of nonlinear systems, instrumental variable estimation, and GMM estimation for a system of equations.

    Example 14.1 Grunfeld's Investment Data

    To illustrate the techniques to be developed in this chapter, we will use the Grunfeld data first examined in Section 13.9.7 in the previous chapter. Grunfeld's model is now

    $$I_{it} = \beta_{1i} + \beta_{2i}F_{it} + \beta_{3i}C_{it} + \varepsilon_{it},$$

    where i indexes firms, t indexes years, and

    $I_{it}$ = gross investment,
    $F_{it}$ = market value of the firm at the end of the previous year,
    $C_{it}$ = value of the stock of plant and equipment at the end of the previous year.

    All figures are in millions of dollars. The sample consists of 20 years of observations (1935-1954) on five firms. The model extension we consider in this chapter is to allow the coefficients to vary across firms in an unstructured fashion.

    14.2 THE SEEMINGLY UNRELATED REGRESSIONS MODEL

    The seemingly unrelated regressions (SUR) model in (14-1) is

    $$\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta}_i + \boldsymbol{\varepsilon}_i, \quad i = 1, \ldots, M, \tag{14-2}$$

    ¹ The use of T is not necessarily meant to imply any connection to time series. For instance, in the third example above, the data might be cross-sectional.
    ² See the surveys by Srivastava and Dwivedi (1979), Srivastava and Giles (1987), and Feibig (2001).


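    where

    $$\boldsymbol{\varepsilon} = [\boldsymbol{\varepsilon}_1', \boldsymbol{\varepsilon}_2', \ldots, \boldsymbol{\varepsilon}_M']'$$

    and

    $$E[\boldsymbol{\varepsilon} \mid \mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_M] = \mathbf{0}, \qquad E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}' \mid \mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_M] = \boldsymbol{\Omega}.$$

    We assume that a total of T observations are used in estimating the parameters of the M equations.³ Each equation involves $K_i$ regressors, for a total of $K = \sum_{i=1}^{M} K_i$. We will require $T > K_i$. The data are assumed to be well behaved, as described in Section 5.2.1, and we shall not treat the issue separately here. For the present, we also assume that disturbances are uncorrelated across observations. Therefore,

    $$E[\varepsilon_{it}\varepsilon_{js} \mid \mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_M] = \sigma_{ij} \text{ if } t = s \text{ and } 0 \text{ otherwise.}$$

    The disturbance formulation is therefore

    $$E[\boldsymbol{\varepsilon}_i\boldsymbol{\varepsilon}_j' \mid \mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_M] = \sigma_{ij}\mathbf{I}_T$$

    or

    $$E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}' \mid \mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_M] = \boldsymbol{\Omega} = \begin{bmatrix}
    \sigma_{11}\mathbf{I} & \sigma_{12}\mathbf{I} & \cdots & \sigma_{1M}\mathbf{I}\\
    \sigma_{21}\mathbf{I} & \sigma_{22}\mathbf{I} & \cdots & \sigma_{2M}\mathbf{I}\\
    \vdots & \vdots & & \vdots\\
    \sigma_{M1}\mathbf{I} & \sigma_{M2}\mathbf{I} & \cdots & \sigma_{MM}\mathbf{I}
    \end{bmatrix}. \tag{14-3}$$

    Note that when the data matrices are group-specific observations on the same variables, as in Example 14.1, the specification of this model is precisely that of the covariance structures model of Section 13.9, save for the extension here that allows the parameter vector to vary across groups. The covariance structures model is, therefore, a testable special case.⁴

    It will be convenient in the discussion below to have a term for the particular kind of model in which the data matrices are group-specific data sets on the same set of variables. The Grunfeld model noted in Example 14.1 is such a case. This special case of the seemingly unrelated regressions model is a multivariate regression model. In contrast, the cost function model examined in Section 14.5 is not of this type; it consists of a cost function that involves output and prices and a set of cost share equations that have only a set of constant terms. We emphasize that this is merely a convenient term for a specific form of the SUR model, not a modification of the model itself.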
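    The block pattern in (14-3) is a Kronecker product, which a few lines of code make concrete. The following is a minimal sketch in Python with NumPy; the sizes M and T and the entries of Sigma are made-up illustrative values, not taken from the text.

```python
import numpy as np

M, T = 3, 4                            # illustrative sizes, not from the text
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])    # hypothetical M x M disturbance covariance

# Stacking the M equations gives an MT x MT matrix whose (i, j) block
# is sigma_ij * I_T, which is exactly Sigma kron I_T.
Omega = np.kron(Sigma, np.eye(T))


def block(i, j):
    """Return the T x T block of Omega for equations i and j (0-based)."""
    return Omega[i * T:(i + 1) * T, j * T:(j + 1) * T]


print(Omega.shape)     # (MT, MT)
print(block(0, 1))     # sigma_12 times the identity matrix
```

    Because disturbances are uncorrelated across observations, every off-diagonal element inside a block is zero; only the M(M + 1)/2 distinct elements of Sigma are free parameters.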
    14.2.1 GENERALIZED LEAST SQUARES

    Each equation is, by itself, a classical regression. Therefore, the parameters could be estimated consistently, if not efficiently, one equation at a time by ordinary least squares.

    ³ There are a few results for unequal numbers of observations, such as Schmidt (1977), Baltagi, Garvin, and Kerman (1989), Conniffe (1985), Hwang (1990), and Im (1994). But generally, the case of fixed T is the norm in practice.
    ⁴ This is the test of "Aggregation Bias" that is the subject of Zellner (1962, 1963). (The bias results if parameter equality is incorrectly assumed.)


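    The generalized regression model applies to the stacked model,

    $$\begin{bmatrix}\mathbf{y}_1\\ \mathbf{y}_2\\ \vdots\\ \mathbf{y}_M\end{bmatrix} =
    \begin{bmatrix}
    \mathbf{X}_1 & \mathbf{0} & \cdots & \mathbf{0}\\
    \mathbf{0} & \mathbf{X}_2 & \cdots & \mathbf{0}\\
    \vdots & \vdots & & \vdots\\
    \mathbf{0} & \mathbf{0} & \cdots & \mathbf{X}_M
    \end{bmatrix}
    \begin{bmatrix}\boldsymbol{\beta}_1\\ \boldsymbol{\beta}_2\\ \vdots\\ \boldsymbol{\beta}_M\end{bmatrix} +
    \begin{bmatrix}\boldsymbol{\varepsilon}_1\\ \boldsymbol{\varepsilon}_2\\ \vdots\\ \boldsymbol{\varepsilon}_M\end{bmatrix}
    = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}. \tag{14-4}$$

    Therefore, the efficient estimator is generalized least squares.⁵ The model has a particularly convenient form. For the t-th observation, the $M \times M$ covariance matrix of the disturbances is

    $$\boldsymbol{\Sigma} = \begin{bmatrix}
    \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1M}\\
    \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2M}\\
    \vdots & \vdots & & \vdots\\
    \sigma_{M1} & \sigma_{M2} & \cdots & \sigma_{MM}
    \end{bmatrix}, \tag{14-5}$$

    so, in (14-3),

    $$\boldsymbol{\Omega} = \boldsymbol{\Sigma} \otimes \mathbf{I}$$

    and

    $$\boldsymbol{\Omega}^{-1} = \boldsymbol{\Sigma}^{-1} \otimes \mathbf{I}. \tag{14-6}$$

    Denoting the ij-th element of $\boldsymbol{\Sigma}^{-1}$ by $\sigma^{ij}$, we find that the GLS estimator is

    $$\hat{\boldsymbol{\beta}} = [\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X}]^{-1}\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{y} = [\mathbf{X}'(\boldsymbol{\Sigma}^{-1} \otimes \mathbf{I})\mathbf{X}]^{-1}\mathbf{X}'(\boldsymbol{\Sigma}^{-1} \otimes \mathbf{I})\mathbf{y}.$$

    Expanding the Kronecker products produces

    $$\hat{\boldsymbol{\beta}} = \begin{bmatrix}
    \sigma^{11}\mathbf{X}_1'\mathbf{X}_1 & \sigma^{12}\mathbf{X}_1'\mathbf{X}_2 & \cdots & \sigma^{1M}\mathbf{X}_1'\mathbf{X}_M\\
    \sigma^{21}\mathbf{X}_2'\mathbf{X}_1 & \sigma^{22}\mathbf{X}_2'\mathbf{X}_2 & \cdots & \sigma^{2M}\mathbf{X}_2'\mathbf{X}_M\\
    \vdots & \vdots & & \vdots\\
    \sigma^{M1}\mathbf{X}_M'\mathbf{X}_1 & \sigma^{M2}\mathbf{X}_M'\mathbf{X}_2 & \cdots & \sigma^{MM}\mathbf{X}_M'\mathbf{X}_M
    \end{bmatrix}^{-1}
    \begin{bmatrix}
    \sum_{j=1}^{M}\sigma^{1j}\mathbf{X}_1'\mathbf{y}_j\\
    \sum_{j=1}^{M}\sigma^{2j}\mathbf{X}_2'\mathbf{y}_j\\
    \vdots\\
    \sum_{j=1}^{M}\sigma^{Mj}\mathbf{X}_M'\mathbf{y}_j
    \end{bmatrix}. \tag{14-7}$$

    The asymptotic covariance matrix for the GLS estimator is the inverse matrix in (14-7). All the results of Chapter 10 for the generalized regression model extend to this model (which has both heteroscedasticity and "autocorrelation"). This estimator is obviously different from ordinary least squares. At this point, however, the equations are linked only by their disturbances (hence the name seemingly unrelated regressions model), so it is interesting to ask just how much efficiency is gained by using generalized least squares instead of ordinary least squares. Zellner (1962) and Dwivedi and Srivastava (1978) have analyzed some special cases in detail.

    ⁵ See Zellner (1962) and Telser (1964).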


    1. If the equations are actually unrelated, that is, if $\sigma_{ij} = 0$ for $i \neq j$, then there is obviously no payoff to GLS estimation of the full set of equations. Indeed, full GLS is equation by equation OLS.⁶
    2. If the equations have identical explanatory variables, that is, if $\mathbf{X}_i = \mathbf{X}_j$, then OLS and GLS are identical. We will turn to this case in Section 14.2.2 and then examine an important application in Section 14.2.5.⁷
    3. If the regressors in one block of equations are a subset of those in another, then GLS brings no efficiency gain over OLS in estimation of the smaller set of equations; thus, GLS and OLS are once again identical. We will look at an application of this result in Section 19.6.5.⁸

    In the more general case, with unrestricted correlation of the disturbances and different regressors in the equations, the results are complicated and dependent on the data. Two propositions that apply generally are as follows:

    1. The greater is the correlation of the disturbances, the greater is the efficiency gain accruing to GLS.
    2. The less correlation there is between the X matrices, the greater is the gain in efficiency in using GLS.⁹
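    The GLS estimator for the stacked system can be sketched in a few lines. The following Python/NumPy code is an illustration under stated assumptions, not a reference implementation: the function names (stack_sur, sur_gls, sur_gls_expanded) are hypothetical, Sigma is treated as known, and the data are simulated. It computes the estimator once through the Kronecker form and once through the expanded blocks of (14-7), and it also checks the first special case in the list above, namely that a diagonal Sigma reduces GLS to equation-by-equation OLS.

```python
import numpy as np


def stack_sur(X_list, y_list):
    """Stack M equations into y = X beta + eps with block-diagonal X, as in (14-4)."""
    T = X_list[0].shape[0]
    K = [Xi.shape[1] for Xi in X_list]
    X = np.zeros((T * len(X_list), sum(K)))
    col = 0
    for i, Xi in enumerate(X_list):
        X[i * T:(i + 1) * T, col:col + K[i]] = Xi
        col += K[i]
    return X, np.concatenate(y_list)


def sur_gls(X_list, y_list, Sigma):
    """GLS with known Sigma, using Omega^{-1} = Sigma^{-1} kron I_T from (14-6)."""
    T = X_list[0].shape[0]
    X, y = stack_sur(X_list, y_list)
    Omega_inv = np.kron(np.linalg.inv(Sigma), np.eye(T))
    A = X.T @ Omega_inv @ X
    beta = np.linalg.solve(A, X.T @ Omega_inv @ y)
    return beta, np.linalg.inv(A)      # estimate and its asymptotic covariance


def sur_gls_expanded(X_list, y_list, Sigma):
    """The same estimator from the expanded form (14-7), without forming Omega."""
    M = len(X_list)
    s = np.linalg.inv(Sigma)           # s[i, j] is sigma^{ij}
    A = np.block([[s[i, j] * (X_list[i].T @ X_list[j]) for j in range(M)]
                  for i in range(M)])
    b = np.concatenate([sum(s[i, j] * (X_list[i].T @ y_list[j]) for j in range(M))
                        for i in range(M)])
    return np.linalg.solve(A, b)


# Simulated two-equation system (all numbers are illustrative)
rng = np.random.default_rng(0)
T = 50
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])
X2 = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
Sigma = np.array([[1.0, 0.8], [0.8, 2.0]])
eps = rng.multivariate_normal(np.zeros(2), Sigma, size=T)
y1 = X1 @ np.array([1.0, 2.0]) + eps[:, 0]
y2 = X2 @ np.array([-1.0, 0.5, 1.5]) + eps[:, 1]

beta_gls, cov_gls = sur_gls([X1, X2], [y1, y2], Sigma)
beta_exp = sur_gls_expanded([X1, X2], [y1, y2], Sigma)

# With a diagonal Sigma, GLS collapses to equation-by-equation OLS
beta_diag, _ = sur_gls([X1, X2], [y1, y2], np.diag([1.0, 2.0]))
b_ols = np.concatenate([np.linalg.lstsq(X1, y1, rcond=None)[0],
                        np.linalg.lstsq(X2, y2, rcond=None)[0]])
print(np.allclose(beta_gls, beta_exp), np.allclose(beta_diag, b_ols))
```

    The expanded form only ever inverts a K x K matrix and works with M x M weights, so it avoids building the MT x MT matrix explicitly. For small illustrative problems such as this one the two routes are interchangeable.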
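    14.2.2 SEEMINGLY UNRELATED REGRESSIONS WITH IDENTICAL REGRESSORS

    The case of identical regressors is quite common, notably in the capital asset pricing model in empirical finance (see Section 14.2.5). In this special case, generalized least squares is equivalent to equation by equation ordinary least squares. Impose the assumption that $\mathbf{X}_i = \mathbf{X}_j = \mathbf{X}$, so that $\mathbf{X}_i'\mathbf{X}_j = \mathbf{X}'\mathbf{X}$ for all i and j in (14-7). The inverse matrix on the right-hand side now becomes $[\boldsymbol{\Sigma}^{-1} \otimes \mathbf{X}'\mathbf{X}]^{-1}$, which, using (A-76), equals $[\boldsymbol{\Sigma} \otimes (\mathbf{X}'\mathbf{X})^{-1}]$. Also on the right-hand side, each term $\mathbf{X}_i'\mathbf{y}_j$ equals $\mathbf{X}'\mathbf{y}_j$, which, in turn, equals $\mathbf{X}'\mathbf{X}\mathbf{b}_j$. With these results, after moving the common $\mathbf{X}'\mathbf{X}$ out of the summations on the right-hand side, we obtain

    $$\hat{\boldsymbol{\beta}} = \begin{bmatrix}
    \sigma_{11}(\mathbf{X}'\mathbf{X})^{-1} & \sigma_{12}(\mathbf{X}'\mathbf{X})^{-1} & \cdots & \sigma_{1M}(\mathbf{X}'\mathbf{X})^{-1}\\
    \sigma_{21}(\mathbf{X}'\mathbf{X})^{-1} & \sigma_{22}(\mathbf{X}'\mathbf{X})^{-1} & \cdots & \sigma_{2M}(\mathbf{X}'\mathbf{X})^{-1}\\
    \vdots & \vdots & & \vdots\\
    \sigma_{M1}(\mathbf{X}'\mathbf{X})^{-1} & \sigma_{M2}(\mathbf{X}'\mathbf{X})^{-1} & \cdots & \sigma_{MM}(\mathbf{X}'\mathbf{X})^{-1}
    \end{bmatrix}
    \begin{bmatrix}
    (\mathbf{X}'\mathbf{X})\sum_{l=1}^{M}\sigma^{1l}\mathbf{b}_l\\
    (\mathbf{X}'\mathbf{X})\sum_{l=1}^{M}\sigma^{2l}\mathbf{b}_l\\
    \vdots\\
    (\mathbf{X}'\mathbf{X})\sum_{l=1}^{M}\sigma^{Ml}\mathbf{b}_l
    \end{bmatrix}. \tag{14-8}$$

    ⁶ See also Baltagi (1989) and Bartels and Feibig (1991) for other cases in which OLS = GLS.
    ⁷ An intriguing result, albeit probably of negligible practical significance, is that the result also applies if the X's are all nonsingular, and not necessarily identical, linear combinations of the same set of variables. The formal result, which is a corollary of Kruskal's Theorem [see Davidson and MacKinnon (1993, p. 294)], is that OLS and GLS will be the same if the K columns of $\mathbf{X}$ are a linear combination of exactly K characteristic vectors of $\boldsymbol{\Omega}$. By showing the equality of OLS and GLS here, we have verified the conditions of the corollary. The general result is pursued in the exercises. The intriguing result cited is now an obvious case.
    ⁸ The result was analyzed by Goldberger (1970) and later by Revankar (1974) and Conniffe (1982a, b).
    ⁹ See also Binkley (1982) and Binkley and Nelson (1988).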


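    Now, we isolate one of the subvectors, say the first, from $\hat{\boldsymbol{\beta}}$. After multiplication, the moment matrices cancel, and we are left with

    $$\hat{\boldsymbol{\beta}}_1 = \sum_{j=1}^{M}\sigma_{1j}\sum_{l=1}^{M}\sigma^{jl}\mathbf{b}_l
    = \mathbf{b}_1\left(\sum_{j=1}^{M}\sigma_{1j}\sigma^{j1}\right) + \mathbf{b}_2\left(\sum_{j=1}^{M}\sigma_{1j}\sigma^{j2}\right) + \cdots + \mathbf{b}_M\left(\sum_{j=1}^{M}\sigma_{1j}\sigma^{jM}\right).$$

    The terms in parentheses are the elements of the first row of $\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{-1} = \mathbf{I}$, so the end result is $\hat{\boldsymbol{\beta}}_1 = \mathbf{b}_1$. For the remaining subvectors, which are obtained the same way, $\hat{\boldsymbol{\beta}}_i = \mathbf{b}_i$, which is the result we sought.¹⁰

    To reiterate, the important result we have here is that in the SUR model, when all equations have the same regressors, the efficient estimator is single-equation ordinary least squares; OLS is the same as GLS. Also, the asymptotic covariance matrix of $\hat{\boldsymbol{\beta}}$ for this case is given by the large inverse matrix in brackets in (14-8), which would be estimated by

    $$\text{Est.Asy.Cov}[\hat{\boldsymbol{\beta}}_i, \hat{\boldsymbol{\beta}}_j] = \hat{\sigma}_{ij}(\mathbf{X}'\mathbf{X})^{-1}, \quad i, j = 1, \ldots, M, \quad\text{where } \hat{\Sigma}_{ij} = \hat{\sigma}_{ij} = \frac{1}{T}\mathbf{e}_i'\mathbf{e}_j.$$

    Except in some special cases, this general result is lost if there are any restrictions on $\boldsymbol{\beta}$, either within or across equations. We will examine one of those cases, the block of zeros restriction, in Sections 14.2.6 and 19.6.5.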
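    The equivalence of GLS and OLS under identical regressors is easy to confirm numerically. The sketch below uses Python with NumPy on simulated data; the dimensions, the matrix Sigma, and the coefficient values are arbitrary choices made for illustration. It compares stacked GLS with equation-by-equation OLS and checks that the GLS covariance matrix has the Kronecker form that underlies (14-8).

```python
import numpy as np

rng = np.random.default_rng(1)
T, K, M = 40, 3, 3                                   # illustrative sizes
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
Sigma = np.array([[1.0, 0.6, 0.3],
                  [0.6, 2.0, 0.5],
                  [0.3, 0.5, 1.5]])                  # hypothetical covariance
E = rng.multivariate_normal(np.zeros(M), Sigma, size=T)
B = rng.normal(size=(K, M))                          # one coefficient column per equation
Y = X @ B + E                                        # T x M matrix of dependent variables

# Equation-by-equation OLS: column i of b_ols is b_i
b_ols = np.linalg.solve(X.T @ X, X.T @ Y)

# Stacked GLS with the same X in every equation
X_big = np.kron(np.eye(M), X)                        # block-diagonal regressor matrix
y_big = Y.T.reshape(-1)                              # y_1, then y_2, ..., then y_M
Omega_inv = np.kron(np.linalg.inv(Sigma), np.eye(T))
A = X_big.T @ Omega_inv @ X_big
beta_gls = np.linalg.solve(A, X_big.T @ Omega_inv @ y_big)

# Covariance of the GLS estimator versus Sigma kron (X'X)^{-1}
cov_gls = np.linalg.inv(A)
cov_kron = np.kron(Sigma, np.linalg.inv(X.T @ X))

print(np.allclose(beta_gls, b_ols.T.reshape(-1)))
print(np.allclose(cov_gls, cov_kron))
```

    The comparison holds for any positive definite Sigma, which is the point of the derivation: the weights cancel exactly, so knowledge of Sigma adds nothing to the point estimates when the regressors coincide.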
    14.2.3 FEASIBLE GENERALIZED LEAST SQUARES

    The preceding discussion assumes that $\boldsymbol{\Sigma}$ is known, which, as usual, is unlikely to be the case. FGLS estimators have been devised, however.¹¹ The least squares residuals may be used (of course) to estimate consistently the elements of $\boldsymbol{\Sigma}$ with

    $$\hat{\sigma}_{ij} = s_{ij} = \frac{\mathbf{e}_i'\mathbf{e}_j}{T}. \tag{14-9}$$

    The consistency of $s_{ij}$ follows from that of $\mathbf{b}_i$ and $\mathbf{b}_j$. A degrees of freedom correction in the divisor is occasionally suggested. Two possibilities are

    $$s_{ij}^{*} = \frac{\mathbf{e}_i'\mathbf{e}_j}{[(T - K_i)(T - K_j)]^{1/2}} \quad\text{and}\quad s_{ij}^{**} = \frac{\mathbf{e}_i'\mathbf{e}_j}{T - \max(K_i, K_j)}.\text{¹²}$$

    The second is unbiased only if i equals j or $K_i$ equals $K_j$, whereas the first is unbiased only if i equals j. Whether unbiasedness of the estimate of $\boldsymbol{\Sigma}$ used for FGLS is a virtue here is uncertain. The asymptotic properties of the feasible GLS estimator do not rely on an unbiased estimator of $\boldsymbol{\Sigma}$; only consistency is required. All our results from Chapters 10-13 for FGLS estimators extend to this model, with no modification. We shall use (14-9) in what follows. With

    $$\mathbf{S} = \begin{bmatrix}
    s_{11} & s_{12} & \cdots & s_{1M}\\
    s_{21} & s_{22} & \cdots & s_{2M}\\
    \vdots & \vdots & & \vdots\\
    s_{M1} & s_{M2} & \cdots & s_{MM}
    \end{bmatrix} \tag{14-10}$$

    in hand, FGLS can proceed as usual. Iterated FGLS will be maximum likelihood if it is based on (14-9).

    Goodness-of-fit measures for the system have been devised. For instance, McElroy (1977) suggested the systemwide measure

    $$R_*^2 = 1 - \frac{\hat{\boldsymbol{\varepsilon}}'\hat{\boldsymbol{\Omega}}^{-1}\hat{\boldsymbol{\varepsilon}}}{\sum_{i=1}^{M}\sum_{j=1}^{M}\hat{\sigma}^{ij}\left[\sum_{t=1}^{T}(y_{it} - \bar{y}_i)(y_{jt} - \bar{y}_j)\right]} = 1 - \frac{M}{\text{tr}(\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{S}_{yy})}, \tag{14-11}$$

    where the hat indicates the FGLS estimate. The advantage of the second formulation is that it involves $M \times M$ matrices, which are typically quite small, whereas $\hat{\boldsymbol{\Omega}}$ is $MT \times MT$. In our case, M equals 5, but MT equals 100. The measure is bounded by 0 and 1 and is related to the F statistic used to test the hypothesis that all the slopes in the model are zero. Fit measures in this generalized regression model have all the shortcomings discussed in Section 10.5.1. An additional problem for this model is that overall fit measures such as that in (14-11) will obscure the variation in fit across equations. For the investment example, using the FGLS residuals for the least restrictive model in Table 13.4 (the covariance structures model with identical coefficient vectors), McElroy's measure gives a value of 0.846. But as can be seen in Figure 14.1, this apparently good

    ¹⁰ See Hashimoto and Ohtani (1996) for discussion of hypothesis testing in this case.
    ¹¹ See Zellner (1962) and Zellner and Huang (1962).
    ¹² See, as well, Judge et al. (1985), Theil (1971), and Srivistava and Giles (1987).

    FIGURE 14.1 FGLS Residuals with Equality Restrictions. [Plot of residuals (vertical axis, -400 to 400) by firm: General Motors, Chrysler, General Electric, Westinghouse, U.S. Steel.]


    FIGURE 14.2 SUR Residuals. [Plot of residuals (vertical axis, -400 to 400) by firm: General Motors, Chrysler, General Electric, Westinghouse, U.S. Steel.]

    overall fit is an aggregate of mediocre fits for Chrysler and Westinghouse and obviously terrible fits for GM, GE, and U.S. Steel. Indeed, the conventional measure for GE based on the same FGLS residuals, $1 - \mathbf{e}_{GE}'\mathbf{e}_{GE}/\mathbf{y}_{GE}'\mathbf{M}^0\mathbf{y}_{GE}$, is -16.7!

    We might use (14-11) to compare the fit of the unrestricted model with separate coefficient vectors for each firm with the restricted one with a common coefficient vector. The result in (14-11) with the FGLS residuals based on the seemingly unrelated regression estimates in Table 14.1 (in Example 14.2) gives a value of 0.871, which compared to 0.846 appears to be an unimpressive improvement in the fit of the model. But a comparison of the residual plot in Figure 14.2 with that in Figure 14.1 shows that, on the contrary, the fit of the model has improved dramatically. The upshot is that although a fit measure for the system might have some virtue as a descriptive measure, it should be used with care.

    For testing a hypothesis about $\boldsymbol{\beta}$, a statistic analogous to the F ratio in multiple regression analysis is

    $$F[J, MT - K] = \frac{(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{q})'[\mathbf{R}(\mathbf{X}'\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{q})/J}{\hat{\boldsymbol{\varepsilon}}'\boldsymbol{\Omega}^{-1}\hat{\boldsymbol{\varepsilon}}/(MT - K)}. \tag{14-12}$$

    The computation requires the unknown $\boldsymbol{\Omega}$. If we insert the FGLS estimate $\hat{\boldsymbol{\Omega}}$ based on (14-9) and use the result that the denominator converges to one, then, in large samples, the statistic will behave the same as

    $$\hat{F} = \frac{1}{J}(\mathbf{R}\hat{\hat{\boldsymbol{\beta}}} - \mathbf{q})'[\mathbf{R}\,\widehat{\text{Var}}[\hat{\hat{\boldsymbol{\beta}}}]\,\mathbf{R}']^{-1}(\mathbf{R}\hat{\hat{\boldsymbol{\beta}}} - \mathbf{q}). \tag{14-13}$$

    This can be referred to the standard F table. Because it uses the estimated $\boldsymbol{\Sigma}$, even with normally distributed disturbances, the F distribution is only valid approximately. In general, the statistic $F[J, n]$ converges to 1/J times a chi-squared [J] as $n \to \infty$.


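    Therefore, an alternative test statistic that has a limiting chi-squared distribution with J degrees of freedom when the hypothesis is true is

    $$J\hat{F} = (\mathbf{R}\hat{\hat{\boldsymbol{\beta}}} - \mathbf{q})'[\mathbf{R}\,\widehat{\text{Var}}[\hat{\hat{\boldsymbol{\beta}}}]\,\mathbf{R}']^{-1}(\mathbf{R}\hat{\hat{\boldsymbol{\beta}}} - \mathbf{q}). \tag{14-14}$$

    This can be recognized as a Wald statistic that measures the distance between $\mathbf{R}\hat{\hat{\boldsymbol{\beta}}}$ and $\mathbf{q}$. Both statistics are valid asymptotically, but (14-13) may perform better in a small or moderately sized sample.¹³ Once again, the divisor used in computing $\hat{\sigma}_{ij}$ may make a difference, but there is no general rule.

    A hypothesis of particular interest is the homogeneity restriction of equal coefficient vectors in the multivariate regression model. That case is fairly common in this setting. The homogeneity restriction is that $\boldsymbol{\beta}_i = \boldsymbol{\beta}_M$, $i = 1, \ldots, M - 1$. Consistent with (14-13) and (14-14), we would form the hypothesis as

    $$\mathbf{R}\boldsymbol{\beta} = \begin{bmatrix}
    \mathbf{I} & \mathbf{0} & \cdots & \mathbf{0} & -\mathbf{I}\\
    \mathbf{0} & \mathbf{I} & \cdots & \mathbf{0} & -\mathbf{I}\\
    \vdots & \vdots & & \vdots & \vdots\\
    \mathbf{0} & \mathbf{0} & \cdots & \mathbf{I} & -\mathbf{I}
    \end{bmatrix}
    \begin{bmatrix}\boldsymbol{\beta}_1\\ \boldsymbol{\beta}_2\\ \vdots\\ \boldsymbol{\beta}_M\end{bmatrix}
    = \begin{bmatrix}\boldsymbol{\beta}_1 - \boldsymbol{\beta}_M\\ \boldsymbol{\beta}_2 - \boldsymbol{\beta}_M\\ \vdots\\ \boldsymbol{\beta}_{M-1} - \boldsymbol{\beta}_M\end{bmatrix} = \mathbf{0}. \tag{14-15}$$

    This specifies a total of $(M - 1)K$ restrictions on the $KM \times 1$ parameter vector. Denote the estimated asymptotic covariance for $(\hat{\hat{\boldsymbol{\beta}}}_i, \hat{\hat{\boldsymbol{\beta}}}_j)$ as $\hat{\mathbf{V}}_{ij}$. With $\mathbf{R}$ as in (14-15), the bracketed matrix in (14-13) would have typical block

    $$[\mathbf{R}\,\widehat{\text{Var}}[\hat{\hat{\boldsymbol{\beta}}}]\,\mathbf{R}']_{ij} = \hat{\mathbf{V}}_{ij} - \hat{\mathbf{V}}_{iM} - \hat{\mathbf{V}}_{Mj} + \hat{\mathbf{V}}_{MM}.$$

    This may be a considerable amount of computation. The test will be simpler if the model has been fit by maximum likelihood, as we examine in the next section.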
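    The two-step FGLS estimator and the homogeneity test can be sketched together. The code below is an illustrative Python/NumPy sketch on simulated data, not the Grunfeld data; the function names sur_fgls and homogeneity_test are hypothetical, and the coefficient values are chosen so that the homogeneity restriction is false. It uses the divisor T of (14-9), builds R as in (14-15), and returns both the Wald statistic of (14-14) and the F-type statistic of (14-13).

```python
import numpy as np


def sur_fgls(X_list, y_list):
    """Two-step FGLS: OLS residuals give S as in (14-9), then GLS with S."""
    M, T = len(X_list), X_list[0].shape[0]
    resid = np.column_stack([y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
                             for X, y in zip(X_list, y_list)])
    S = resid.T @ resid / T                       # divisor T, no df correction
    K = [X.shape[1] for X in X_list]
    X_big = np.zeros((M * T, sum(K)))
    col = 0
    for i, Xi in enumerate(X_list):
        X_big[i * T:(i + 1) * T, col:col + K[i]] = Xi
        col += K[i]
    y_big = np.concatenate(y_list)
    W = np.kron(np.linalg.inv(S), np.eye(T))      # estimated Omega^{-1}
    V = np.linalg.inv(X_big.T @ W @ X_big)        # estimated asymptotic covariance
    beta = V @ X_big.T @ W @ y_big
    return beta, V, S


def homogeneity_test(beta, V, M, K):
    """Wald statistic (14-14) and F-type statistic (14-13) for beta_i = beta_M."""
    R = np.hstack([np.eye((M - 1) * K),
                   -np.kron(np.ones((M - 1, 1)), np.eye(K))])   # as in (14-15)
    d = R @ beta                                  # q = 0 under the hypothesis
    wald = float(d @ np.linalg.solve(R @ V @ R.T, d))
    J = (M - 1) * K
    return wald, wald / J, J, R


# Simulated multivariate regression with different coefficients per equation
rng = np.random.default_rng(2)
M, T, K = 3, 30, 2
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.5, 0.4],
                  [0.2, 0.4, 2.0]])
eps = rng.multivariate_normal(np.zeros(M), Sigma, size=T)
betas = [np.array([1.0, 2.0]), np.array([1.5, 1.0]), np.array([0.0, 3.0])]
X_list = [np.column_stack([np.ones(T), rng.normal(size=T)]) for _ in range(M)]
y_list = [X_list[i] @ betas[i] + eps[:, i] for i in range(M)]

beta_hat, V_hat, S = sur_fgls(X_list, y_list)
wald, F_hat, J, R = homogeneity_test(beta_hat, V_hat, M, K)
print(f"Wald = {wald:.2f}, F = {F_hat:.2f}, J = {J}")
```

    The Wald value would be compared with a chi-squared critical value with J degrees of freedom, and the F value with the F table using J and MT - K degrees of freedom. Forming the MT x MT weighting matrix explicitly is wasteful for large T; it is done here only to keep the sketch close to the formulas.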
    14.2.4 MAXIMUM LIKELIHOOD ESTIMATION

    The Oberhofer-Kmenta (1974) conditions [see Section 11.7.2] are met for the seemingly unrelated regressions model, so maximum likelihood estimates can be obtained by iterating the FGLS procedure. We note, once again, that this procedure presumes the use of (14-9) for estimation of $\sigma_{ij}$ at each iteration. Maximum likelihood enjoys no advantages over FGLS in its asymptotic properties.¹⁴ Whether it would be preferable in a small sample is an open question whose answer will depend on the particular data set. By simply inserting the special form of $\boldsymbol{\Omega}$ in the log-likelihood function for the generalized regression model in (10-32), we can consider direct maximization instead of iterated FGLS. It is useful, however, to reexamine the model in a somewhat different formulation. This alternative construction of the likelihood function appears in many other related models in a number of literatures.

    ¹³ See Judge et al. (1985, p. 476). The Wald statistic often performs poorly in the small sample sizes typical in this area. Feibig (2001, pp. 108-110) surveys a recent literature on methods of improving the power of testing procedures in SUR models.
    ¹⁴ Jensen (1995) considers some variation on the computation of the asymptotic covariance matrix for the estimator that allows for the possibility that the normality assumption might be violated.


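    Consider one observation on each of the M dependent variables and their associated regressors. We wish to arrange this observation horizontally instead of vertically. The model for this observation can be written

    $$[y_1 \; y_2 \; \cdots \; y_M]_t = [\mathbf{x}_t^*]'[\boldsymbol{\pi}_1 \; \boldsymbol{\pi}_2 \; \cdots \; \boldsymbol{\pi}_M] + [\varepsilon_1 \; \varepsilon_2 \; \cdots \; \varepsilon_M]_t = [\mathbf{x}_t^*]'\boldsymbol{\Pi}' + \mathbf{E}, \tag{14-16}$$

    where $\mathbf{x}_t^*$ is the full set of all $K^*$ different independent variables that appear in the model. The parameter matrix then has one column for each equation, but the columns are not the same as $\boldsymbol{\beta}_i$ in (14-4) unless every variable happens to appear in every equation. Otherwise, in the i-th equation, $\boldsymbol{\pi}_i$ will have a number of zeros in it, each one imposing an exclusion restriction. For example, consider the GM and GE equations from the Boot-de Witt data in Example 14.1. The t-th observation would be

    $$[I_g \; I_e]_t = [1 \; F_g \; C_g \; F_e \; C_e]_t
    \begin{bmatrix}
    \alpha_g & \alpha_e\\
    \beta_{1g} & 0\\
    \beta_{2g} & 0\\
    0 & \beta_{1e}\\
    0 & \beta_{2e}
    \end{bmatrix} + [\varepsilon_g \; \varepsilon_e]_t.$$

    This vector is one observation. Let $\boldsymbol{\varepsilon}_t$ be the vector of M disturbances for this observation arranged, for now, in a column. Then $E[\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_t'] = \boldsymbol{\Sigma}$. The log of the joint normal density of these M disturbances is

    $$\log L_t = -\frac{M}{2}\log(2\pi) - \frac{1}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2}\boldsymbol{\varepsilon}_t'\boldsymbol{\Sigma}^{-1}\boldsymbol{\varepsilon}_t. \tag{14-17}$$

    The log-likelihood for a sample of T joint observations is the sum of these over t:

    $$\log L = \sum_{t=1}^{T}\log L_t = -\frac{MT}{2}\log(2\pi) - \frac{T}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{t=1}^{T}\boldsymbol{\varepsilon}_t'\boldsymbol{\Sigma}^{-1}\boldsymbol{\varepsilon}_t. \tag{14-18}$$

    The term in the summation in (14-18) is a scalar that equals its trace. We can always permute the matrices in a trace, so

    $$\sum_{t=1}^{T}\boldsymbol{\varepsilon}_t'\boldsymbol{\Sigma}^{-1}\boldsymbol{\varepsilon}_t = \sum_{t=1}^{T}\text{tr}(\boldsymbol{\varepsilon}_t'\boldsymbol{\Sigma}^{-1}\boldsymbol{\varepsilon}_t) = \sum_{t=1}^{T}\text{tr}(\boldsymbol{\Sigma}^{-1}\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_t').$$

    This can be further simplified. The sum of the traces of T matrices equals the trace of the sum of the matrices [see (A-91)]. We will now also be able to move the constant matrix, $\boldsymbol{\Sigma}^{-1}$, outside the summation. Finally, it will prove useful to multiply and divide by T. Combining all three steps, we obtain

    $$\sum_{t=1}^{T}\text{tr}(\boldsymbol{\Sigma}^{-1}\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_t') = T\,\text{tr}\left[\boldsymbol{\Sigma}^{-1}\left(\frac{1}{T}\right)\sum_{t=1}^{T}\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_t'\right] = T\,\text{tr}(\boldsymbol{\Sigma}^{-1}\mathbf{W}), \tag{14-19}$$

    where

    $$W_{ij} = \frac{1}{T}\sum_{t=1}^{T}\varepsilon_{ti}\varepsilon_{tj}.$$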
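    The trace manipulation leading to (14-19) can be checked numerically. The following is a small Python/NumPy sketch with simulated disturbances; the sizes and the matrix Sigma are illustrative assumptions. It computes the sum of the T quadratic forms directly and compares it with T tr(Sigma^{-1} W).

```python
import numpy as np

rng = np.random.default_rng(3)
T, M = 25, 3                                    # illustrative sizes
Sigma = np.array([[1.0, 0.4, 0.1],
                  [0.4, 2.0, 0.3],
                  [0.1, 0.3, 1.5]])             # hypothetical covariance
E = rng.multivariate_normal(np.zeros(M), Sigma, size=T)   # row t holds eps_t'
Sigma_inv = np.linalg.inv(Sigma)

# Left-hand side: sum over t of the scalar eps_t' Sigma^{-1} eps_t
quad_sum = sum(e @ Sigma_inv @ e for e in E)

# Right-hand side: T * tr(Sigma^{-1} W), with W the mean cross-product matrix
W = E.T @ E / T
trace_form = T * np.trace(Sigma_inv @ W)

print(np.isclose(quad_sum, trace_form))
```

    The practical gain is that the likelihood depends on the data only through the M x M matrix W, so each evaluation needs W once rather than a loop over the T observations.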


    Since this step uses actual disturbances E Wi j i j W is the M M matrix we would use to estimate if the s were actually observed Inserting this result in the log likelihood we have T log L M log 2 log tr 2 We now consider maximizing this function It has been shown15 that T log L X E 1 2
    1

    W

    14 20

    14 21 T 1 log L 1 W 2 where the x in 14 16 is row t of X Equating the second of these derivatives to a zero t matrix we see that given the maximum likelihood estimates of the slope parameters the maximum likelihood estimator of is W the matrix of mean residual sums of squares and cross products that is the matrix we have used for FGLS Notice that there is no correction for degrees of freedom log L 0 implies 14 9 We also know that because this model is a generalized regression model the maximum likelihood estimator of the parameter matrix must be equivalent to the FGLS estimator we discussed earlier 16 It is useful to go a step further If we insert our solution for in the likelihood function then we obtain the concentrated log likelihood T log Lc M 1 log 2 log W 14 22 2 We have shown therefore that the criterion for choosing the maximum likelihood estimator of is 1 ML Min 2 log W 14 23 subject to the exclusion restrictions This important result reappears in many other models and settings This minimization must be done subject to the constraints in the parameter matrix In our two equation example there are two blocks of zeros in the parameter matrix which must be present in the MLE as well The estimator of is the set of nonzero elements in the parameter matrix in 14 16 The likelihood ratio statistic is an alternative to the F statistic discussed earlier for testing hypotheses about The likelihood ratio statistic is 2 log Lr log Lu T log Wr log Wu 17 14 24 where Wr and Wu are the residual sums of squares and cross product matrices using the constrained and unconstrained estimators respectively The likelihood ratio statistic is asymptotically distributed as chi squared with degrees of freedom equal to the number of restrictions This procedure can also be used to test the homogeneity restriction in the multivariate regression model The restricted model is the covariance structures model discussed in Section 13 9 in the preceding chapter
    15. See, for example, Joreskog (1973).
    16. This equivalence establishes the Oberhofer-Kmenta conditions.
    17. See Attfield (1998) for refinements of this calculation to improve the small sample performance.
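    The concentrated log-likelihood and the likelihood ratio statistic depend on the data only through the residual cross-product matrices, so both can be sketched in a few lines. The function names and the use of NumPy below are illustrative assumptions, not part of the text; the inputs are taken to be $T \times M$ matrices of residuals from the restricted and unrestricted fits.

```python
import numpy as np

def concentrated_loglik(E):
    """Concentrated log-likelihood, -(T/2)[M(1 + log 2*pi) + log|W|], with W = E'E/T."""
    T, M = E.shape
    W = E.T @ E / T                       # mean cross products, no d.f. correction
    sign, logdet = np.linalg.slogdet(W)
    if sign <= 0:
        raise ValueError("W must be positive definite")
    return -0.5 * T * (M * (1.0 + np.log(2.0 * np.pi)) + logdet)

def lr_statistic(E_restricted, E_unrestricted):
    """Likelihood ratio statistic T(log|W_r| - log|W_u|)."""
    T = E_unrestricted.shape[0]
    logdet_r = np.linalg.slogdet(E_restricted.T @ E_restricted / T)[1]
    logdet_u = np.linalg.slogdet(E_unrestricted.T @ E_unrestricted / T)[1]
    return T * (logdet_r - logdet_u)
```

    Working with `slogdet` rather than `log(det(...))` avoids overflow when the residual variances are large, as they are in the investment data used in this chapter.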


    It may also be of interest to test whether $\Sigma$ is a diagonal matrix. Two possible approaches were suggested in Section 13.9.6 [see (13-67) and (13-68)]. The unrestricted model is the one we are using here, whereas the restricted model is the groupwise heteroscedastic model of Section 11.7.2 (Example 11.5), without the restriction of equal parameter vectors. As such, the restricted model reduces to separate regression models, estimable by ordinary least squares. The likelihood ratio statistic would be

    $$\lambda_{LR} = T\left[\sum_{i=1}^{M} \log \hat\sigma_i^2 - \log|\hat\Sigma|\right], \tag{14-25}$$

    where $\hat\sigma_i^2$ is $e_i'e_i/T$ from the individual least squares regressions and $\hat\Sigma$ is the maximum likelihood estimator of $\Sigma$. This statistic has a limiting chi-squared distribution with $M(M-1)/2$ degrees of freedom under the hypothesis. The alternative suggested by Breusch and Pagan (1980) is the Lagrange multiplier statistic,

    $$\lambda_{LM} = T\sum_{i=2}^{M}\sum_{j=1}^{i-1} r_{ij}^2, \tag{14-26}$$

    where $r_{ij}$ is the estimated correlation $\hat\sigma_{ij}/[\hat\sigma_{ii}\hat\sigma_{jj}]^{1/2}$. This statistic also has a limiting chi-squared distribution with $M(M-1)/2$ degrees of freedom. This test has the advantage that it does not require computation of the maximum likelihood estimator of $\Sigma$, since it is based on the OLS residuals.
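    Both statistics for the diagonality hypothesis are simple functions of residual matrices. The sketch below assumes two $T \times M$ arrays are already available, the equation-by-equation OLS residuals and the residuals at the maximum likelihood estimates; the function name and argument names are hypothetical.

```python
import numpy as np

def diagonal_sigma_tests(E_ols, E_ml):
    """LR and LM statistics for H0: Sigma is diagonal.

    E_ols: T x M OLS residuals; E_ml: T x M residuals at the ML estimates.
    Returns (lr, lm, degrees_of_freedom).
    """
    T, M = E_ols.shape
    S_ols = E_ols.T @ E_ols / T
    s2 = np.diag(S_ols)                          # sigma_i^2 = e_i'e_i / T
    R = S_ols / np.sqrt(np.outer(s2, s2))        # residual correlations r_ij
    lm = T * np.sum(np.tril(R, k=-1) ** 2)       # sum over i > j of r_ij^2
    logdet_ml = np.linalg.slogdet(E_ml.T @ E_ml / T)[1]
    lr = T * (np.sum(np.log(s2)) - logdet_ml)
    return lr, lm, M * (M - 1) // 2
```

    Only the LR statistic needs the second argument; the LM statistic uses the OLS residuals alone, which is the computational advantage noted above.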
    Example 14.2  Estimates of a Seemingly Unrelated Regressions Model

    By relaxing the constraint that all five firms have the same parameter vector, we obtain a five-equation seemingly unrelated regression model. The FGLS estimates for the system are given in Table 14.1, where we have included the equality constrained (pooled) estimator from the covariance structures model in Table 13.4 for comparison. The variables are the constant terms, F and C, respectively. The correlations of the FGLS and equality constrained FGLS residuals are given below the coefficient estimates in Table 14.1. The assumption of equal parameter vectors appears to have seriously distorted the correlations computed earlier. We would have expected this based on the comparison of Figures 14.1 and 14.2. The diagonal elements in $\hat\Sigma$ are also drastically inflated by the imposition of the homogeneity constraint. The equation-by-equation OLS estimates are given in Table 14.2. As expected, the estimated standard errors for the FGLS estimates are generally smaller. The F statistic for testing the hypothesis of equal parameter vectors in all five equations is 129.169 with 12 and (100 - 15) degrees of freedom. This value is far larger than the tabled critical value of 1.868, so the hypothesis of parameter homogeneity should be rejected. We might have expected this result in view of the dramatic reduction in the diagonal elements of $\hat\Sigma$ compared with those of the pooled estimator. The maximum likelihood estimates of the parameters are given in Table 14.3. The log determinant of the unrestricted maximum likelihood estimator of $\Sigma$ is 31.71986, so the log-likelihood is

    $$\log L_u = -\frac{20}{2}\left[5(\log 2\pi + 1)\right] - \frac{20}{2}(31.71986) = -459.0925.$$

    The restricted model with equal parameter vectors and correlation across equations is discussed in Section 13.9.6, and the restricted MLEs are given in Table 13.4. (The estimate of $\Sigma$ is not shown there.) The log determinant for the constrained model is 39.1385. The log-likelihood for the constrained model is therefore -515.422. The likelihood ratio test statistic is 112.66. The 1 percent critical value from the chi-squared distribution with 12 degrees of freedom is 26.217, so the hypothesis that the parameters in all five equations are equal is (once again) rejected.
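    The arithmetic behind the unrestricted log-likelihood and the likelihood ratio statistic in this example can be checked directly. The short script below only re-evaluates the formula with the reported values ($T = 20$, $M = 5$, log determinant 31.71986, constrained log-likelihood -515.422); the variable names are illustrative.

```python
import math

T, M = 20, 5
logdet_u = 31.71986                 # reported log|W| for the unrestricted MLE
loglik_r = -515.422                 # reported constrained log-likelihood

# Concentrated log-likelihood: -(T/2)[M(log 2*pi + 1)] - (T/2) log|W|
loglik_u = -(T / 2) * (M * (math.log(2 * math.pi) + 1.0)) - (T / 2) * logdet_u

# Likelihood ratio statistic: -2(log L_r - log L_u)
lr = -2.0 * (loglik_r - loglik_u)

print(f"log L_u = {loglik_u:.4f}, LR = {lr:.2f}")
```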


    TABLE 14.1  FGLS Parameter Estimates (Standard Errors in Parentheses)

    GM  CH  GE  WE  US  Pooled

    $\beta_1$  $\beta_2$  $\beta_3$

    162 36 89 46 0 12049 0 0216 0 38275 0 0328 7216 04 10050 52 313 70 4 8051 605 34 7160 67 129 89 1400 75 2686 5 4439 99

    0 5043 11 51 0 06955 0 0169 0 3086 0 0259 0 299 0 349 152 85 305 61 2 0474 1966 65 16 661 123 921 455 09 2158 595

    22 439 25 52 0 03729 0 0123 0 13078 0 0221 0 269 0 248 0 006 0 158 700 46 34556 6 200 32 4274 0 1224 4 28722 0

    1 0889 6 2959 0 05701 0 0114 0 0415 0 0412 0 257 0 356 0 238 0 246 0 777 0 895 94 912 833 6 652 72 2893 7

    85 423 111 9 0 1015 0 0547 0 3999 0 1278 0 330 0 716 0 384 0 244 0 482 0 176 0 699 0 040 9188 2 34468 9

    28 247 4 888 0 08910 0 00507 0 3340 0 0167

    FGLS Residual Covariance and Correlation Matrices Pooled estimates GM CH GE WE US

    TABLE 14.2  OLS Parameter Estimates (Standard Errors in Parentheses)

                     GM          CH          GE          WE          US          Pooled
    $\beta_1$       -149.78     -6.1899     -9.956      -0.5094     -30.369     -48.030
                    (105.84)    (13.506)    (31.374)    (8.0152)    (157.05)    (21.480)
    $\beta_2$        0.11928     0.07795     0.02655     0.05289     0.1566      0.10509
                    (0.0258)    (0.0198)    (0.0157)    (0.0157)    (0.0789)    (0.01378)
    $\beta_3$        0.37144     0.3157      0.15169     0.0924      0.4239      0.30537
                    (0.0371)    (0.0288)    (0.0257)    (0.0561)    (0.1552)    (0.04351)
    $\sigma^2$       7160.29     149.872     660.329     88.662      8896.42     15857.24

    Based on the OLS results, the Lagrange multiplier statistic is 29.046, with 10 degrees of freedom. The 1 percent critical value is 23.209, so the hypothesis that $\Sigma$ is diagonal can also be rejected. To compute the likelihood ratio statistic for this test, we would compute the log determinant based on the least squares results. This would be the sum of the logs of the residual variances given in Table 14.2, which is 33.95706. The statistic for the likelihood ratio test using (14-25) is therefore 20(33.95706 - 31.71986) = 44.744. This is also larger than the critical value from the table. Based on all these results, we conclude that neither the parameter homogeneity restriction nor the assumption of uncorrelated disturbances appears to be consistent with our data.
    14.2.5  AN APPLICATION FROM FINANCIAL ECONOMETRICS: THE CAPITAL ASSET PRICING MODEL

    One of the growth areas in econometrics is its application to the analysis of financial markets.¹⁸ The capital asset pricing model (CAPM) is one of the foundations of that field and is a frequent subject of econometric analysis.

    18. The pioneering work of Campbell, Lo, and MacKinlay (1997) is a broad survey of the field. The development in this example is based on their Chapter 5.


    TABLE 14.3  Maximum Likelihood Estimates

    GM  CH  GE  WE  US  Pooled

    $\beta_1$  $\beta_2$  $\beta_3$

    173 218 84 30 0 122040 0 02025 0 38914 0 03185 7307 30 330 55 550 27 118 83 2879 10

    2 39111 11 63 0 06741 0 01709 0 30520 0 02606

    16 662 24 96 0 0371 0 0118 0 11723 0 0217

    4 37312 6 018 0 05397 0 0103 0 026930 0 03708

    136 969 94 8 0 08865 0 0454 0 31246 0 118

    2 217 1 960 0 02361 0 00429 0 17095 0 0152

    Residual Covariance Matrix GM CH GE WE US 155 08 11 429 18 376 463 21

    741 22 220 33 1408 11

    103 13 734 83

    9671 4

    Markowitz (1959) developed a theory of an individual investor's optimal portfolio selection in terms of the trade-off between expected return (mean) and risk (variance). Sharpe (1964) and Lintner (1965) showed how the theory could be extended to the aggregate "market" portfolio. The Sharpe and Lintner analyses produce the following model for the expected excess return from an asset $i$:

    $$E[R_i] - R_f = \beta_i\left(E[R_m] - R_f\right),$$

    where $R_i$ is the return on asset $i$, $R_f$ is the return on a "risk-free" asset, $R_m$ is the return on the market's optimal portfolio, and $\beta_i$ is the asset's market "beta,"

    $$\beta_i = \frac{\operatorname{Cov}[R_i, R_m]}{\operatorname{Var}[R_m]}.$$

    The theory states that the expected excess return on asset $i$ will equal $\beta_i$ times the expected excess return on the market's portfolio. Black (1972) considered the more general case in which there is no risk-free asset. In this instance, the observed $R_f$ is replaced by the unobservable return on a "zero-beta" portfolio, $E[R_0] = \gamma$.

    The empirical counterpart to the Sharpe and Lintner model for assets $i = 1, \ldots, N$, observed over $T$ periods, $t = 1, \ldots, T$, is a seemingly unrelated regressions (SUR) model, which we cast in the form of (14-16):

    $$[y_1, y_2, \ldots, y_N]_t = [1, z_t]\begin{bmatrix}\alpha_1 & \alpha_2 & \cdots & \alpha_N\\ \beta_1 & \beta_2 & \cdots & \beta_N\end{bmatrix} + [\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_N]_t = x_t'\Pi + \varepsilon_t',$$

    where $y_{it}$ is $R_{it} - R_{ft}$, the observed excess return on asset $i$ in period $t$; $z_t$ is $R_{mt} - R_{ft}$, the market excess return in period $t$; and the disturbances $\varepsilon_{it}$ are the deviations from the conditional means. We define the $T \times 2$ matrix $X = ([1, z_t],\ t = 1, \ldots, T)$. The assumptions of the seemingly unrelated regressions model are

    1. $E[\varepsilon_t \mid X] = E[\varepsilon_t] = 0$,
    2. $\operatorname{Var}[\varepsilon_t \mid X] = E[\varepsilon_t\varepsilon_t' \mid X] = \Sigma$, a positive definite $N \times N$ matrix,
    3. $\varepsilon_t \mid X \sim N[0, \Sigma]$.


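    The data are also assumed to be "well behaved" so that

    4. $\operatorname{plim} \bar z = E[z_t] = \mu_z$,
    5. $\operatorname{plim} s_z^2 = \operatorname{plim} \frac{1}{T}\sum_{t=1}^{T}(z_t - \bar z)^2 = \operatorname{Var}[z_t] = \sigma_z^2$.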

    Since this model is a particular case of the one in (14-16), we can proceed to (14-20) through (14-23) for the maximum likelihood estimators of $\Pi$ and $\Sigma$. Indeed, since this model is an unrestricted SUR model with the same regressor(s) in every equation, we know from our results in Section 14.2.2 that the GLS and maximum likelihood estimators are simply equation-by-equation ordinary least squares and that the estimator of $\Sigma$ is just $S$, the sample covariance matrix of the least squares residuals. The asymptotic covariance matrix for the $2N \times 1$ estimator $[a, b]'$ will be

    $$\operatorname{Asy.Var}[a, b]' = \frac{1}{T}\operatorname{plim}\left[\left(\frac{X'X}{T}\right)^{-1} \otimes \Sigma\right] = \frac{1}{T\sigma_z^2}\begin{bmatrix}\sigma_z^2 + \mu_z^2 & -\mu_z\\ -\mu_z & 1\end{bmatrix} \otimes \Sigma,$$

    which we will estimate with $(X'X)^{-1} \otimes S$. [$\operatorname{plim} z'z/T = \operatorname{plim}[(1/T)\sum_t (z_t - \bar z)^2 + \bar z^2] = \sigma_z^2 + \mu_z^2$.]

    The model above does not impose the Markowitz-Sharpe-Lintner hypothesis, $H_0: \alpha = 0$. A Wald test of $H_0$ can be based on the unrestricted least squares estimates:

    $$W = (a - 0)'\{\operatorname{Est.Asy.Var}[a - 0]\}^{-1}(a - 0) = a'[(X'X)^{11}S]^{-1}a = \left(\frac{T s_z^2}{s_z^2 + \bar z^2}\right)a'S^{-1}a.$$

    To carry out this test, we now require that $T$ be greater than or equal to $N$, so that $S = (1/T)\sum_t e_t e_t'$ will have full rank. The assumption was not necessary until this point. Under the null hypothesis, the statistic has a limiting chi-squared distribution with $N$ degrees of freedom. The small-sample misbehavior of the Wald statistic has been widely observed. An alternative that is likely to be better behaved is $[(T - N - 1)/(NT)]W$, which is exactly distributed as $F[N, T - N - 1]$ under the null hypothesis. To carry out a likelihood ratio or Lagrange multiplier test of the null hypothesis, we will require the restricted estimates. By setting $\alpha = 0$ in the model, we obtain, once again, a SUR model with identical regressor, so the restricted maximum likelihood estimators are $a_{0i} = 0$ and $b_{0i} = y_i'z/z'z$. The restricted estimator of $\Sigma$ is, as before, the matrix of mean squares and cross products of the residuals, now $S_0$. The chi-squared statistic for the likelihood ratio test is given in (14-24); for this application, it would be

    $$\lambda = T\left(\ln|S_0| - \ln|S|\right).$$

    To compute the LM statistic, we will require the derivatives of the unrestricted log-likelihood function, evaluated at the restricted estimators, which are given in (14-21). For this model, they may be written

    $$\frac{\partial \ln L}{\partial \alpha_i} = \sum_{j=1}^{N}\sigma^{ij}\left(\sum_{t=1}^{T}\varepsilon_{jt}\right) = \sum_{j=1}^{N}\sigma^{ij}(T\bar\varepsilon_j),$$

    where $\sigma^{ij}$ is the $ij$th element of $\Sigma^{-1}$, and

    $$\frac{\partial \ln L}{\partial \beta_i} = \sum_{j=1}^{N}\sigma^{ij}\left(\sum_{t=1}^{T}z_t\varepsilon_{jt}\right) = \sum_{j=1}^{N}\sigma^{ij}(z'\varepsilon_j).$$


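    The first derivatives with respect to $\beta$ will be zero at the restricted estimates, since the terms in parentheses are the normal equations for restricted least squares; remember, the residuals are now $e_{0it} = y_{it} - b_{0i}z_t$. The first vector of first derivatives can be written as

    $$\frac{\partial \ln L}{\partial \alpha} = \Sigma^{-1}E'i = \Sigma^{-1}(T\bar\varepsilon),$$

    where $i$ is a $T \times 1$ vector of 1s, $E$ is a $T \times N$ matrix of disturbances, and $\bar\varepsilon$ is the $N \times 1$ vector of means of asset-specific disturbances. (The second subvector is $\partial \ln L/\partial \beta = \Sigma^{-1}E'z$.) Since $\partial \ln L/\partial \beta = 0$ at the restricted estimates, the LM statistic involves only the upper left submatrix of $-H^{-1}$. Combining terms and inserting the restricted estimates, we obtain

    $$LM = \left[T\bar e_0'S_0^{-1} : 0'\right]\left[X'X \otimes S_0^{-1}\right]^{-1}\left[T\bar e_0'S_0^{-1} : 0'\right]' = T^2(X'X)^{11}\,\bar e_0'S_0^{-1}\bar e_0 = T\left(\frac{s_z^2 + \bar z^2}{s_z^2}\right)\bar e_0'S_0^{-1}\bar e_0.$$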
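    The three statistics for $H_0: \alpha = 0$ can be computed from two least squares fits, one with and one without the constant term. The following is a minimal sketch under the model's assumptions; the function name, the dictionary return value, and the F scaling $(T - N - 1)/(NT)$ applied to the Wald statistic are choices made for this illustration.

```python
import numpy as np

def capm_alpha_tests(Y, z):
    """Wald, LR, LM, and F statistics for H0: alpha = 0.

    Y: T x N matrix of excess asset returns; z: length-T market excess return.
    Assumes T >= N + 2 so that S is nonsingular and the F form is defined.
    """
    T, N = Y.shape
    X = np.column_stack([np.ones(T), z])
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)   # row 0 holds a, row 1 holds b
    a = coef[0]
    E = Y - X @ coef
    S = E.T @ E / T                                 # unrestricted estimator of Sigma

    b0 = (z @ Y) / (z @ z)                          # restricted slopes, no constant
    E0 = Y - np.outer(z, b0)
    S0 = E0.T @ E0 / T                              # restricted estimator of Sigma
    e0_bar = E0.mean(axis=0)

    zbar = z.mean()
    s2z = np.mean((z - zbar) ** 2)
    wald = T * s2z / (s2z + zbar ** 2) * (a @ np.linalg.solve(S, a))
    lr = T * (np.linalg.slogdet(S0)[1] - np.linalg.slogdet(S)[1])
    lm = T * (s2z + zbar ** 2) / s2z * (e0_bar @ np.linalg.solve(S0, e0_bar))
    f_stat = (T - N - 1) / (N * T) * wald
    return {"wald": wald, "lr": lr, "lm": lm, "f": f_stat}
```

    In any sample the three chi-squared statistics computed this way satisfy Wald >= LR >= LM, which is a useful sanity check on an implementation.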

    Under the null hypothesis, the limiting distribution of LM is chi-squared with $N$ degrees of freedom.

    The model formulation gives $E[R_{it}] = R_{ft} + \beta_i(E[R_{mt}] - R_{ft})$. If there is no risk-free asset but we write the model in terms of $\gamma$, the unknown return on a zero-beta portfolio, then we obtain

    $$R_{it} = \gamma + \beta_i(R_{mt} - \gamma) + \varepsilon_{it} = (1 - \beta_i)\gamma + \beta_i R_{mt} + \varepsilon_{it}.$$

    This is essentially the same as the original model, with two modifications. First, the observables in the model are real returns, not excess returns, which defines the way the data enter the model. Second, there are nonlinear restrictions on the parameters; $\alpha_i = (1 - \beta_i)\gamma$. Although the unrestricted model has $2N$ free parameters, Black's formulation implies $N - 1$ restrictions and leaves $N + 1$ free parameters. The nonlinear restrictions will complicate finding the maximum likelihood estimators. We do know from (14-21) that regardless of what the estimators of $\beta_i$ and $\gamma$ are, the estimator of $\Sigma$ is still $S = (1/T)E'E$. So, we can concentrate the log-likelihood function. The Oberhofer and Kmenta (1974) results imply that we may simply zigzag back and forth between $S$ and $(\hat\beta, \hat\gamma)$. (See Section 11.7.2.) Second, although maximization over $(\beta, \gamma)$ remains complicated, maximization over $\beta$ for known $\gamma$ is trivial. For a given value of $\gamma$, the maximum likelihood estimator of $\beta_i$ is the slope in the linear regression without a constant term of $(R_{it} - \gamma)$ on $(R_{mt} - \gamma)$. Thus, the full set of maximum likelihood estimators may be found just by scanning over the admissible range of $\gamma$ to locate the value that maximizes

    $$\ln L_c = -\tfrac{1}{2}\ln|S(\gamma)|,$$

    where

    $$s_{ij}(\gamma) = \frac{1}{T}\sum_{t=1}^{T}\left\{R_{it} - \gamma[1 - \hat\beta_i(\gamma)] - \hat\beta_i(\gamma)R_{mt}\right\}\left\{R_{jt} - \gamma[1 - \hat\beta_j(\gamma)] - \hat\beta_j(\gamma)R_{mt}\right\}$$

    and

    $$\hat\beta_i(\gamma) = \frac{\sum_{t=1}^{T}(R_{it} - \gamma)(R_{mt} - \gamma)}{\sum_{t=1}^{T}(R_{mt} - \gamma)^2}.$$
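    The scan over $\gamma$ is a one-dimensional grid search in which every candidate value requires only a no-constant regression for each asset. A minimal sketch follows; the function name and the idea of passing an explicit grid of candidate values are assumptions of this illustration, and a finer grid or a scalar optimizer could replace the loop.

```python
import numpy as np

def black_capm_scan(R, Rm, gammas):
    """Grid search for the zero-beta return in the Black model.

    R: T x N asset returns; Rm: length-T market return; gammas: candidate values.
    Returns (criterion, gamma, beta, S) at the best candidate.
    """
    T, N = R.shape
    best = None
    for g in gammas:
        x = Rm - g
        beta = (R - g).T @ x / (x @ x)      # slope of (R_it - g) on (R_mt - g), no constant
        E = (R - g) - np.outer(x, beta)     # equals R_it - g(1 - beta_i) - beta_i R_mt
        S = E.T @ E / T
        crit = -0.5 * np.linalg.slogdet(S)[1]
        if best is None or crit > best[0]:
            best = (crit, g, beta, S)
    return best
```

    Because the criterion is evaluated independently at each grid point, the search is robust to multiple local optima within the scanned range, at the cost of resolution limited by the grid spacing.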





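    For inference purposes, an estimator of the asymptotic covariance matrix of the estimators is required. The log-likelihood for this model is

    $$\ln L = -\frac{T}{2}\left[N\ln 2\pi + \ln|\Sigma|\right] - \frac{1}{2}\sum_{t=1}^{T}\varepsilon_t'\Sigma^{-1}\varepsilon_t,$$

    where the $N \times 1$ vector $\varepsilon_t$ is $\varepsilon_{it} = [R_{it} - \gamma(1 - \beta_i) - \beta_i R_{mt}]$, $i = 1, \ldots, N$. The derivatives of the log-likelihood can be written

    $$\frac{\partial \ln L}{\partial[\beta'\ \gamma]'} = \sum_{t=1}^{T}\begin{bmatrix}(R_{mt} - \gamma)\Sigma^{-1}\varepsilon_t\\ (i - \beta)'\Sigma^{-1}\varepsilon_t\end{bmatrix} = \sum_{t=1}^{T}g_t.$$

    (We have omitted $\Sigma$ from the gradient because the expected Hessian is block diagonal, and, at present, $\Sigma$ is tangential.) With the derivatives in this form, we have

    $$E[g_t g_t'] = \begin{bmatrix}(R_{mt} - \gamma)^2\Sigma^{-1} & (R_{mt} - \gamma)\Sigma^{-1}(i - \beta)\\ (R_{mt} - \gamma)(i - \beta)'\Sigma^{-1} & (i - \beta)'\Sigma^{-1}(i - \beta)\end{bmatrix}. \tag{14-27}$$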

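    Now, sum this expression over $t$ and use the result that

    $$\sum_{t=1}^{T}(R_{mt} - \gamma)^2 = \sum_{t=1}^{T}(R_{mt} - \bar R_m)^2 + T(\bar R_m - \gamma)^2 = T\left[s_{Rm}^2 + (\bar R_m - \gamma)^2\right]$$

    to obtain the negative of the expected Hessian,

    $$-E\left[\frac{\partial^2 \ln L}{\partial\binom{\beta}{\gamma}\partial\binom{\beta}{\gamma}'}\right] = T\begin{bmatrix}\left[s_{Rm}^2 + (\bar R_m - \gamma)^2\right]\Sigma^{-1} & (\bar R_m - \gamma)\Sigma^{-1}(i - \beta)\\ (\bar R_m - \gamma)(i - \beta)'\Sigma^{-1} & (i - \beta)'\Sigma^{-1}(i - \beta)\end{bmatrix}. \tag{14-28}$$

    The inverse of this matrix provides the estimator for the asymptotic covariance matrix. Using (A-74), after some manipulation we find that

    $$\operatorname{Asy.Var}[\hat\gamma] = \frac{1}{T}\left[1 + \frac{(\mu_{Rm} - \gamma)^2}{\sigma_{Rm}^2}\right]\left[(i - \beta)'\Sigma^{-1}(i - \beta)\right]^{-1},$$

    where $\mu_{Rm} = \operatorname{plim}\bar R_m$ and $\sigma_{Rm}^2 = \operatorname{plim} s_{Rm}^2$.

    A likelihood ratio test of the Black model requires the restricted estimates of the parameters. The unrestricted model is the SUR model for the real returns, $R_{it}$, on the market returns, $R_{mt}$, with $N$ free constants, $\alpha_i$, and $N$ free slopes, $\beta_i$. Result (14-24) provides the test statistic. Once the estimates of $\beta_i$ and $\gamma$ are obtained, the implied estimates of $\alpha_i$ are given by $\alpha_i = (1 - \beta_i)\gamma$. With these estimates in hand, the LM statistic is exactly what it was before, although now all $2N$ derivatives will be required and $X$ is $[i\ \ R_m]$. The subscript $*$ indicates computation at the restricted estimates:

    $$LM = T\left(\frac{s_{Rm}^2 + \bar R_m^2}{s_{Rm}^2}\right)\bar e_*'S_*^{-1}\bar e_* + \left(\frac{1}{T s_{Rm}^2}\right)R_m'E_*S_*^{-1}E_*'R_m - \left(\frac{2\bar R_m}{s_{Rm}^2}\right)R_m'E_*S_*^{-1}\bar e_*.$$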


    A Wald test of the Black model would be based on the unrestricted estimators. The hypothesis appears to involve the unknown $\gamma$, but in fact, the theory implies only the $N - 1$ nonlinear restrictions $[\alpha_i/(1 - \beta_i)] - [\alpha_N/(1 - \beta_N)] = 0$, or $\alpha_i(1 - \beta_N) - \alpha_N(1 - \beta_i) = 0$. Write this set of $N - 1$ functions as $c(\alpha, \beta) = 0$. The Wald statistic based on the least squares estimates would then be

    $$W = c(a, b)'\{\operatorname{Est.Asy.Var}[c(a, b)]\}^{-1}c(a, b).$$

    Recall in the unrestricted model that $\operatorname{Asy.Var}[a, b] = (1/T)\operatorname{plim}(X'X/T)^{-1} \otimes \Sigma = \Delta$, say. Using the delta method (see Section D.2.7), the asymptotic covariance matrix for $c(a, b)$ would be

    $$\operatorname{Asy.Var}[c(a, b)] = \Gamma\Delta\Gamma', \quad\text{where}\quad \Gamma(\alpha, \beta) = \frac{\partial c(\alpha, \beta)}{\partial(\alpha', \beta')}.$$

    The $i$th row of the $(N - 1) \times 2N$ matrix $\Gamma$ has only four nonzero elements, one each in the $i$th and $N$th positions of each of the two subvectors.

    Before closing this lengthy example, we reconsider the assumptions of the model. There is ample evidence [e.g., Affleck-Graves and McDonald (1989)] that the normality assumption used in the preceding is not appropriate for financial returns. This fact in itself does not complicate the analysis very much. Although the estimators derived earlier are based on the normal likelihood, they are really only generalized least squares. As we have seen before (in Chapter 10), GLS is robust to distributional assumptions. The LM and LR tests we devised are not, however. Without the normality assumption, only the Wald statistics retain their asymptotic validity. As noted, the small-sample behavior of the Wald statistic can be problematic. The approach we have used elsewhere is to use an approximation, $F = W/J$, where $J$ is the number of restrictions, and refer the statistic to the more conservative critical values of the $F[J, q]$ distribution, where $q$ is the number of degrees of freedom in estimation. Thus, once again, the role of the normality assumption is quite minor.

    The homoscedasticity and nonautocorrelation assumptions are potentially more problematic. The latter almost certainly invalidates the entire model. [See Campbell, Lo, and MacKinlay (1997) for discussion.] If the disturbances are only heteroscedastic, then we can appeal to the well-established consistency of ordinary least squares in the generalized regression model. A GMM approach might seem to be called for, but GMM estimation in this context is irrelevant. In all cases, the parameters are exactly identified. What is needed is a robust covariance estimator for our (now) pseudo-maximum likelihood estimators. For the Sharpe-Lintner formulation, nothing more than the White estimator that we developed in Chapters 10 and 11 is required; after all, despite the complications of the models, the estimators both with and without the restrictions are ordinary least squares, equation by equation. For each equation separately, the robust asymptotic covariance matrix in (10-14) applies. For the least squares estimators $q_i = (a_i, b_i)$, we seek a robust estimator of

    $$\operatorname{Asy.Cov}[q_i, q_j] = \frac{1}{T}\operatorname{plim}\left(\frac{X'X}{T}\right)^{-1}\left(\frac{X'\Omega_{ij}X}{T}\right)\left(\frac{X'X}{T}\right)^{-1}.$$

    Assuming that $E[\varepsilon_{it}\varepsilon_{jt}] = \sigma_{ij}$, this matrix can be estimated with

    $$\operatorname{Est.Asy.Cov}[q_i, q_j] = [X'X]^{-1}\left[\sum_{t=1}^{T}x_t x_t' e_{it}e_{jt}\right][X'X]^{-1}.$$
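    The robust cross-equation covariance estimator is a White-type sandwich in which the squared residual is replaced by the product of residuals from two equations. A minimal sketch, assuming the common regressor matrix `X` and the two OLS residual vectors are already in hand (the function name is hypothetical):

```python
import numpy as np

def robust_cross_cov(X, e_i, e_j):
    """(X'X)^{-1} [sum_t x_t x_t' e_it e_jt] (X'X)^{-1} for equations i and j."""
    XtX_inv = np.linalg.inv(X.T @ X)
    # Weight each row x_t by e_it * e_jt, then form the sum of outer products.
    meat = (X * (e_i * e_j)[:, None]).T @ X
    return XtX_inv @ meat @ XtX_inv
```

    Setting `e_i` and `e_j` to the same residual vector gives the familiar single-equation White estimator.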


    To form a counterpart for the Black model, we will once again rely on the assumption that the asymptotic covariance of the MLE of $\Sigma$ and the MLE of $(\beta', \gamma)$ is zero. Then the "sandwich" estimator for this M estimator (see Section 17.8) is

    $$\operatorname{Est.Asy.Var}(\hat\beta, \hat\gamma) = A^{-1}BA^{-1},$$

    where $A$ appears in (14-28) and $B$ is in (14-27).

    14.2.6  MAXIMUM LIKELIHOOD ESTIMATION OF THE SEEMINGLY UNRELATED REGRESSIONS MODEL WITH A BLOCK OF ZEROS IN THE COEFFICIENT MATRIX

    In Section 14.2.2, we considered the special case of the SUR model with identical regressors in all equations. We showed there that in this case, OLS and GLS are identical. In the SUR model with normally distributed disturbances, GLS is the maximum likelihood estimator. It follows that when the regressors are identical, OLS is the maximum likelihood estimator. In this section, we consider a related case in which the coefficient matrix contains a block of zeros. The block of zeros is created by excluding the same subset of the regressors from some of but not all the equations in a model that, without the exclusion restriction, is a SUR with the same regressors in all equations.

    This case can be examined in the context of the derivation of the GLS estimator in (14-7), but it is much simpler to obtain the result we seek for the maximum likelihood estimator. The model we have described can be formulated as in (14-16) as follows. We first transpose the equation system in (14-16) so that observation $t$ on $y_1, \ldots, y_M$ is written

    $$y_t = \Pi x_t + \varepsilon_t.$$

    If we collect all $T$ observations in this format, then the system would appear as

    $$\underset{M \times T}{Y'} = \underset{M \times K}{\Pi}\ \underset{K \times T}{X'} + \underset{M \times T}{E'}.$$

    (Each row of $\Pi$ contains the parameters in a particular equation.) Now, consider once again a particular observation and partition the set of dependent variables into two groups of $M_1$ and $M_2$ variables and the set of regressors into two sets of $K_1$ and $K_2$ variables. The equation system is now

    $$\begin{bmatrix}y_1\\ y_2\end{bmatrix}_t = \begin{bmatrix}\Pi_{11} & \Pi_{12}\\ \Pi_{21} & \Pi_{22}\end{bmatrix}\begin{bmatrix}x_1\\ x_2\end{bmatrix}_t + \begin{bmatrix}\varepsilon_1\\ \varepsilon_2\end{bmatrix}_t, \quad E\left[\begin{matrix}\varepsilon_1\\ \varepsilon_2\end{matrix}\,\Big|\,X\right]_t = \begin{bmatrix}0\\ 0\end{bmatrix}, \quad \operatorname{Var}\left[\begin{matrix}\varepsilon_1\\ \varepsilon_2\end{matrix}\,\Big|\,X\right]_t = \begin{bmatrix}\Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{bmatrix}.$$

    Since this system is still a SUR model with identical regressors, the maximum likelihood estimators of the parameters are obtained using equation-by-equation least squares regressions. The case we are interested in here is the restricted model, with $\Pi_{12} = 0$, which has the effect of excluding $x_2$ from all the equations for $y_1$. The results we will obtain for this case are:

    1. The maximum likelihood estimator of $\Pi_{11}$ when $\Pi_{12} = 0$ is equation-by-equation least squares regression of the variables in $y_1$ on $x_1$ alone. That is, even with the restriction, the efficient estimator of the parameters of the first set of equations is equation-by-equation ordinary least squares. Least squares is not the efficient estimator for the second set, however.
    2. The effect of the restriction on the likelihood function can be isolated to its effect on the smaller set of equations. Thus, the hypothesis can be tested without estimating the larger set of equations.

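    We begin by considering maximum likelihood estimation of the unrestricted system. The log-likelihood function for this multivariate regression model is

    $$\ln L = \sum_{t=1}^{T}\ln f(y_{1t}, y_{2t} \mid x_{1t}, x_{2t}),$$

    where $f(y_{1t}, y_{2t} \mid x_{1t}, x_{2t})$ is the joint normal density of the two vectors. This result is (14-17) through (14-19) in a different form. We will now write this joint normal density as the product of a marginal and a conditional:

    $$f(y_{1t}, y_{2t} \mid x_{1t}, x_{2t}) = f(y_{1t} \mid x_{1t}, x_{2t})\,f(y_{2t} \mid y_{1t}, x_{1t}, x_{2t}).$$

    The mean and variance of the marginal distribution for $y_{1t}$ are just the upper portions of the preceding partitioned matrices:

    $$E[y_{1t} \mid x_{1t}, x_{2t}] = \Pi_{11}x_{1t} + \Pi_{12}x_{2t}, \qquad \operatorname{Var}[y_{1t} \mid x_{1t}, x_{2t}] = \Sigma_{11}.$$

    The results we need for the conditional distribution are given in Theorem B.6. Collecting terms, we have

    $$E[y_{2t} \mid y_{1t}, x_{1t}, x_{2t}] = \left[\Pi_{21} - \Sigma_{21}\Sigma_{11}^{-1}\Pi_{11}\right]x_{1t} + \left[\Pi_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Pi_{12}\right]x_{2t} + \left[\Sigma_{21}\Sigma_{11}^{-1}\right]y_{1t} = \Lambda_{21}x_{1t} + \Lambda_{22}x_{2t} + \Gamma y_{1t},$$

    $$\operatorname{Var}[y_{2t} \mid y_{1t}, x_{1t}, x_{2t}] = \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} = \Omega_{22}.$$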

    Finally, since the marginal distributions and the joint distribution are all multivariate normal, the conditional distribution is also. The objective of this partitioning is to partition the log-likelihood function likewise:

    $$\ln L = \sum_{t=1}^{T}\ln f(y_{1t}, y_{2t} \mid x_{1t}, x_{2t}) = \sum_{t=1}^{T}\ln\left[f(y_{1t} \mid x_{1t}, x_{2t})\,f(y_{2t} \mid y_{1t}, x_{1t}, x_{2t})\right] = \sum_{t=1}^{T}\ln f(y_{1t} \mid x_{1t}, x_{2t}) + \sum_{t=1}^{T}\ln f(y_{2t} \mid y_{1t}, x_{1t}, x_{2t}).$$

    With no restrictions on any of the parameters, we can maximize this log-likelihood by maximizing its parts separately. There are two multivariate regression systems defined by the two parts, and they have no parameters in common. Because $\Pi_{21}$, $\Pi_{22}$, $\Sigma_{21}$, and $\Sigma_{22}$ are all free, unrestricted parameters, there are no restrictions imposed on $\Lambda_{21}$, $\Lambda_{22}$, $\Gamma$, or $\Omega_{22}$. Therefore, in each case, the efficient estimators are equation-by-equation ordinary least squares. The first part produces estimates of $\Pi_{11}$, $\Pi_{12}$, and $\Sigma_{11}$ directly. From the second, we would obtain estimates of $\Lambda_{21}$, $\Lambda_{22}$, $\Gamma$, and $\Omega_{22}$. But it is easy to see in the relationships above how the original parameters can be obtained from these mixtures:

    $$\Pi_{21} = \Lambda_{21} + \Gamma\Pi_{11}, \qquad \Pi_{22} = \Lambda_{22} + \Gamma\Pi_{12}, \qquad \Sigma_{21} = \Gamma\Sigma_{11}, \qquad \Sigma_{22} = \Omega_{22} + \Gamma\Sigma_{11}\Gamma'.$$

    Because of the invariance of maximum likelihood estimators to transformation, these derived estimators of the original parameters are also maximum likelihood estimators. Thus, the result we have up to this point is that by manipulating this pair of sets of ordinary least squares estimators, we can obtain the original least squares, efficient estimators. This result is no surprise, of course, since we have just rearranged the original system and we are just rearranging our least squares estimators.

    Now, consider estimation of the same system subject to the restriction $\Pi_{12} = 0$. The second equation system is still completely unrestricted, so maximum likelihood estimates of its parameters, $\Lambda_{21}$, $\Lambda_{22}$ (which now equals $\Pi_{22}$), $\Gamma$, and $\Omega_{22}$, are still obtained by equation-by-equation least squares. The equation systems have no parameters in common, so maximum likelihood estimators of the first set of parameters are obtained by maximizing the first part of the log-likelihood, once again, by equation-by-equation ordinary least squares. Thus, our first result is established. To establish the second result, we must obtain the two parts of the log-likelihood. The log-likelihood function for this model is given in (14-20). Since each of the two sets of equations is estimated by least squares, in each case (null and alternative), for each part, the term in the log-likelihood is the concentrated log-likelihood given in (14-22), where $W_{jj}$ is $(1/T)$ times the matrix of sums of squares and cross products of least squares residuals. The second set of equations is estimated by regressions on $x_1$, $x_2$, and $y_1$ with or without the restriction $\Pi_{12} = 0$. So, the second part of the log-likelihood is always the same,

    $$\ln L_{2c} = -\frac{T}{2}\left[M_2(1 + \ln 2\pi) + \ln|W_{22}|\right].$$

    The concentrated log-likelihood for the first set of equations equals

    $$\ln L_{1c} = -\frac{T}{2}\left[M_1(1 + \ln 2\pi) + \ln|W_{11}|\right]$$

    when $x_2$ is included in the equations, and the same with $W_{11}(\Pi_{12} = 0)$ when $x_2$ is excluded. At the maximum likelihood estimators, the log-likelihood for the whole system is

    $$\ln L_c = \ln L_{1c} + \ln L_{2c}.$$

    The likelihood ratio statistic is

    $$\lambda = -2\left[(\ln L_c \mid \Pi_{12} = 0) - (\ln L_c)\right] = T\left[\ln|W_{11}(\Pi_{12} = 0)| - \ln|W_{11}|\right].$$

    This establishes our second result, since $W_{11}$ is based only on the first set of equations.

    The block of zeros case was analyzed by Goldberger (1970). Many regression systems in which the result might have proved useful (e.g., systems of demand equations) imposed cross-equation equality (symmetry) restrictions, so the result of the analysis was often derailed. Goldberger's result, however, is precisely what is needed in the more recent application of testing for Granger causality in the context of vector autoregressions. We will return to the issue in Section 19.6.5.
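    Since the likelihood ratio statistic for the block of zeros involves only the first set of equations, it can be computed from two least squares fits of $y_1$, one on $x_1$ alone and one on $x_1$ and $x_2$. The sketch below is an illustration under that result; the function name is hypothetical, and the degrees of freedom are taken to be the number of excluded coefficients, $M_1 K_2$.

```python
import numpy as np

def block_zero_lr(Y1, X1, X2):
    """LR statistic T[ln|W11(restricted)| - ln|W11|] for excluding X2 from the Y1 equations.

    Y1: T x M1 dependent variables; X1: T x K1 retained regressors; X2: T x K2 excluded ones.
    """
    T = Y1.shape[0]

    def logdet_w(X):
        coef, *_ = np.linalg.lstsq(X, Y1, rcond=None)   # equation-by-equation OLS
        E = Y1 - X @ coef
        return np.linalg.slogdet(E.T @ E / T)[1]

    lr = T * (logdet_w(X1) - logdet_w(np.column_stack([X1, X2])))
    df = Y1.shape[1] * X2.shape[1]                       # number of zero restrictions
    return lr, df
```

    Note that the second set of dependent variables never enters the computation, which is the practical content of the second result.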
    14.2.7  AUTOCORRELATION AND HETEROSCEDASTICITY

    The seemingly unrelated regressions model can be extended to allow for autocorrelation in the same fashion as in Section 13.9.5. To reiterate, suppose that

    $$y_i = X_i\beta_i + \varepsilon_i, \qquad \varepsilon_{it} = \rho_i\varepsilon_{i,t-1} + u_{it},$$

    where $u_{it}$ is uncorrelated across observations. This extension will imply that the blocks in $\Omega$ in (14-3), instead of $\sigma_{ij}I$, are $\sigma_{ij}\Omega_{ij}$, where $\Omega_{ij}$ is given in (13-63). The treatment developed by Parks (1967) is the one we used earlier.¹⁹ It calls for a three-step approach:

    1. Estimate each equation in the system by ordinary least squares. Compute any consistent estimators of $\rho$. For each equation, transform the data by the Prais-Winsten transformation to remove the autocorrelation.²⁰ Note that there will not be a constant term in the transformed data because there will be a column with $(1 - r_i^2)^{1/2}$ as the first observation and $(1 - r_i)$ for the remainder.
    2. Using the transformed data, use ordinary least squares again to estimate $\Sigma$.
    3. Use FGLS based on the estimated $\Sigma$ and the transformed data.

    There is no benefit to iteration. The estimator is efficient at every step, and iteration does not produce a maximum likelihood estimator because of the Jacobian term in the log-likelihood [see (12-30)]. After the last step, $\Sigma$ should be reestimated with the GLS estimates. The estimated covariance matrix for $\varepsilon$ can then be reconstructed using

    $$\hat\sigma_{mn}(\varepsilon) = \frac{\hat\sigma_{mn}}{1 - r_m r_n}.$$
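    The data-handling pieces of the three-step procedure are short. The sketch below shows a Prais-Winsten transformation, one consistent estimator of the autocorrelation (the first-order sample autocorrelation of the OLS residuals, which is a choice made here, since any consistent estimator may be used), and the reconstruction of the covariance of the original disturbances. All function names are hypothetical.

```python
import numpy as np

def ar1_coefficient(e):
    """First-order autocorrelation of a residual vector: sum e_t e_{t-1} / sum e_t^2."""
    return (e[1:] @ e[:-1]) / (e @ e)

def prais_winsten(v, r):
    """Transform a length-T series or T x K matrix using AR(1) coefficient r."""
    v = np.asarray(v, dtype=float)
    out = np.empty_like(v)
    out[0] = np.sqrt(1.0 - r ** 2) * v[0]   # first observation is rescaled, not dropped
    out[1:] = v[1:] - r * v[:-1]            # quasi-differences for t = 2, ..., T
    return out

def untransformed_cov(sigma_mn, r_m, r_n):
    """Covariance of the original disturbances: sigma_mn / (1 - r_m r_n)."""
    return sigma_mn / (1.0 - r_m * r_n)
```

    Applying `prais_winsten` to a column of ones reproduces the pattern described in step 1, with $(1 - r^2)^{1/2}$ in the first row and $(1 - r)$ elsewhere, which is why the transformed regression has no conventional constant term.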

    As in the single-equation case, opinions differ on the appropriateness of such corrections for autocorrelation. At one extreme is Mizon (1995), who argues forcefully that autocorrelation arises as a consequence of a remediable failure to include dynamic effects in the model. However, in a system of equations, the analysis that leads to this

    19. Guilkey and Schmidt (1973), Guilkey (1974), and Berndt and Savin (1977) present an alternative treatment based on $\varepsilon_t = R\varepsilon_{t-1} + u_t$, where $\varepsilon_t$ is the $M \times 1$ vector of disturbances at time $t$ and $R$ is a correlation matrix. Extensions and additional results appear in Moschino and Moro (1994), McLaren (1996), and Holt (1998).

    20. There is a complication with the first observation that is not treated quite correctly by this procedure. For details, see Judge et al. (1985, pp. 486-489). The strictly correct (and quite cumbersome) results are for the true GLS estimator, which assumes a known $\Omega$. It is unlikely that in a finite sample, anything is lost by using the Prais-Winsten procedure with the estimated $\Omega$. One suggestion has been to use the Cochrane-Orcutt procedure and drop the first observation. But in a small sample, the cost of discarding the first observation is almost surely greater than that of neglecting to account properly for the correlation of the first disturbance with the other first disturbances.


    TABLE 14.4  Autocorrelation Coefficients
    GM CH GE WE US

    Durbin Watson Autocorrelation GM CH GE WE US 1 2 3

    0 9375 0 531 6679 5 220 97 483 79 88 373 1381 6

    1 984 0 008

    1 0721 0 463

    1 413 0 294

    0 9091 0 545

    Residual Covariance Matrix i j 1 ri r j

    151 96 43 7891 19 964 342 89

    684 59 190 37 1484 10

    92 788 676 88

    8638 1 14 0207 96 49 0 16415 0 0386 0 2006 0 1428

    Parameter Estimates Standard Errors in Parentheses 51 337 0 4536 24 913 4 7091 80 62 11 86 25 67 6 510 0 094038 0 06847 0 04271 0 05091 0 01733 0 0174 0 01134 0 01060 0 040723 0 32041 0 10954 0 04284 0 04216 0 0258 0 03012 0 04127

    conclusion is going to be far more complex than in a single-equation model.²¹ Suffice to say, the issue remains to be settled conclusively.

    Example 14.3  Autocorrelation in a SUR Model

    Table 14.4 presents the autocorrelation-corrected estimates of the model of Example 14.2. The Durbin-Watson statistics for the five data sets given here, with the exception of Chrysler, strongly suggest that there is, indeed, autocorrelation in the disturbances. The differences between these and the uncorrected estimates given earlier are sometimes relatively large, as might be expected, given the fairly high autocorrelation and small sample size. The smaller diagonal elements in the disturbance covariance matrix compared with those of Example 14.2 reflect the improved fit brought about by introducing the lagged variables into the equation.

    In principle the SUR model can accommodate heteroscedasticity as well as autocorrelation Bartels and Feibig 1991 suggested the generalized SUR model A I A where A is a block diagonal matrix Ideally A is made a function of measured characteristics of the individual and a separate parameter vector so that the model can be estimated in stages In a rst step OLS residuals could be used to form a preliminary estimator of then the data are transformed to homoscedasticity leaving and to be estimated at subsequent steps using transformed data One application along these lines is the random parameters model of Feibig Bartels and Aigner 1991 13 46 shows how the random parameters model induces heteroscedasticity Another application is Mandy and Martins Filho who speci ed i j t i j zi j t The linear speci cation of a variance does present some problems as a negative value is not precluded Kumbhakar and Heshmati 1996 proposed a cost and demand
    21 Dynamic SUR models in the spirit of Mizon s admonition were proposed by Anderson and Blundell 1982

    A few recent applications are Kiviet Phillips and Schipp 1995 and Deschamps 1998 However relatively little work has been done with dynamic SUR models The VAR models in Chapter 20 are an important group of applications but they come from a different analytical framework


    system that combined the translog model of Section 14.3.2 with the complete equation system in 14.3.1. In their application, only the cost equation was specified to include a heteroscedastic disturbance.

    14.3 SYSTEMS OF DEMAND EQUATIONS: SINGULAR SYSTEMS

    Most of the recent applications of the multivariate regression model[22] have been in the context of systems of demand equations, either commodity demands or factor demands in studies of production.
    Example 14.4 Stone's Expenditure System

    Stone's expenditure system,[23] based on a set of logarithmic commodity demand equations, income $Y$, and commodity prices $p_n$, is

    $$\log q_i = \alpha_i + \eta_i \log\left(\frac{Y}{P}\right) + \sum_{j=1}^{M} \eta^*_{ij} \log\left(\frac{p_j}{P}\right),$$

    where $P$ is a generalized (share-weighted) price index, $\eta_i$ is an income elasticity, and $\eta^*_{ij}$ is a compensated price elasticity. We can interpret this system as the demand equation in real expenditure and real prices. The resulting set of equations constitutes an econometric model in the form of a set of seemingly unrelated regressions. In estimation, we must account for a number of restrictions, including homogeneity of degree one in income, $\sum_i \eta_i = 1$, and symmetry of the matrix of compensated price elasticities, $\eta^*_{ij} = \eta^*_{ji}$.

    Other examples include the system of factor demands and factor cost shares from production, which we shall consider again later. In principle, each is merely a particular application of the model of the previous section. But some special problems arise in these settings. First, the parameters of the systems are generally constrained across equations. That is, the unconstrained model is inconsistent with the underlying theory.[24] The numerous constraints in the system of demand equations presented earlier give an example. A second intrinsic feature of many of these models is that the disturbance covariance matrix $\Sigma$ is singular.
    [22] Note the distinction between the multivariate or multiple-equation model discussed here and the multiple regression model.

    [23] A very readable survey of the estimation of systems of commodity demands is Deaton and Muellbauer (1980). The example discussed here is taken from their Chapter 3 and the references to Stone's (1954a,b) work cited therein. A counterpart for production function modeling is Chambers (1988). Recent developments in the specification of systems of demand equations include Chavez and Segerson (1987), Brown and Walker (1995), and Fry, Fry, and McLaren (1996).

    [24] This inconsistency does not imply that the theoretical restrictions are not testable or that the unrestricted model cannot be estimated. Sometimes, the meaning of the model is ambiguous without the restrictions, however. Statistically rejecting the restrictions implied by the theory, which were used to derive the econometric model in the first place, can put us in a rather uncomfortable position. For example, in a study of utility functions, Christensen, Jorgenson, and Lau (1975), after rejecting the cross-equation symmetry of a set of commodity demands, stated, "With this conclusion, we can terminate the test sequence, since these results invalidate the theory of demand" (p. 380). See Silver and Ali (1989) for discussion of testing symmetry restrictions.

    14.3.1 COBB-DOUGLAS COST FUNCTION (EXAMPLE 7.3 CONTINUED)

    Consider a Cobb-Douglas production function,

    $$Y = \alpha_0 \prod_{i=1}^{M} x_i^{\alpha_i}.$$

    Profit maximization with an exogenously determined output price calls for the firm to maximize output for a given cost level $C$ (or minimize costs for a given output $Y$). The Lagrangean for the maximization problem is

    $$\Lambda = \alpha_0 \prod_{i=1}^{M} x_i^{\alpha_i} + \lambda (C - p'x),$$

    where $p$ is the vector of $M$ factor prices. The necessary conditions for maximizing this function are

    $$\frac{\partial \Lambda}{\partial x_i} = \frac{\alpha_i Y}{x_i} - \lambda p_i = 0 \quad \text{and} \quad \frac{\partial \Lambda}{\partial \lambda} = C - p'x = 0.$$

    The joint solution provides $x_i(Y, p)$ and $\lambda(Y, p)$. The total cost of production is

    $$\sum_{i=1}^{M} p_i x_i = \sum_{i=1}^{M} \frac{\alpha_i Y}{\lambda}.$$

    The cost share allocated to the $i$th factor is

    $$\frac{p_i x_i}{\sum_{i=1}^{M} p_i x_i} = \frac{\alpha_i}{\sum_{i=1}^{M} \alpha_i} = \beta_i. \tag{14-29}$$

    The full model is[25]

    $$\ln C = \beta_0 + \beta_y \ln Y + \sum_{i=1}^{M} \beta_i \ln p_i + \varepsilon_c, \tag{14-30}$$
    $$s_i = \beta_i + \varepsilon_i, \quad i = 1, \ldots, M.$$

    By construction, $\sum_{i=1}^{M} \beta_i = 1$ and $\sum_{i=1}^{M} s_i = 1$. (This is the cost function analysis begun in Example 7.3. We will return to that application below.) The cost shares will also sum identically to one in the data. It therefore follows that $\sum_{i=1}^{M} \varepsilon_i = 0$ at every data point, so the system is singular. For the moment, ignore the cost function. Let the $M \times 1$ disturbance vector from the shares be $\varepsilon = [\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_M]'$. Since $\varepsilon' i = 0$, where $i$ is a column of 1s, it follows that $E[\varepsilon \varepsilon' i] = \Sigma i = 0$, which implies that $\Sigma$ is singular. Therefore, the methods of the previous sections cannot be used here. (You should verify that the sample covariance matrix of the OLS residuals will also be singular.)

    The solution to the singularity problem appears to be to drop one of the equations, estimate the remainder, and solve for the last parameter from the other $M - 1$. The constraint $\sum_{i=1}^{M} \beta_i = 1$ states that the cost function must be homogeneous of degree one
    [25] We leave as an exercise the derivation of $\beta_0$, which is a mixture of all the parameters, and $\beta_y$, which equals $1/\sum_m \alpha_m$.


    in the prices, a theoretical necessity. If we impose the constraint

    $$\beta_M = 1 - \beta_1 - \beta_2 - \cdots - \beta_{M-1}, \tag{14-31}$$

    then the system is reduced to a nonsingular one:

    $$\log\left(\frac{C}{p_M}\right) = \beta_0 + \beta_y \log Y + \sum_{i=1}^{M-1} \beta_i \log\left(\frac{p_i}{p_M}\right) + \varepsilon_c,$$
    $$s_i = \beta_i + \varepsilon_i, \quad i = 1, \ldots, M - 1.$$

    This system provides estimates of $\beta_0$, $\beta_y$, and $\beta_1, \ldots, \beta_{M-1}$. The last parameter is estimated using (14-31). In principle, it is immaterial which factor is chosen as the numeraire. Unfortunately, the FGLS parameter estimates in the now nonsingular system will depend on which one is chosen. Invariance is achieved by using maximum likelihood estimates instead of FGLS,[26] which can be obtained by iterating FGLS or by direct maximum likelihood estimation.[27]

    Nerlove's (1963) study of the electric power industry that we examined in Example 7.3 provides an application of the Cobb-Douglas cost function model. His ordinary least squares estimates of the parameters were listed in Example 7.3. Among the results are (unfortunately) a negative capital coefficient in three of the six regressions. Nerlove also found that the simple Cobb-Douglas model did not adequately account for the relationship between output and average cost. Christensen and Greene (1976) further analyzed the Nerlove data and augmented the data set with cost share data to estimate the complete demand system. Appendix Table F14.2 lists Nerlove's 145 observations with Christensen and Greene's cost share data. Cost is the total cost of generation in millions of dollars, output is in millions of kilowatt-hours, the capital price is an index of construction costs, the wage rate is in dollars per hour for production and maintenance, the fuel price is an index of the cost per Btu of fuel purchased by the firms, and the data reflect the 1955 costs of production.

    The regression estimates are given in Table 14.5. Least squares estimates of the Cobb-Douglas cost function are given in the first column.[28] The coefficient on capital is negative. Because $\beta_i = \beta_y \, \partial \ln Y / \partial \ln x_i$, that is, a positive multiple of the output elasticity of the $i$th factor, this finding is troubling. The third column gives the maximum likelihood estimates obtained in the constrained system. Two things to note are the dramatically smaller standard errors and the now positive (and reasonable) estimate of the capital coefficient. The estimates of economies of scale in the basic Cobb-Douglas model are $1/\beta_y = 1.39$ (column 1) and 1.25 (column 3), which suggest some increasing returns to scale. Nerlove, however, had found evidence that at extremely large firm sizes, economies of scale diminished and eventually disappeared. To account for this (essentially a classical U-shaped average cost curve), he appended a quadratic term in log output in the cost function. The single-equation and maximum likelihood multivariate regression estimates are given in the second and fourth sets of results.
    [26] The invariance result is proved in Barten (1969).

    [27] Some additional results on the method are given by Revankar (1976).

    [28] Results based on Nerlove's full data set are given in Example 7.3. We have recomputed the values given in Table 14.5. Note that Nerlove used base 10 logs while we have used natural logs in our computations.
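The singularity argument of this section can be checked numerically. The sketch below is only an illustration under stated assumptions: the share parameters, noise scale, and sample size are hypothetical, and each share equation contains only a constant, so OLS reduces to a sample mean. It shows that when shares sum to one at every observation, the residual covariance matrix annihilates a column of ones, and that dropping one equation leaves a full-rank system.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M = 200, 3
beta = np.array([0.4, 0.1, 0.5])          # hypothetical share parameters, sum to one

# Build disturbances that sum to zero across equations at every observation.
raw = rng.normal(scale=0.02, size=(T, M))
eps = raw - raw.mean(axis=1, keepdims=True)
shares = beta + eps                        # each row sums to one

# OLS of each share on a constant is just the sample mean.
b = shares.mean(axis=0)
resid = shares - b
S = resid.T @ resid / T                    # sample covariance of OLS residuals

ones = np.ones(M)
S_times_ones = S @ ones                    # should be (numerically) a zero vector
rank_full = np.linalg.matrix_rank(S, tol=1e-12)
rank_dropped = np.linalg.matrix_rank(S[:M - 1, :M - 1], tol=1e-12)

# The dropped parameter is recovered from the adding-up constraint.
b_last = 1.0 - b[:M - 1].sum()

print(S_times_ones, rank_full, rank_dropped, b_last)
```

With the last equation dropped, the remaining block of the covariance matrix can be inverted, which is what FGLS or maximum likelihood on the reduced system requires.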


    TABLE 14.5 Regression Estimates (Standard Errors in Parentheses)

                   Ordinary Least Squares                  Multivariate Regression
    β_0        4.686    (0.885)     3.764   (0.702)      7.281   (0.104)      5.962   (0.161)
    β_q        0.721    (0.0174)    0.153   (0.0618)     0.798   (0.0147)     0.303   (0.0570)
    β_qq                            0.0505  (0.00536)                         0.0414  (0.00493)
    β_k       -0.00847  (0.191)     0.0739  (0.150)      0.424   (0.00945)    0.424   (0.00943)
    β_l        0.594    (0.205)     0.481   (0.161)      0.106   (0.00380)    0.106   (0.00380)
    β_f        0.414    (0.0989)    0.445   (0.0777)     0.470   (0.0100)     0.470   (0.0100)
    R²         0.9516               0.9581
    Log|W|                                               12.6726              13.02248

    FIGURE 14.3 Predicted and Actual Average Costs. [Plot of the estimated average cost function: fitted and actual unit cost (vertical axis) against output in MWH (horizontal axis, 0 to 20,000).]

    The quadratic output term gives the cost function the expected U-shape. We can determine the point where average cost reaches its minimum by equating $\partial \ln C / \partial \ln q$ to 1. This is $q^* = \exp[(1 - \beta_q)/(2\beta_{qq})]$. For the multivariate regression, this value is $q^* = 4527$. About 85 percent of the firms in the sample had output less than this, so by these estimates, most firms in the sample had not yet exhausted the available economies of scale. Figure 14.3 shows predicted and actual average costs for the sample. (In order to obtain a reasonable scale, the smallest one-third of the firms are omitted from the figure.) Predicted average costs are computed at the sample averages of the input prices. The figure does reveal that beyond a quite small scale, the economies of scale, while perhaps statistically significant, are economically quite small.
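As a quick arithmetic check on the numbers quoted above, the following sketch recomputes the scale-economy measures and the cost-minimizing output from the coefficient estimates in Table 14.5. The variable names are illustrative; the only inputs are the reported estimates of the output coefficients.

```python
import math

# Output coefficients as reported in Table 14.5.
beta_q_ols = 0.721        # Cobb-Douglas, single-equation OLS (column 1)
beta_q_ml = 0.798         # Cobb-Douglas, multivariate regression (column 3)
beta_q_quad = 0.303       # quadratic model, multivariate regression (column 4)
beta_qq_quad = 0.0414

# Economies of scale in the basic Cobb-Douglas model: 1 / beta_y.
scale_ols = 1.0 / beta_q_ols
scale_ml = 1.0 / beta_q_ml

# Average cost is minimized where d ln C / d ln q = beta_q + 2 beta_qq ln q = 1.
q_star = math.exp((1.0 - beta_q_quad) / (2.0 * beta_qq_quad))

print(round(scale_ols, 2), round(scale_ml, 2), round(q_star))
```

Because the estimates are reported to only three or four digits, the recomputed output level agrees with the text only up to rounding.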

    14.3.2 FLEXIBLE FUNCTIONAL FORMS: THE TRANSLOG COST FUNCTION

    The literatures on production and cost and on utility and demand have evolved in several directions. In the area of models of producer behavior, the classic paper by Arrow et al. (1961) called into question the inherent restriction of the Cobb-Douglas model that all elasticities of factor substitution are equal to 1. Researchers have since developed numerous flexible functions that allow substitution to be unrestricted (i.e., not even constant).[29] Similar strands of literature have appeared in the analysis of commodity demands.[30] In this section, we examine in detail a model of production.

    Suppose that production is characterized by a production function, $Y = f(x)$. The solution to the problem of minimizing the cost of producing a specified output rate given a set of factor prices produces the cost-minimizing set of factor demands $x_i = x_i(Y, p)$. The total cost of production is given by the cost function,

    $$C = \sum_{i=1}^{M} p_i x_i(Y, p) = C(Y, p). \tag{14-32}$$

    If there are constant returns to scale, then it can be shown that $C = Y c(p)$ or $C/Y = c(p)$, where $c(p)$ is the unit or average cost function.[31] The cost-minimizing factor demands are obtained by applying Shephard's (1970) lemma, which states that if $C(Y, p)$ gives the minimum total cost of production, then the cost-minimizing set of factor demands is given by

    $$x_i^* = \frac{\partial C(Y, p)}{\partial p_i} = \frac{Y \, \partial c(p)}{\partial p_i}. \tag{14-33}$$

    Alternatively, by differentiating logarithmically, we obtain the cost-minimizing factor cost shares:

    $$s_i = \frac{\partial \log C(Y, p)}{\partial \log p_i} = \frac{p_i x_i}{C}. \tag{14-34}$$

    With constant returns to scale, $\log C(Y, p) = \log Y + \log c(p)$, so

    $$s_i = \frac{\partial \log c(p)}{\partial \log p_i}. \tag{14-35}$$

    [29] See, in particular, Berndt and Christensen (1973). Two useful surveys of the topic are Jorgenson (1983) and Diewert (1974).

    [30] See, for example, Christensen, Jorgenson, and Lau (1975) and two surveys, Deaton and Muellbauer (1980) and Deaton (1983). Berndt (1990) contains many useful results.

    [31] The Cobb-Douglas function of the previous section gives an illustration. The restriction of constant returns to scale is $\beta_y = 1$, which is equivalent to $C = Y c(p)$. Nerlove's more general version of the cost function allows nonconstant returns to scale. See Christensen and Greene (1976) and Diewert (1974) for some of the formalities of the cost function and its relationship to the structure of production.


    In many empirical studies, the objects of estimation are the elasticities of factor substitution and the own price elasticities of demand, which are given by

    $$\theta_{ij} = \frac{c \, (\partial^2 c / \partial p_i \partial p_j)}{(\partial c / \partial p_i)(\partial c / \partial p_j)} \quad \text{and} \quad \eta_{ii} = s_i \theta_{ii}.$$

    By suitably parameterizing the cost function (14-32) and the cost shares (14-33), we obtain an $M$ or $M + 1$ equation econometric model that can be used to estimate these quantities.[32]

    The transcendental logarithmic, or translog, function is the most frequently used flexible function in empirical work.[33] By expanding $\log c(p)$ in a second-order Taylor series about the point $\log p = 0$, we obtain

    $$\log c \approx \beta_0 + \sum_{i=1}^{M} \left(\frac{\partial \log c}{\partial \log p_i}\right) \log p_i + \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} \left(\frac{\partial^2 \log c}{\partial \log p_i \, \partial \log p_j}\right) \log p_i \log p_j, \tag{14-36}$$

    where all derivatives are evaluated at the expansion point. If we identify these derivatives as coefficients and impose the symmetry of the cross-price derivatives, then the cost function becomes

    $$\log c = \beta_0 + \beta_1 \log p_1 + \cdots + \beta_M \log p_M + \delta_{11}\left(\tfrac{1}{2} \log^2 p_1\right) + \delta_{12} \log p_1 \log p_2 + \delta_{22}\left(\tfrac{1}{2} \log^2 p_2\right) + \cdots + \delta_{MM}\left(\tfrac{1}{2} \log^2 p_M\right). \tag{14-37}$$

    This is the translog cost function. If $\delta_{ij}$ equals zero, then it reduces to the Cobb-Douglas function we looked at earlier. The cost shares are given by

    $$s_1 = \frac{\partial \log c}{\partial \log p_1} = \beta_1 + \delta_{11} \log p_1 + \delta_{12} \log p_2 + \cdots + \delta_{1M} \log p_M,$$
    $$s_2 = \frac{\partial \log c}{\partial \log p_2} = \beta_2 + \delta_{12} \log p_1 + \delta_{22} \log p_2 + \cdots + \delta_{2M} \log p_M, \tag{14-38}$$
    $$\vdots$$
    $$s_M = \frac{\partial \log c}{\partial \log p_M} = \beta_M + \delta_{1M} \log p_1 + \delta_{2M} \log p_2 + \cdots + \delta_{MM} \log p_M.$$

    [32] The cost function is only one of several approaches to this study. See Jorgenson (1983) for a discussion.

    [33] See Example 2.4. The function was developed by Kmenta (1967) as a means of approximating the CES production function and was introduced formally in a series of papers by Berndt, Christensen, Jorgenson, and Lau, including Berndt and Christensen (1973) and Christensen et al. (1975). The literature has produced something of a competition in the development of exotic functional forms. The translog function has remained the most popular, however, and by one account (Guilkey, Lovell, and Sickles (1983)) is the most reliable of several available alternatives. See also Example 6.2.


    The cost shares must sum to 1, which requires, in addition to the symmetry restrictions already imposed,

    $$\beta_1 + \beta_2 + \cdots + \beta_M = 1,$$
    $$\sum_{i=1}^{M} \delta_{ij} = 0 \quad \text{(column sums equal zero)}, \tag{14-39}$$
    $$\sum_{j=1}^{M} \delta_{ij} = 0 \quad \text{(row sums equal zero)}.$$

    The system of share equations provides a seemingly unrelated regressions model that can be used to estimate the parameters of the model.[34] To make the model operational, we must impose the restrictions in (14-39) and solve the problem of singularity of the disturbance covariance matrix of the share equations. The first is accomplished by dividing the first $M - 1$ prices by the $M$th, thus eliminating the last term in each row and column of the parameter matrix. As in the Cobb-Douglas model, we obtain a nonsingular system by dropping the $M$th share equation. We compute maximum likelihood estimates of the parameters to ensure invariance with respect to the choice of which share equation we drop. For the translog cost function, the elasticities of substitution are particularly simple to compute once the parameters have been estimated:

    $$\theta_{ij} = \frac{\delta_{ij} + s_i s_j}{s_i s_j}, \qquad \theta_{ii} = \frac{\delta_{ii} + s_i (s_i - 1)}{s_i^2}. \tag{14-40}$$

    These elasticities will differ at every data point. It is common to compute them at some central point such as the means of the data.[35]
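To see how the restrictions fit together, the sketch below evaluates a three-factor translog unit cost function and its share equations. All parameter values and prices are hypothetical, chosen only so that the first-order coefficients sum to one and the second-order matrix is symmetric with zero row and column sums. Under those assumptions the shares sum to one at any price vector, and they agree with a numerical log-derivative of the cost function.

```python
import numpy as np

# Hypothetical translog parameters satisfying the adding-up and symmetry restrictions.
beta0 = 0.0
beta = np.array([0.2, 0.3, 0.5])                  # sums to one
delta = np.array([[ 0.05, -0.02, -0.03],
                  [-0.02,  0.06, -0.04],
                  [-0.03, -0.04,  0.07]])          # symmetric, rows and columns sum to zero

def log_cost(log_p):
    """Translog unit cost: beta0 + beta'log p + (1/2) log p' delta log p."""
    return beta0 + beta @ log_p + 0.5 * log_p @ delta @ log_p

def cost_shares(log_p):
    """Share equations: s = beta + delta log p."""
    return beta + delta @ log_p

log_p = np.log(np.array([1.2, 0.8, 1.5]))          # hypothetical factor prices
s = cost_shares(log_p)

# Central-difference check that the shares are the log-derivatives of cost.
h = 1e-6
s_numeric = np.array([
    (log_cost(log_p + h * e) - log_cost(log_p - h * e)) / (2 * h)
    for e in np.eye(3)
])

print(s, s.sum())
```

Because the shares always sum to one, any one share equation is redundant, which is the source of the singular disturbance covariance matrix discussed in the text.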
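    Example 14.5 A Cost Function for U.S. Manufacturing

    A number of recent studies using the translog methodology have used a four-factor model, with capital $K$, labor $L$, energy $E$, and materials $M$, the factors of production. Among the first studies to employ this methodology was Berndt and Wood's (1975) estimation of a translog cost function for the U.S. manufacturing sector. The three factor shares used to estimate the model are

    $$s_K = \beta_K + \delta_{KK} \log\left(\frac{p_K}{p_M}\right) + \delta_{KL} \log\left(\frac{p_L}{p_M}\right) + \delta_{KE} \log\left(\frac{p_E}{p_M}\right),$$
    $$s_L = \beta_L + \delta_{KL} \log\left(\frac{p_K}{p_M}\right) + \delta_{LL} \log\left(\frac{p_L}{p_M}\right) + \delta_{LE} \log\left(\frac{p_E}{p_M}\right),$$
    $$s_E = \beta_E + \delta_{KE} \log\left(\frac{p_K}{p_M}\right) + \delta_{LE} \log\left(\frac{p_L}{p_M}\right) + \delta_{EE} \log\left(\frac{p_E}{p_M}\right).$$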

    [34] The cost function may be included, if desired, which will provide an estimate of $\beta_0$ but is otherwise inessential. Absent the assumption of constant returns to scale, however, the cost function will contain parameters of interest that do not appear in the share equations. As such, one would want to include it in the model. See Christensen and Greene (1976) for an example.

    [35] They will also be highly nonlinear functions of the parameters and the data. A method of computing asymptotic standard errors for the estimated elasticities is presented in Anderson and Thursby (1986).


    TABLE 14.6 Parameter Estimates (Standard Errors in Parentheses)

    β_K      0.05690    (0.00134)        δ_KM    -0.0189     (0.00971)
    β_L      0.2534     (0.00210)        δ_LL     0.07542    (0.00676)
    β_E      0.0444     (0.00085)        δ_LE    -0.00476    (0.00234)
    β_M      0.6542     (0.00330)        δ_LM    -0.07061    (0.01059)
    δ_KK     0.02951    (0.00580)        δ_EE     0.01838    (0.00499)
    δ_KL    -0.000055   (0.00385)        δ_EM    -0.00299    (0.00799)
    δ_KE    -0.01066    (0.00339)        δ_MM     0.09237    (0.02247)

    TABLE 14.7 Estimated Elasticities

                     Capital     Labor      Energy     Materials

    Cost Shares for 1959
    Fitted share     0.05643     0.27451    0.04391    0.62515
    Actual share     0.06185     0.27303    0.04563    0.61948

    Implied Elasticities of Substitution
    Capital         -7.783
    Labor            0.9908     -1.643
    Energy          -3.230       0.6021   -12.19
    Materials        0.4581      0.5896     0.8834    -0.3623

    Implied Own Price Elasticities (s_m θ_mm)
                    -0.4392     -0.4510    -0.5353    -0.2265

    Berndt and Wood's data are reproduced in Appendix Table F14.1. Maximum likelihood estimates of the full set of parameters are given in Table 14.6.[36] The implied estimates of the elasticities of substitution and demand for 1959 (the central year in the data) are derived in Table 14.7 using the fitted cost shares. The departure from the Cobb-Douglas model with unit elasticities is substantial. For example, the results suggest almost no substitutability between energy and labor[37] and some complementarity between capital and energy.
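The elasticity formulas in (14-40) are easy to apply once the shares and second-order coefficients are in hand. The sketch below uses the fitted 1959 shares and the second-order estimates reported in Tables 14.6 and 14.7. One assumption should be flagged: the cross-price coefficients are entered with negative signs, which is what the zero row-sum restrictions in (14-39) imply for these magnitudes. Since the inputs are rounded, the results should land near the tabulated elasticities without matching every digit.

```python
import numpy as np

factors = ["K", "L", "E", "M"]
s = np.array([0.05643, 0.27451, 0.04391, 0.62515])   # fitted 1959 cost shares

# Second-order translog coefficients; negative cross terms are an assumption
# consistent with rows of delta summing (approximately) to zero.
delta = np.array([
    [ 0.02951,  -0.000055, -0.01066, -0.0189 ],
    [-0.000055,  0.07542,  -0.00476, -0.07061],
    [-0.01066,  -0.00476,   0.01838, -0.00299],
    [-0.0189,   -0.07061,  -0.00299,  0.09237],
])

ss = np.outer(s, s)
theta = (delta + ss) / ss                              # off-diagonal formula
own = (np.diag(delta) + s * (s - 1.0)) / s**2          # diagonal formula
np.fill_diagonal(theta, own)

eta_own = s * np.diag(theta)                           # own price elasticities

for name, row in zip(factors, theta):
    print(name, np.round(row, 3))
print("own price:", np.round(eta_own, 4))
```

A negative off-diagonal entry, as for capital and energy here, indicates complementarity rather than substitutability.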

    14.4 NONLINEAR SYSTEMS AND GMM ESTIMATION

    We now consider estimation of nonlinear systems of equations. The underlying theory is essentially the same as that for linear systems. We briefly consider two cases in this section, maximum likelihood (or FGLS) estimation and GMM estimation. Since the
    [36] These estimates are not the same as those reported by Berndt and Wood. To purge their data of possible correlation with the disturbances, they first regressed the prices on 10 exogenous macroeconomic variables, such as U.S. population, government purchases of labor services, real exports of durable goods, and U.S. tangible capital stock, and then based their analysis on the fitted values. The estimates given here are, in general, quite close to those given by Berndt and Wood. For example, their estimates of the first five parameters are 0.0564, 0.2539, 0.0442, 0.6455, and 0.0254.

    [37] Berndt and Wood's estimate of $\theta_{EL}$ for 1959 is 0.64.


    theory is essentially that of Section 14.2.4, most of the following will describe practical aspects of estimation. Consider estimation of the parameters of the equation system

    $$y_1 = h_1(\beta, X) + \varepsilon_1,$$
    $$y_2 = h_2(\beta, X) + \varepsilon_2, \tag{14-41}$$
    $$\vdots$$
    $$y_M = h_M(\beta, X) + \varepsilon_M.$$

    There are $M$ equations in total, to be estimated with $t = 1, \ldots, T$ observations. There are $K$ parameters in the model. No assumption is made that each equation has its own parameter vector; we simply use some of or all the $K$ elements in $\beta$ in each equation. Likewise, there is a set of $T$ observations on each of $P$ independent variables $x_p$, $p = 1, \ldots, P$, some of or all that appear in each equation. For convenience, the equations are written generically in terms of the full $\beta$ and $X$. The disturbances are assumed to have zero means and contemporaneous covariance matrix $\Sigma$. We will leave the extension to autocorrelation for more advanced treatments.
    14.4.1 GLS ESTIMATION

    In the multivariate regression model, if $\Sigma$ is known, then the generalized least squares estimator of $\beta$ is the vector that minimizes the generalized sum of squares

    $$\varepsilon(\beta)' \Omega^{-1} \varepsilon(\beta) = \sum_{i=1}^{M} \sum_{j=1}^{M} \sigma^{ij} [y_i - h_i(\beta, X)]' [y_j - h_j(\beta, X)], \tag{14-42}$$

    where $\varepsilon(\beta)$ is an $MT \times 1$ vector of disturbances obtained by stacking the equations and $\Omega = \Sigma \otimes I$. [See (14-3).] As we did in Chapter 9, define the pseudoregressors as the derivatives of the $h(\beta, X)$ functions with respect to $\beta$. That is, linearize each of the equations. Then the first-order condition for minimizing this sum of squares is

    $$\frac{\partial \varepsilon(\beta)' \Omega^{-1} \varepsilon(\beta)}{\partial \beta} = -\sum_{i=1}^{M} \sum_{j=1}^{M} \sigma^{ij} \left[ 2 X_i^{0}(\beta)' \varepsilon_j(\beta) \right] = 0, \tag{14-43}$$

    where $\sigma^{ij}$ is the $ij$th element of $\Sigma^{-1}$ and $X_i^0(\beta)$ is a $T \times K$ matrix of pseudoregressors from the linearization of the $i$th equation. (See Section 9.2.3.) If any of the parameters in $\beta$ do not appear in the $i$th equation, then the corresponding column of $X_i^0(\beta)$ will be a column of zeros.

    This problem of estimation is doubly complex. In almost any circumstance, solution will require an iteration using one of the methods discussed in Appendix E. Second, of course, is that $\Sigma$ is not known and must be estimated. Remember that efficient estimation in the multivariate regression model does not require an efficient estimator of $\Sigma$, only a consistent one. Therefore, one approach would be to estimate the parameters of each equation separately using nonlinear least squares. This method will be inefficient if any of the equations share parameters, since that information will be ignored. But at this step, consistency is the objective, not efficiency. The resulting residuals can then be used


    to compute

    $$S = \frac{1}{T} E'E. \tag{14-44}$$

    The second step of FGLS is the solution of (14-43), which will require an iterative procedure once again and can be based on $S$ instead of $\Sigma$. With well-behaved pseudoregressors, this second-step estimator is fully efficient. Once again, the same theory used for FGLS in the linear, single-equation case applies here.[38] Once the FGLS estimator is obtained, the appropriate asymptotic covariance matrix is estimated with

    $$\text{Est.Asy.Var}[\hat{\beta}] = \left[ \sum_{i=1}^{M} \sum_{j=1}^{M} s^{ij} X_i^0(\hat{\beta})' X_j^0(\hat{\beta}) \right]^{-1}.$$

    There is a possible flaw in the strategy outlined above. It may not be possible to fit all the equations individually by nonlinear least squares. It is conceivable that identification of some of the parameters requires joint estimation of more than one equation. But as long as the full system identifies all parameters, there is a simple way out of this problem. Recall that all we need for our first step is a consistent set of estimators of the elements of $\beta$. It is easy to show that the preceding defines a GMM estimator (see Chapter 18). We can use this result to devise an alternative, simple strategy. The weighting of the sums of squares and cross products in (14-42) by $\sigma^{ij}$ produces an efficient estimator of $\beta$. Any other weighting based on some positive definite $A$ would produce consistent, although inefficient, estimates. At this step, though, efficiency is secondary, so the choice of $A = I$ is a convenient candidate. Thus, for our first step, we can find $\beta$ to minimize

    $$\varepsilon(\beta)' \varepsilon(\beta) = \sum_{i=1}^{M} [y_i - h_i(\beta, X)]' [y_i - h_i(\beta, X)] = \sum_{i=1}^{M} \sum_{t=1}^{T} [y_{it} - h_i(\beta, x_{it})]^2.$$

    (This estimator is just pooled nonlinear least squares, where the regression function varies across the sets of observations.) This step will produce the $\hat{\beta}$ we need to compute $S$.
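The two pieces of bookkeeping in this section, the residual covariance estimate $S$ and the generalized sum of squares, can be sketched in a few lines. The residual matrix below is simulated and purely hypothetical; in practice its columns would be the first-step residuals from each equation. The sketch confirms that the double-sum form of the criterion equals the stacked quadratic form with $\Omega = S \otimes I$.

```python
import numpy as np

rng = np.random.default_rng(1)
T, M = 50, 3
E = rng.normal(size=(T, M))            # hypothetical first-step residuals, one column per equation

S = E.T @ E / T                        # S = (1/T) E'E
S_inv = np.linalg.inv(S)

# Double-sum form: sum_i sum_j s^{ij} e_i' e_j
crit_sum = sum(
    S_inv[i, j] * (E[:, i] @ E[:, j])
    for i in range(M)
    for j in range(M)
)

# Stacked form: eps' (S^{-1} kron I_T) eps, with equations stacked one after another.
eps = E.T.reshape(-1)                  # MT vector: equation 1 first, then equation 2, ...
Omega_inv = np.kron(S_inv, np.eye(T))
crit_stacked = eps @ Omega_inv @ eps

print(crit_sum, crit_stacked)
```

When the criterion is evaluated at the same residuals used to build $S$, it equals $MT$ identically, so it is only informative when $S$ is held fixed while $\beta$ is varied in the second step.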
    14.4.2 MAXIMUM LIKELIHOOD ESTIMATION

    With normally distributed disturbances, the log-likelihood function for this model is still given by (14-18). Therefore, estimation of $\Sigma$ is done exactly as before, using the $S$ in (14-44). Likewise, the concentrated log-likelihood in (14-22) and the criterion function in (14-23) are unchanged. Therefore, one approach to maximum likelihood estimation is iterated FGLS, based on the results in Section 14.2.3. This method will require two levels of iteration, however, since for each estimated $\Sigma(\hat{\beta}_l)$, written as a function of the estimates of $\beta$ obtained at iteration $l$, a nonlinear, iterative solution is required to obtain $\hat{\beta}_{l+1}$. The iteration then returns to $S$. Convergence is based either on $S$ or $\hat{\beta}$; if one stabilizes, then the other will also. The advantage of direct maximum likelihood estimation that was discussed in Section 14.2.4 is lost here because of the nonlinearity of the regressions; there is no
    [38] Neither the nonlinearity nor the multiple equation aspect of this model brings any new statistical issues to the fore. By stacking the equations, we see that this model is simply a variant of the nonlinear regression model that we treated in Chapter 9 with the added complication of a nonscalar disturbance covariance matrix, which we analyzed in Chapter 10. The new complications are primarily practical.


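    convenient arrangement of parameters into a matrix $\Pi$. But a few practical aspects to formulating the criterion function and its derivatives that may be useful do remain. Estimation of the model in (14-41) might be slightly more convenient if each equation did have its own coefficient vector. Suppose then that there is one underlying parameter vector $\beta$ and that we formulate each equation as

    $$h_{it} = h_i[\gamma_i(\beta), x_{it}] + \varepsilon_{it}.$$

    Then the derivatives of the log-likelihood function are built up from

    $$\frac{\partial \ln |S(\gamma)|}{\partial \gamma_i} = d_i = -\frac{1}{T} \sum_{t=1}^{T} \left( \sum_{j=1}^{M} s^{ij} x_{it}^0(\gamma_i) \, e_{jt}(\gamma_j) \right), \quad i = 1, \ldots, M. \tag{14-45}$$

    It remains to impose the equality constraints that have been built into the model. Since each $\gamma_i$ is built up just by extracting elements from $\beta$, the relevant derivative with respect to $\beta$ is just a sum of those with respect to $\gamma$:

    $$\frac{\partial \ln L_c}{\partial \beta_k} = \sum_{i=1}^{M} \left[ \sum_{g=1}^{K_i} \frac{\partial \ln L_c}{\partial \gamma_{ig}} \mathbf{1}(\gamma_{ig} = \beta_k) \right],$$

    where $\mathbf{1}(\gamma_{ig} = \beta_k)$ equals 1 if $\gamma_{ig}$ equals $\beta_k$ and 0 if not. This derivative can be formulated fairly simply as follows. There are a total of $G = \sum_{i=1}^{M} K_i$ parameters in $\gamma$, but only $K < G$ underlying parameters in $\beta$. Define the matrix $F$ with $G$ rows and $K$ columns. Then let $F_{gj} = 1$ if $\gamma_g = \beta_j$ and 0 otherwise. Thus, there is exactly one 1 and $K - 1$ 0s in each row of $F$. Let $d$ be the $G \times 1$ vector of derivatives obtained by stacking $d_i$ from (14-45). Then

    $$\frac{\partial \ln L_c}{\partial \beta} = F'd.$$

    The Hessian is likewise computed as a simple sum of terms. We can construct it in blocks using

    $$H_{ij} = \frac{\partial^2 \ln L_c}{\partial \gamma_i \, \partial \gamma_j'} = -\sum_{t=1}^{T} s^{ij} x_{it}^0(\gamma_i) \, x_{jt}^0(\gamma_j)'.$$

    The asymptotic covariance matrix for $\hat{\beta}$ is once again a sum of terms:

    $$\text{Est.Asy.Var}[\hat{\beta}] = V = [-F' \hat{H} F]^{-1}.$$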
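The selection matrix $F$ described above is just an indicator map from the stacked equation-level parameters to the underlying parameter vector. The following sketch builds it for a hypothetical three-equation system in which each equation uses two of three underlying parameters; the index map, the stacked derivative vector, and the Hessian are all made-up values used only to show the mechanics of aggregating with $F$.

```python
import numpy as np

K = 3                                          # number of underlying parameters in beta

# Hypothetical map: for each equation, which element of beta each gamma_ig equals.
gamma_to_beta = [[0, 1], [0, 2], [1, 2]]
flat_index = [k for equation in gamma_to_beta for k in equation]
G = len(flat_index)                            # total number of equation-level parameters

# F has G rows and K columns, with a single 1 in each row.
F = np.zeros((G, K))
F[np.arange(G), flat_index] = 1.0

# Hypothetical stacked derivatives d (G x 1) and Hessian H (G x G).
d = np.array([0.5, -1.0, 0.25, 2.0, 1.5, -0.5])
H = -2.0 * np.eye(G)

grad_beta = F.T @ d                            # derivative with respect to beta
V = np.linalg.inv(-F.T @ H @ F)                # covariance matrix of the form [-F'HF]^{-1}

print(grad_beta)
print(V)
```

Premultiplying by $F'$ simply adds up the equation-level derivatives that refer to the same underlying parameter, which is why no separate derivation is needed for the constrained model.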
    14.4.3 GMM ESTIMATION

    All the preceding estimation techniques (including the linear models in the earlier sections of this chapter) can be obtained as GMM estimators. Suppose that in the general formulation of the model in (14-41), we allow for nonzero correlation between $x_{it}^0$ and $\varepsilon_{is}$. (It will not always be present, but we generalize the model to allow this correlation as a possibility.) Suppose as well that there are a set of instrumental variables $z_t$ such that

    $$E[z_t \varepsilon_{it}] = 0, \quad t = 1, \ldots, T \text{ and } i = 1, \ldots, M. \tag{14-46}$$


    (We could allow a separate set of instrumental variables for each equation, but it would needlessly complicate the presentation.) Under these assumptions, the nonlinear FGLS and ML estimators above will be inconsistent. But a relatively minor extension of the instrumental variables technique developed for the single equation case in Section 10.4 can be used instead. The sample analog to (14-46) is

    $$\frac{1}{T} \sum_{t=1}^{T} z_t [y_{it} - h_i(\beta, x_t)] = 0, \quad i = 1, \ldots, M.$$

    If we use this result for each equation in the system, one at a time, then we obtain exactly the GMM estimator discussed in Section 10.4. But in addition to the efficiency loss that results from not imposing the cross-equation constraints in $\gamma_i$, we would also neglect the correlation between the disturbances. Let

    $$\frac{1}{T} Z' \Omega_{ij} Z = E\left[ \frac{Z' \varepsilon_i \varepsilon_j' Z}{T} \right]. \tag{14-47}$$

    The GMM criterion for estimation in this setting is

    $$q = \sum_{i=1}^{M} \sum_{j=1}^{M} \left[ \frac{(y_i - h_i(\beta, X))' Z}{T} \right] \left[ \frac{Z' \Omega_{ij} Z}{T} \right]^{ij} \left[ \frac{Z' (y_j - h_j(\beta, X))}{T} \right] = \sum_{i=1}^{M} \sum_{j=1}^{M} \left[ \frac{\varepsilon_i(\beta)' Z}{T} \right] \left[ \frac{Z' \Omega_{ij} Z}{T} \right]^{ij} \left[ \frac{Z' \varepsilon_j(\beta)}{T} \right], \tag{14-48}$$

    where $[Z' \Omega_{ij} Z / T]^{ij}$ denotes the $ij$th block of the inverse of the matrix with the $ij$th block equal to $Z' \Omega_{ij} Z / T$. (This matrix is laid out in full in Section 15.6.3.)

    GMM estimation would proceed in several passes. To compute any of the variance parameters, we will require an initial consistent estimator of $\beta$. This step can be done with equation-by-equation nonlinear instrumental variables (see Section 10.2.4), although if equations have parameters in common, then a choice must be made as to which to use. At the next step, the familiar White or Newey-West technique is used to compute, block by block, the matrix in (14-47). Since it is based on a consistent estimator of $\beta$ (we assume), this matrix need not be recomputed. Now, with this result in hand, an iterative solution to the minimization problem in (14-48) can be sought, for example, using the methods of Appendix E. The first-order conditions are $\partial q / \partial \beta = 0$, that is,

    $$\sum_{i=1}^{M} \sum_{j=1}^{M} \left[ \frac{X_i^0(\beta)' Z}{T} \right] \left[ \frac{Z' W_{ij} Z}{T} \right]^{ij} \left[ \frac{Z' \varepsilon_j(\beta)}{T} \right] = 0. \tag{14-49}$$

    Note again that the blocks of the inverse matrix in the center are extracted from the larger constructed matrix after inversion. (This brief discussion might understate the complexity of the optimization problem in (14-48), but that is inherent in the procedure.) At completion, the asymptotic covariance matrix for the GMM estimator is estimated with

    $$V_{GMM} = \frac{1}{T} \left[ \sum_{i=1}^{M} \sum_{j=1}^{M} \left[ \frac{X_i^0(\beta)' Z}{T} \right] \left[ \frac{Z' W_{ij} Z}{T} \right]^{ij} \left[ \frac{Z' X_j^0(\beta)}{T} \right] \right]^{-1}.$$
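The block structure of the GMM criterion is easier to follow in code than in notation. The sketch below is a minimal illustration, not the full estimator: the instruments and first-step residuals are simulated placeholders, the weighting blocks use a White-type (heteroscedasticity-robust) estimate, and no optimization over $\beta$ is attempted. It shows that the blocks must be taken from the inverse of the full assembled matrix, and that the resulting double sum equals the usual stacked quadratic form.

```python
import numpy as np

rng = np.random.default_rng(2)
T, M, L = 200, 2, 3                     # observations, equations, instruments
Z = rng.normal(size=(T, L))             # hypothetical instrument matrix
E = rng.normal(size=(T, M))             # hypothetical first-step residuals

# Sample moments for each equation: Z'e_i / T
moments = [Z.T @ E[:, i] / T for i in range(M)]

def weight_block(i, j):
    """White-type estimate of block ij: (1/T) sum_t z_t z_t' e_it e_jt."""
    return (Z * (E[:, i] * E[:, j])[:, None]).T @ Z / T

# Assemble the full (ML x ML) matrix, invert it, then extract blocks of the inverse.
W = np.block([[weight_block(i, j) for j in range(M)] for i in range(M)])
W_inv = np.linalg.inv(W)

q_blocks = sum(
    moments[i] @ W_inv[i * L:(i + 1) * L, j * L:(j + 1) * L] @ moments[j]
    for i in range(M)
    for j in range(M)
)

stacked = np.concatenate(moments)
q_stacked = stacked @ W_inv @ stacked

print(q_blocks, q_stacked)
```

Inverting each block separately would give a different, and incorrect, weighting unless the off-diagonal blocks happened to be zero.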




    14.5 SUMMARY AND CONCLUSIONS

    This chapter has surveyed use of the seemingly unrelated regressions model. The SUR model is an application of the generalized regression model introduced in Chapter 10. The advantage of the SUR formulation is the rich variety of behavioral models that fit into this framework. We began with estimation and inference with the SUR model, treating it essentially as a generalized regression. The major difference between this set of results and the single equation model in Chapter 10 is practical. While the SUR model is, in principle, a single equation GR model with an elaborate covariance structure, special problems arise when we explicitly recognize its intrinsic nature as a set of equations linked by their disturbances. The major result for estimation at this step is the feasible GLS estimator. In spite of its apparent complexity, we can estimate the SUR model by a straightforward two-step GLS approach that is similar to the one we used for models with heteroscedasticity in Chapter 11. We also extended the SUR model to autocorrelation and heteroscedasticity, as in Chapters 11 and 12 for the single equation. Once again, the multiple equation nature of the model complicates these applications. Maximum likelihood is an alternative method that is useful for systems of demand equations.

    This chapter examined a number of applications of the SUR model. Much of the empirical literature in finance focuses on the capital asset pricing model, which we considered in Section 14.2.5. Section 14.2.6 developed an important result on estimating systems in which some equations are derived from the set by excluding some of the variables. The block of zeros case is useful in the VAR models used in causality testing in Section 19.6.5. Section 14.3 presented one of the most common recent applications of the seemingly unrelated regressions model, the estimation of demand systems. One of the signature features of this literature is the seamless transition from the theoretical models of optimization of consumers and producers to the sets of empirical demand equations derived from Roy's identity for consumers and Shephard's lemma for producers.

    Key Terms and Concepts

    Autocorrelation; Capital asset pricing model; Concentrated log-likelihood; Demand system; Exclusion restriction; Expenditure system; Feasible GLS; Flexible functional form; Generalized least squares; GMM estimator; Heteroscedasticity; Homogeneity restriction; Identical regressors; Invariance of MLE; Kronecker product; Lagrange multiplier statistic; Likelihood ratio statistic; Maximum likelihood; Multivariate regression; Seemingly unrelated regressions; Wald statistic

Exercises

1. A sample of 100 observations produces the following sample data:

$$\bar{y}_1 = 1, \quad \bar{y}_2 = 2, \quad \mathbf{y}_1'\mathbf{y}_1 = 150, \quad \mathbf{y}_2'\mathbf{y}_2 = 550, \quad \mathbf{y}_1'\mathbf{y}_2 = 260.$$

The underlying bivariate regression model is

$$y_1 = \mu + \varepsilon_1, \qquad y_2 = \mu + \varepsilon_2.$$

a. Compute the OLS estimate of $\mu$, and estimate the sampling variance of this estimator.
b. Compute the FGLS estimate of $\mu$ and the sampling variance of the estimator.

2. Consider estimation of the following two equation model:

$$y_1 = \beta_1 + \varepsilon_1, \qquad y_2 = \beta_2 x + \varepsilon_2.$$

A sample of 50 observations produces the following moment matrix:

          1     y1     y2      x
    1    50
    y1  150    500
    y2   50     40     90
    x   100     60     50    100

a. Write the explicit formula for the GLS estimator of $[\beta_1, \beta_2]$. What is the asymptotic covariance matrix of the estimator?
b. Derive the OLS estimator and its sampling variance in this model.
c. Obtain the OLS estimates of $\beta_1$ and $\beta_2$, and estimate the sampling covariance matrix of the two estimates. Use $n$ instead of $(n - 1)$ as the divisor to compute the estimates of the disturbance variances.
d. Compute the FGLS estimates of $\beta_1$ and $\beta_2$ and the estimated sampling covariance matrix.
e. Test the hypothesis that $\beta_2 = 1$.

3. The model

$$y_1 = \beta_1 x_1 + \varepsilon_1, \qquad y_2 = \beta_2 x_2 + \varepsilon_2$$

satisfies all the assumptions of the classical multivariate regression model. All variables have zero means. The following sample second-moment matrix is obtained from a sample of 20 observations:

         y1    y2    x1    x2
    y1   20     6     4     3
    y2    6    10     3     6
    x1    4     3     5     2
    x2    3     6     2    10

a. Compute the FGLS estimates of $\beta_1$ and $\beta_2$.
b. Test the hypothesis that $\beta_1 = \beta_2$.
c. Compute the maximum likelihood estimates of the model parameters.
d. Use the likelihood ratio test to test the hypothesis in part b.

4. Prove that in the model

$$\mathbf{y}_1 = \mathbf{X}_1\boldsymbol{\beta}_1 + \boldsymbol{\varepsilon}_1, \qquad \mathbf{y}_2 = \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}_2,$$

generalized least squares is equivalent to equation-by-equation ordinary least squares if $\mathbf{X}_1 = \mathbf{X}_2$. Does your result hold if it is also known that $\boldsymbol{\beta}_1 = \boldsymbol{\beta}_2$?

5. Consider the two equation system

$$y_1 = \beta_1 x_1 + \varepsilon_1, \qquad y_2 = \beta_2 x_2 + \beta_3 x_3 + \varepsilon_2.$$

Assume that the disturbance variances and covariance are known. Now suppose that the analyst of this model applies GLS but erroneously omits $x_3$ from the second equation. What effect does this specification error have on the consistency of the estimator of $\beta_1$?

6. Consider the system

$$y_1 = \alpha_1 + \beta x + \varepsilon_1, \qquad y_2 = \alpha_2 + \varepsilon_2.$$

The disturbances are freely correlated. Prove that GLS applied to the system leads to the OLS estimates of $\alpha_1$ and $\alpha_2$ but to a mixture of the least squares slopes in the regressions of $y_1$ and $y_2$ on $x$ as the estimator of $\beta$. What is the mixture? To simplify the algebra, assume (with no loss of generality) that $\bar{x} = 0$.

7. For the model

$$y_1 = \alpha_1 + \beta x + \varepsilon_1, \qquad y_2 = \alpha_2 + \varepsilon_2, \qquad y_3 = \alpha_3 + \varepsilon_3,$$

assume that $y_{i2} + y_{i3} = 1$ at every observation. Prove that the sample covariance matrix of the least squares residuals from the three equations will be singular, thereby precluding computation of the FGLS estimator. How could you proceed in this case?

8. Continuing the analysis of Section 14.3.2, we find that a translog cost function for one output and three factor inputs that does not impose constant returns to scale is
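$$
\begin{aligned}
\ln C = {} & \alpha + \beta_1 \ln p_1 + \beta_2 \ln p_2 + \beta_3 \ln p_3 \\
& + \delta_{11}\tfrac{1}{2}\ln^2 p_1 + \delta_{12}\ln p_1 \ln p_2 + \delta_{13}\ln p_1 \ln p_3 \\
& + \delta_{22}\tfrac{1}{2}\ln^2 p_2 + \delta_{23}\ln p_2 \ln p_3 + \delta_{33}\tfrac{1}{2}\ln^2 p_3 \\
& + \gamma_{y1}\ln Y \ln p_1 + \gamma_{y2}\ln Y \ln p_2 + \gamma_{y3}\ln Y \ln p_3 \\
& + \beta_y \ln Y + \beta_{yy}\tfrac{1}{2}\ln^2 Y + \varepsilon_c .
\end{aligned}
$$

The factor share equations are

$$
\begin{aligned}
S_1 &= \beta_1 + \delta_{11}\ln p_1 + \delta_{12}\ln p_2 + \delta_{13}\ln p_3 + \gamma_{y1}\ln Y + \varepsilon_1, \\
S_2 &= \beta_2 + \delta_{12}\ln p_1 + \delta_{22}\ln p_2 + \delta_{23}\ln p_3 + \gamma_{y2}\ln Y + \varepsilon_2, \\
S_3 &= \beta_3 + \delta_{13}\ln p_1 + \delta_{23}\ln p_2 + \delta_{33}\ln p_3 + \gamma_{y3}\ln Y + \varepsilon_3 .
\end{aligned}
$$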


[See Christensen and Greene (1976) for analysis of this model.]

a. The three factor shares must add identically to 1. What restrictions does this requirement place on the model parameters?
b. Show that the adding-up condition in (14-39) can be imposed directly on the model by specifying the translog model in $(C/p_3)$, $(p_1/p_3)$, and $(p_2/p_3)$ and dropping the third share equation. (See Example 14.5.) Notice that this reduces the number of free parameters in the model to 10.
c. Continuing Part b, the model as specified with the symmetry and equality restrictions has 15 parameters. By imposing the constraints, you reduce this number to 10 in the estimating equations. How would you obtain estimates of the parameters not estimated directly?

The remaining parts of this exercise will require specialized software. The EViews, TSP, Stata, or LIMDEP programs noted in the preface are four that could be used. All estimation is to be done using the data used in Section 14.3.1.

d. Estimate each of the three equations you obtained in Part b by ordinary least squares. Do the estimates appear to satisfy the cross-equation equality and symmetry restrictions implied by the theory?
e. Using the data in Section 14.3.1, estimate the full system of three equations (cost and the two independent shares), imposing the symmetry and cross-equation equality constraints.
f. Using your parameter estimates, compute the estimates of the elasticities in (14-40) at the means of the variables.
g. Use a likelihood ratio statistic to test the joint hypothesis that $\gamma_{yi} = 0$, $i = 1, 2, 3$. [Hint: Just drop the relevant variables from the model.]


15

SIMULTANEOUS EQUATIONS MODELS

15.1 INTRODUCTION

Although most of our work thus far has been in the context of single-equation models, even a cursory look through almost any economics textbook shows that much of the theory is built on sets, or systems, of relationships. Familiar examples include market equilibrium, models of the macroeconomy, and sets of factor or commodity demand equations. Whether one's interest is only in a particular part of the system or in the system as a whole, the interaction of the variables in the model will have important implications for both interpretation and estimation of the model's parameters. The implications of simultaneity for econometric estimation were recognized long before the apparatus discussed in this chapter was developed.[1] The subsequent research in the subject, continuing to the present, is among the most extensive in econometrics.

This chapter considers the issues that arise in interpreting and estimating multiple-equations models. Section 15.2 describes the general framework used for analyzing systems of simultaneous equations. Most of the discussion of these models centers on problems of estimation. But before estimation can even be considered, the fundamental question of whether the parameters of interest in the model are even estimable must be resolved. This problem of identification is discussed in Section 15.3. Sections 15.4 to 15.7 then discuss methods of estimation. Section 15.8 is concerned with specification tests. In Section 15.9, the special characteristics of dynamic models are examined.

15.2 FUNDAMENTAL ISSUES IN SIMULTANEOUS EQUATIONS MODELS

In this section, we describe the basic terminology and statistical issues in the analysis of simultaneous equations models. We begin with some simple examples and then present a general framework.

15.2.1 ILLUSTRATIVE SYSTEMS OF EQUATIONS

A familiar example of a system of simultaneous equations is a model of market equilibrium, consisting of the following:

$$
\begin{aligned}
\text{demand equation:} \quad & q_{d,t} = \alpha_1 p_t + \alpha_2 x_t + \varepsilon_{d,t}, \\
\text{supply equation:} \quad & q_{s,t} = \beta_1 p_t + \varepsilon_{s,t}, \\
\text{equilibrium condition:} \quad & q_{d,t} = q_{s,t} = q_t .
\end{aligned}
$$

[1] See, for example, Working (1926) and Haavelmo (1943).

These equations are structural equations in that they are derived from theory and each purports to describe a particular aspect of the economy.[2] Since the model is one of the joint determination of price and quantity, they are labeled jointly dependent or endogenous variables. Income $x$ is assumed to be determined outside of the model, which makes it exogenous. The disturbances are added to the usual textbook description to obtain an econometric model. All three equations are needed to determine the equilibrium price and quantity, so the system is interdependent. Finally, since an equilibrium solution for price and quantity in terms of income and the disturbances is, indeed, implied (unless $\alpha_1$ equals $\beta_1$), the system is said to be a complete system of equations. The completeness of the system requires that the number of equations equal the number of endogenous variables. As a general rule, it is not possible to estimate all the parameters of incomplete systems (although it may be possible to estimate some of them).

Suppose that interest centers on estimating the demand elasticity $\alpha_1$. For simplicity, assume that $\varepsilon_d$ and $\varepsilon_s$ are well behaved, classical disturbances with

$$
\begin{aligned}
& E[\varepsilon_{d,t} \mid x_t] = E[\varepsilon_{s,t} \mid x_t] = 0, \\
& E[\varepsilon_{d,t}^2 \mid x_t] = \sigma_d^2, \qquad E[\varepsilon_{s,t}^2 \mid x_t] = \sigma_s^2, \\
& E[\varepsilon_{d,t}\varepsilon_{s,t} \mid x_t] = E[\varepsilon_{d,t}x_t] = E[\varepsilon_{s,t}x_t] = 0 .
\end{aligned}
$$

All variables are mutually uncorrelated with observations at different time periods. Price, quantity, and income are measured in logarithms in deviations from their sample means. Solving the equations for $p$ and $q$ in terms of $x$, $\varepsilon_d$, and $\varepsilon_s$ produces the reduced form of the model

$$
\begin{aligned}
p &= \frac{\alpha_2 x}{\beta_1 - \alpha_1} + \frac{\varepsilon_d - \varepsilon_s}{\beta_1 - \alpha_1} = \pi_1 x + v_1, \\
q &= \frac{\beta_1\alpha_2 x}{\beta_1 - \alpha_1} + \frac{\beta_1\varepsilon_d - \alpha_1\varepsilon_s}{\beta_1 - \alpha_1} = \pi_2 x + v_2 .
\end{aligned}
\tag{15-1}
$$

(Note the role of the "completeness" requirement that $\alpha_1$ not equal $\beta_1$.)

It follows that $\operatorname{Cov}[p, \varepsilon_d] = \sigma_d^2/(\beta_1 - \alpha_1)$ and $\operatorname{Cov}[p, \varepsilon_s] = -\sigma_s^2/(\beta_1 - \alpha_1)$, so neither the demand nor the supply equation satisfies the assumptions of the classical regression model. The price elasticity of demand cannot be consistently estimated by least squares regression of $q$ on $x$ and $p$. This result is characteristic of simultaneous-equations models. Because the endogenous variables are all correlated with the disturbances, the least squares estimators of the parameters of equations with endogenous variables on the right-hand side are inconsistent.[3]

Suppose that we have a sample of $T$ observations on $p$, $q$, and $x$ such that

$$\operatorname{plim}\,(1/T)\,\mathbf{x}'\mathbf{x} = \sigma_x^2 .$$

Since least squares is inconsistent, we might instead use an instrumental variable estimator.[4] The only variable in the system that is not correlated with the disturbances is $x$.

[2] The distinction between structural and nonstructural models is sometimes drawn on this basis. See, for example, Cooley and LeRoy (1985).
[3] This failure of least squares is sometimes labeled simultaneous equations bias.
[4] See Section 5.4.

Consider, then, the IV estimator, $\hat{\beta}_1 = \mathbf{q}'\mathbf{x}/\mathbf{p}'\mathbf{x}$. This estimator has

$$
\operatorname{plim}\,\hat{\beta}_1 = \operatorname{plim}\,\frac{\mathbf{q}'\mathbf{x}/T}{\mathbf{p}'\mathbf{x}/T} = \frac{\sigma_x^2\,\beta_1\alpha_2/(\beta_1 - \alpha_1)}{\sigma_x^2\,\alpha_2/(\beta_1 - \alpha_1)} = \beta_1 .
$$

Evidently, the parameter of the supply curve can be estimated by using an instrumental variable estimator. In the least squares regression of $\mathbf{p}$ on $\mathbf{x}$, the predicted values are $\hat{\mathbf{p}} = (\mathbf{p}'\mathbf{x}/\mathbf{x}'\mathbf{x})\,\mathbf{x}$. It follows that in the instrumental variable regression the instrument is $\hat{\mathbf{p}}$. That is,

$$\hat{\beta}_1 = \frac{\hat{\mathbf{p}}'\mathbf{q}}{\hat{\mathbf{p}}'\mathbf{p}} .$$

Since $\hat{\mathbf{p}}'\mathbf{p} = \hat{\mathbf{p}}'\hat{\mathbf{p}}$, $\hat{\beta}_1$ is also the slope in a regression of $q$ on these predicted values. This interpretation defines the two-stage least squares estimator.

It would be desirable to use a similar device to estimate the parameters of the demand equation, but unfortunately, we have exhausted the information in the sample. Not only does least squares fail to estimate the demand equation, but without some further assumptions, the sample contains no other information that can be used. This example illustrates the problem of identification alluded to in the introduction to this chapter.

A second example is the following simple model of income determination.
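The inconsistency of least squares and the consistency of the IV estimator for the supply slope can be illustrated with a small simulation. The following is a minimal sketch, not part of the original text: the parameter values, the standard normal distributions, and the names `simulate_market` and `dot` are assumptions chosen only for illustration, and the simulated variables have zero population means rather than being expressed in deviations from sample means.

```python
import random

def simulate_market(T, alpha1, alpha2, beta1, seed=0):
    """Draw T equilibrium observations (p, q, x) from the reduced form (15-1).

    Income and both disturbances are independent standard normals
    (an assumption made for this illustration only).
    """
    if beta1 == alpha1:
        raise ValueError("completeness requires alpha1 != beta1")
    rng = random.Random(seed)
    p, q, x = [], [], []
    for _ in range(T):
        x_t = rng.gauss(0.0, 1.0)
        eps_d = rng.gauss(0.0, 1.0)
        eps_s = rng.gauss(0.0, 1.0)
        p_t = (alpha2 * x_t + eps_d - eps_s) / (beta1 - alpha1)
        q_t = beta1 * p_t + eps_s  # the equilibrium quantity lies on the supply curve
        p.append(p_t)
        q.append(q_t)
        x.append(x_t)
    return p, q, x

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(u * v for u, v in zip(a, b))

ALPHA1, ALPHA2, BETA1 = -1.0, 1.0, 0.5  # hypothetical structural values
p, q, x = simulate_market(50_000, ALPHA1, ALPHA2, BETA1)

ols_slope = dot(p, q) / dot(p, p)  # least squares of q on p: inconsistent for beta1
iv_slope = dot(q, x) / dot(p, x)   # IV estimator q'x / p'x: consistent for beta1

print(f"true beta1 = {BETA1}, OLS = {ols_slope:.3f}, IV = {iv_slope:.3f}")
```

With these particular values the covariance between price and the supply disturbance happens to offset the slope exactly, so the least squares slope settles near zero while the IV estimate settles near the true supply slope.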
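Example 15.1 A Small Macroeconomic Model

Consider the model,

$$
\begin{aligned}
\text{consumption:} \quad & c_t = \alpha_0 + \alpha_1 y_t + \alpha_2 c_{t-1} + \varepsilon_{t1}, \\
\text{investment:} \quad & i_t = \beta_0 + \beta_1 r_t + \beta_2 (y_t - y_{t-1}) + \varepsilon_{t2}, \\
\text{demand:} \quad & y_t = c_t + i_t + g_t .
\end{aligned}
$$

The model contains an autoregressive consumption function, an investment equation based on interest and the growth in output, and an equilibrium condition. The model determines the values of the three endogenous variables $c_t$, $i_t$, and $y_t$. This model is a dynamic model. In addition to the exogenous variables $r_t$ and $g_t$, it contains two predetermined variables, $c_{t-1}$ and $y_{t-1}$. These are obviously not exogenous, but with regard to the current values of the endogenous variables, they may be regarded as having already been determined. The deciding factor is whether or not they are uncorrelated with the current disturbances, which we might assume. The reduced form of this model is

$$
\begin{aligned}
A c_t &= \alpha_0(1 - \beta_2) + \beta_0\alpha_1 + \alpha_1\beta_1 r_t + \alpha_1 g_t + \alpha_2(1 - \beta_2) c_{t-1} - \alpha_1\beta_2 y_{t-1} + (1 - \beta_2)\varepsilon_{t1} + \alpha_1\varepsilon_{t2}, \\
A i_t &= \alpha_0\beta_2 + \beta_0(1 - \alpha_1) + \beta_1(1 - \alpha_1) r_t + \beta_2 g_t + \alpha_2\beta_2 c_{t-1} - \beta_2(1 - \alpha_1) y_{t-1} + \beta_2\varepsilon_{t1} + (1 - \alpha_1)\varepsilon_{t2}, \\
A y_t &= \alpha_0 + \beta_0 + \beta_1 r_t + g_t + \alpha_2 c_{t-1} - \beta_2 y_{t-1} + \varepsilon_{t1} + \varepsilon_{t2},
\end{aligned}
$$

where $A = 1 - \alpha_1 - \beta_2$. Note that the reduced form preserves the equilibrium condition.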
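A quick way to check a reduced form is to evaluate it at arbitrary values and confirm that the results satisfy every structural equation. The sketch below does this for the small macroeconomic model; the function names and all numerical values are hypothetical and serve only as an illustration.

```python
def reduced_form(params, r, g, c_lag, y_lag, e1=0.0, e2=0.0):
    """Evaluate the reduced form of the small macro model for (c_t, i_t, y_t)."""
    a0, a1, a2, b0, b1, b2 = params
    A = 1.0 - a1 - b2
    if abs(A) < 1e-12:
        raise ValueError("no reduced form exists when alpha1 + beta2 = 1")
    Ac = (a0 * (1 - b2) + b0 * a1 + a1 * b1 * r + a1 * g
          + a2 * (1 - b2) * c_lag - a1 * b2 * y_lag
          + (1 - b2) * e1 + a1 * e2)
    Ai = (a0 * b2 + b0 * (1 - a1) + b1 * (1 - a1) * r + b2 * g
          + a2 * b2 * c_lag - b2 * (1 - a1) * y_lag
          + b2 * e1 + (1 - a1) * e2)
    Ay = a0 + b0 + b1 * r + g + a2 * c_lag - b2 * y_lag + e1 + e2
    return Ac / A, Ai / A, Ay / A

def structural_residuals(params, c, i, y, r, g, c_lag, y_lag, e1=0.0, e2=0.0):
    """Left side minus right side of each structural equation (zero if satisfied)."""
    a0, a1, a2, b0, b1, b2 = params
    return (
        c - (a0 + a1 * y + a2 * c_lag + e1),          # consumption function
        i - (b0 + b1 * r + b2 * (y - y_lag) + e2),    # investment equation
        y - (c + i + g),                              # equilibrium condition
    )

# Hypothetical parameter values (alpha0, alpha1, alpha2, beta0, beta1, beta2)
params = (10.0, 0.6, 0.2, 5.0, -0.5, 0.1)
exog = dict(r=4.0, g=20.0, c_lag=50.0, y_lag=90.0, e1=0.3, e2=-0.7)

c, i, y = reduced_form(params, **exog)
residuals = structural_residuals(params, c, i, y, **exog)
print(c, i, y, residuals)
```

The third residual is the equilibrium condition, so a zero there is the numerical counterpart of the remark that the reduced form preserves it.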

The preceding two examples illustrate systems in which there are behavioral equations and equilibrium conditions. The latter are distinct in that even in an econometric model, they have no disturbances. Another model, which illustrates nearly all the concepts to be discussed in this chapter, is shown in the next example.

Example 15.2 Klein's Model I

A widely used example of a simultaneous equations model of the economy is Klein's (1950) Model I. The model may be written

$$
\begin{aligned}
C_t &= \alpha_0 + \alpha_1 P_t + \alpha_2 P_{t-1} + \alpha_3 (W_t^p + W_t^g) + \varepsilon_{1t} && \text{(consumption)}, \\
I_t &= \beta_0 + \beta_1 P_t + \beta_2 P_{t-1} + \beta_3 K_{t-1} + \varepsilon_{2t} && \text{(investment)}, \\
W_t^p &= \gamma_0 + \gamma_1 X_t + \gamma_2 X_{t-1} + \gamma_3 A_t + \varepsilon_{3t} && \text{(private wages)}, \\
X_t &= C_t + I_t + G_t && \text{(equilibrium demand)}, \\
P_t &= X_t - T_t - W_t^p && \text{(private profits)}, \\
K_t &= K_{t-1} + I_t && \text{(capital stock)}.
\end{aligned}
$$

The endogenous variables are each on the left-hand side of an equation and are labeled on the right. The exogenous variables are $G_t$ = government nonwage spending, $T_t$ = indirect business taxes plus net exports, $W_t^g$ = government wage bill, $A_t$ = time trend measured as years from 1931, and the constant term. There are also three predetermined variables: the lagged values of the capital stock, private profits, and total demand. The model contains three behavioral equations, an equilibrium condition, and two accounting identities. This model provides an excellent example of a small, dynamic model of the economy. It has also been widely used as a test ground for simultaneous-equations estimators. Klein estimated the parameters using data for 1921 to 1941. The data are listed in Appendix Table F15.1.
15.2.2 ENDOGENEITY AND CAUSALITY

The distinction between "exogenous" and "endogenous" variables in a model is a subtle and sometimes controversial complication. It is the subject of a long literature.[5] We have drawn the distinction in a useful economic fashion at a few points in terms of whether a variable in the model could reasonably be expected to vary "autonomously," independently of the other variables in the model. Thus, in a model of supply and demand, the weather variable in a supply equation seems obviously to be exogenous in a pure sense to the determination of price and quantity, whereas the current price clearly is "endogenous" by any reasonable construction. Unfortunately, this neat classification is of fairly limited use in macroeconomics, where almost no variable can be said to be truly exogenous in the fashion that most observers would understand the term. To take a common example, the estimation of consumption functions by ordinary least squares, as we did in some earlier examples, is usually treated as a respectable enterprise, even though most macroeconomic models (including the examples given here) depart from a consumption function in which income is exogenous. This departure has led analysts, for better or worse, to draw the distinction largely on statistical grounds.

The methodological development in the literature has produced some consensus on this subject. As we shall see, the definitions formalize the economic characterization we drew earlier. We will loosely sketch a few results here for purposes of our derivations to follow. The interested reader is referred to the literature (and forewarned of some challenging reading).

[5] See, for example, Zellner (1979), Sims (1977), Granger (1969), and especially Engle, Hendry, and Richard (1983).

Engle, Hendry, and Richard (1983) define a set of variables $\mathbf{x}_t$ in a parameterized model to be weakly exogenous if the full model can be written in terms of a marginal probability distribution for $\mathbf{x}_t$ and a conditional distribution for $\mathbf{y}_t \mid \mathbf{x}_t$ such that estimation of the parameters of the conditional distribution is no less efficient than estimation of the full set of parameters of the joint distribution. This case will be true if none of the parameters in the conditional distribution appears in the marginal distribution for $\mathbf{x}_t$. In the present context, we will need this sort of construction to derive reduced forms the way we did previously.

With reference to time-series applications (although the notion extends to cross sections as well), variables $\mathbf{x}_t$ are said to be predetermined in the model if $\mathbf{x}_t$ is independent of all subsequent structural disturbances $\boldsymbol{\varepsilon}_{t+s}$ for $s \ge 0$. Variables that are predetermined in a model can be treated, at least asymptotically, as if they were exogenous in the sense that consistent estimates can be obtained when they appear as regressors. We used this result in Chapters 5 and 12 as well, when we derived the properties of regressions containing lagged values of the dependent variable.

A related concept is Granger causality. Granger causality (a kind of statistical feedback) is absent when $f(\mathbf{x}_t \mid \mathbf{x}_{t-1}, \mathbf{y}_{t-1})$ equals $f(\mathbf{x}_t \mid \mathbf{x}_{t-1})$. The definition states that in the conditional distribution, lagged values of $\mathbf{y}_t$ add no information to explanation of movements of $\mathbf{x}_t$ beyond that provided by lagged values of $\mathbf{x}_t$ itself. This concept is useful in the construction of forecasting models. Finally, if $\mathbf{x}_t$ is weakly exogenous and if $\mathbf{y}_{t-1}$ does not Granger cause $\mathbf{x}_t$, then $\mathbf{x}_t$ is strongly exogenous.
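In applied work, the absence of Granger causality is commonly examined by comparing a regression of $x_t$ on its own lag with one that also includes the lag of $y_t$. The following is a deliberately simplified sketch of that comparison: it assumes one lag, scalar series, and no constant term, and the data generating process and the names `granger_f`, `ssr_one_regressor`, and `ssr_two_regressors` are hypothetical. A linear regression comparison of this kind addresses only the conditional mean, not the full conditional distribution in the definition.

```python
import random

def ssr_one_regressor(y, x1):
    """Sum of squared residuals from regressing y on x1 (no constant)."""
    b = sum(u * v for u, v in zip(x1, y)) / sum(u * u for u in x1)
    return sum((yi - b * xi) ** 2 for yi, xi in zip(y, x1))

def ssr_two_regressors(y, x1, x2):
    """Sum of squared residuals from regressing y on x1 and x2 (no constant)."""
    s11 = sum(u * u for u in x1)
    s22 = sum(u * u for u in x2)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s1y = sum(u * v for u, v in zip(x1, y))
    s2y = sum(u * v for u, v in zip(x2, y))
    det = s11 * s22 - s12 * s12
    b1 = (s22 * s1y - s12 * s2y) / det   # solve the 2x2 normal equations
    b2 = (s11 * s2y - s12 * s1y) / det
    return sum((yi - b1 * u - b2 * v) ** 2 for yi, u, v in zip(y, x1, x2))

def granger_f(target, other):
    """F statistic for H0: the lag of `other` adds nothing beyond the lag of `target`."""
    dep, own_lag, other_lag = target[1:], target[:-1], other[:-1]
    ssr_r = ssr_one_regressor(dep, own_lag)               # restricted model
    ssr_u = ssr_two_regressors(dep, own_lag, other_lag)   # unrestricted model
    n = len(dep)
    return (ssr_r - ssr_u) / (ssr_u / (n - 2))

# Hypothetical process in which y feeds into x but x does not feed into y.
rng = random.Random(1)
x, y = [0.0], [0.0]
for _ in range(2000):
    x_new = 0.3 * x[-1] + 0.8 * y[-1] + rng.gauss(0.0, 1.0)
    y_new = 0.5 * y[-1] + rng.gauss(0.0, 1.0)
    x.append(x_new)
    y.append(y_new)

f_y_to_x = granger_f(x, y)  # should be large: lagged y helps explain x
f_x_to_y = granger_f(y, x)  # should be small: lagged x does not help explain y
print(f"F (y -> x) = {f_y_to_x:.1f}, F (x -> y) = {f_x_to_y:.2f}")
```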
15.2.3 A GENERAL NOTATION FOR LINEAR SIMULTANEOUS EQUATIONS MODELS[6]

The structural form of the model is[7]

$$
\begin{aligned}
\gamma_{11} y_{t1} + \gamma_{21} y_{t2} + \cdots + \gamma_{M1} y_{tM} + \beta_{11} x_{t1} + \cdots + \beta_{K1} x_{tK} &= \varepsilon_{t1}, \\
\gamma_{12} y_{t1} + \gamma_{22} y_{t2} + \cdots + \gamma_{M2} y_{tM} + \beta_{12} x_{t1} + \cdots + \beta_{K2} x_{tK} &= \varepsilon_{t2}, \\
&\;\;\vdots \\
\gamma_{1M} y_{t1} + \gamma_{2M} y_{t2} + \cdots + \gamma_{MM} y_{tM} + \beta_{1M} x_{t1} + \cdots + \beta_{KM} x_{tK} &= \varepsilon_{tM} .
\end{aligned}
\tag{15-2}
$$

There are $M$ equations and $M$ endogenous variables, denoted $y_1, \ldots, y_M$. There are $K$ exogenous variables, $x_1, \ldots, x_K$, that may include predetermined values of $y_1, \ldots, y_M$ as well. The first element of $\mathbf{x}_t$ will usually be the constant, 1. Finally, $\varepsilon_{t1}, \ldots, \varepsilon_{tM}$ are the structural disturbances. The subscript $t$ will be used to index observations, $t = 1, \ldots, T$.

[6] We will be restricting our attention to linear models in this chapter. Nonlinear systems occupy another strand of literature in this area. Nonlinear systems bring forth numerous complications beyond those discussed here and are beyond the scope of this text. Gallant (1987), Gallant and Holly (1980), Gallant and White (1988), Davidson and MacKinnon (1993), and Wooldridge (2002) provide further discussion.
[7] For the present, it is convenient to ignore the special nature of lagged endogenous variables and treat them the same as the strictly exogenous variables.

In matrix terms, the system may be written

$$
[y_1 \;\; y_2 \;\; \cdots \;\; y_M]_t
\begin{bmatrix}
\gamma_{11} & \gamma_{12} & \cdots & \gamma_{1M} \\
\gamma_{21} & \gamma_{22} & \cdots & \gamma_{2M} \\
\vdots & \vdots & & \vdots \\
\gamma_{M1} & \gamma_{M2} & \cdots & \gamma_{MM}
\end{bmatrix}
+ [x_1 \;\; x_2 \;\; \cdots \;\; x_K]_t
\begin{bmatrix}
\beta_{11} & \beta_{12} & \cdots & \beta_{1M} \\
\beta_{21} & \beta_{22} & \cdots & \beta_{2M} \\
\vdots & \vdots & & \vdots \\
\beta_{K1} & \beta_{K2} & \cdots & \beta_{KM}
\end{bmatrix}
= [\varepsilon_1 \;\; \varepsilon_2 \;\; \cdots \;\; \varepsilon_M]_t
$$

or

$$\mathbf{y}_t'\Gamma + \mathbf{x}_t' B = \boldsymbol{\varepsilon}_t' .$$

Each column of the parameter matrices is the vector of coefficients in a particular equation, whereas each row applies to a specific variable.

The underlying theory will imply a number of restrictions on $\Gamma$ and $B$. One of the variables in each equation is labeled the dependent variable so that its coefficient in the model will be 1. Thus, there will be at least one "1" in each column of $\Gamma$. This normalization is not a substantive restriction. The relationship defined for a given equation will be unchanged if every coefficient in the equation is multiplied by the same constant. Choosing a "dependent variable" simply removes this indeterminacy. If there are any identities, then the corresponding columns of $\Gamma$ and $B$ will be completely known, and there will be no disturbance for that equation. Since not all variables appear in all equations, some of the parameters will be zero. The theory may also impose other types of restrictions on the parameter matrices.

If $\Gamma$ is an upper triangular matrix, then the system is said to be triangular. In this case, the model is of the form

$$
\begin{aligned}
y_{t1} &= f_1(\mathbf{x}_t) + \varepsilon_{t1}, \\
y_{t2} &= f_2(y_{t1}, \mathbf{x}_t) + \varepsilon_{t2}, \\
&\;\;\vdots \\
y_{tM} &= f_M(y_{t1}, y_{t2}, \ldots, y_{t,M-1}, \mathbf{x}_t) + \varepsilon_{tM} .
\end{aligned}
$$

The joint determination of the variables in this model is recursive. The first is completely determined by the exogenous factors. Then, given the first, the second is likewise determined, and so on.

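The solution of the system of equations determining $\mathbf{y}_t$ in terms of $\mathbf{x}_t$ and $\boldsymbol{\varepsilon}_t$ is the reduced form of the model,

$$
\mathbf{y}_t' = [x_1 \;\; x_2 \;\; \cdots \;\; x_K]_t
\begin{bmatrix}
\pi_{11} & \pi_{12} & \cdots & \pi_{1M} \\
\pi_{21} & \pi_{22} & \cdots & \pi_{2M} \\
\vdots & \vdots & & \vdots \\
\pi_{K1} & \pi_{K2} & \cdots & \pi_{KM}
\end{bmatrix}
+ [\nu_1 \;\; \cdots \;\; \nu_M]_t
= -\mathbf{x}_t' B\Gamma^{-1} + \boldsymbol{\varepsilon}_t'\Gamma^{-1}
= \mathbf{x}_t'\Pi + \mathbf{v}_t' .
$$

For this solution to exist, the model must satisfy the completeness condition for simultaneous equations systems: $\Gamma$ must be nonsingular.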
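Example 15.3 Structure and Reduced Form

For the small model in Example 15.1, $\mathbf{y}' = [c, i, y]$, $\mathbf{x}' = [1, r, g, c_{-1}, y_{-1}]$, and

$$
\Gamma = \begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \\ -\alpha_1 & -\beta_2 & 1 \end{bmatrix}, \qquad
B = \begin{bmatrix} -\alpha_0 & -\beta_0 & 0 \\ 0 & -\beta_1 & 0 \\ 0 & 0 & -1 \\ -\alpha_2 & 0 & 0 \\ 0 & \beta_2 & 0 \end{bmatrix},
$$

$$
\Gamma^{-1} = \frac{1}{\Delta}\begin{bmatrix} 1 - \beta_2 & \beta_2 & 1 \\ \alpha_1 & 1 - \alpha_1 & 1 \\ \alpha_1 & \beta_2 & 1 \end{bmatrix},
$$

$$
\Pi' = \frac{1}{\Delta}\begin{bmatrix}
\alpha_0(1 - \beta_2) + \beta_0\alpha_1 & \alpha_1\beta_1 & \alpha_1 & \alpha_2(1 - \beta_2) & -\beta_2\alpha_1 \\
\alpha_0\beta_2 + \beta_0(1 - \alpha_1) & \beta_1(1 - \alpha_1) & \beta_2 & \alpha_2\beta_2 & -\beta_2(1 - \alpha_1) \\
\alpha_0 + \beta_0 & \beta_1 & 1 & \alpha_2 & -\beta_2
\end{bmatrix},
$$

where $\Delta = 1 - \alpha_1 - \beta_2$. The completeness condition is that $\alpha_1$ and $\beta_2$ do not sum to one.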
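The matrix relationship $\Pi = -B\Gamma^{-1}$ can be checked against the closed-form reduced form coefficients of Example 15.1. The sketch below uses exact rational arithmetic so that the comparison involves no rounding; the parameter values and the helper names `mat_mul` and `mat_inv` are hypothetical, and the layout of $\Gamma$ and $B$ assumes the ordering $\mathbf{y}' = [c, i, y]$ and $\mathbf{x}' = [1, r, g, c_{-1}, y_{-1}]$ with one column per equation.

```python
from fractions import Fraction as Fr

def mat_mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mat_inv(A):
    """Invert a square matrix by Gauss-Jordan elimination (exact with Fractions)."""
    n = len(A)
    aug = [list(row) + [Fr(int(i == j)) for j in range(n)] for i, row in enumerate(A)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular: the completeness condition fails")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

a0, a1, a2 = Fr(10), Fr(3, 5), Fr(1, 5)    # hypothetical consumption parameters
b0, b1, b2 = Fr(5), Fr(-1, 2), Fr(1, 10)   # hypothetical investment parameters

# Columns are equations (consumption, investment, identity).
Gamma = [[Fr(1), Fr(0), Fr(-1)],   # row for c
         [Fr(0), Fr(1), Fr(-1)],   # row for i
         [-a1, -b2, Fr(1)]]        # row for y
B = [[-a0, -b0, Fr(0)],            # constant
     [Fr(0), -b1, Fr(0)],          # r
     [Fr(0), Fr(0), Fr(-1)],       # g
     [-a2, Fr(0), Fr(0)],          # lagged c
     [Fr(0), b2, Fr(0)]]           # lagged y

Pi = [[-v for v in row] for row in mat_mul(B, mat_inv(Gamma))]  # Pi = -B Gamma^{-1}

delta = 1 - a1 - b2
Pi_closed = [
    [(a0 * (1 - b2) + b0 * a1) / delta, (a0 * b2 + b0 * (1 - a1)) / delta, (a0 + b0) / delta],
    [a1 * b1 / delta, b1 * (1 - a1) / delta, b1 / delta],
    [a1 / delta, b2 / delta, 1 / delta],
    [a2 * (1 - b2) / delta, a2 * b2 / delta, a2 / delta],
    [-b2 * a1 / delta, -b2 * (1 - a1) / delta, -b2 / delta],
]
print(Pi == Pi_closed)
```

Setting the two marginal propensities so that they sum to one makes `mat_inv` raise, which is the completeness condition showing up as a singular $\Gamma$.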

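The structural disturbances are assumed to be randomly drawn from an $M$-variate distribution with

$$E[\boldsymbol{\varepsilon}_t \mid \mathbf{x}_t] = \mathbf{0} \quad \text{and} \quad E[\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_t' \mid \mathbf{x}_t] = \Sigma .$$

For the present, we assume that

$$E[\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_s' \mid \mathbf{x}_t, \mathbf{x}_s] = \mathbf{0}, \quad \forall\, t \ne s .$$

Later, we will drop this assumption to allow for heteroscedasticity and autocorrelation. It will occasionally be useful to assume that $\boldsymbol{\varepsilon}_t$ has a multivariate normal distribution, but we shall postpone this assumption until it becomes necessary. It may be convenient to retain the identities without disturbances as separate equations. If so, then one way to proceed with the stochastic specification is to place rows and columns of zeros in the appropriate places in $\Sigma$. It follows that the reduced form disturbances, $\mathbf{v}_t' = \boldsymbol{\varepsilon}_t'\Gamma^{-1}$, have

$$
E[\mathbf{v}_t \mid \mathbf{x}_t] = (\Gamma^{-1})'\mathbf{0} = \mathbf{0}, \qquad
E[\mathbf{v}_t\mathbf{v}_t' \mid \mathbf{x}_t] = (\Gamma^{-1})'\Sigma\Gamma^{-1} = \Omega .
$$

This implies that

$$\Sigma = \Gamma'\Omega\Gamma .$$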





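The preceding formulation describes the model as it applies to an observation $[\mathbf{y}', \mathbf{x}', \boldsymbol{\varepsilon}']_t$ at a particular point in time or in a cross section. In a sample of data, each joint observation will be one row in a data matrix,

$$
[\mathbf{Y} \;\; \mathbf{X} \;\; \mathbf{E}] =
\begin{bmatrix}
\mathbf{y}_1' & \mathbf{x}_1' & \boldsymbol{\varepsilon}_1' \\
\mathbf{y}_2' & \mathbf{x}_2' & \boldsymbol{\varepsilon}_2' \\
\vdots & \vdots & \vdots \\
\mathbf{y}_T' & \mathbf{x}_T' & \boldsymbol{\varepsilon}_T'
\end{bmatrix} .
$$

In terms of the full set of $T$ observations, the structure is

$$\mathbf{Y}\Gamma + \mathbf{X}B = \mathbf{E},$$

with

$$E[\mathbf{E} \mid \mathbf{X}] = \mathbf{0} \quad \text{and} \quad E[(1/T)\mathbf{E}'\mathbf{E} \mid \mathbf{X}] = \Sigma .$$

Under general conditions, we can strengthen this structure to

$$\operatorname{plim}\,[(1/T)\mathbf{E}'\mathbf{E}] = \Sigma .$$

An important assumption, comparable with the one made in Chapter 5 for the classical regression model, is

$$\operatorname{plim}\,(1/T)\mathbf{X}'\mathbf{X} = \mathbf{Q}, \text{ a finite positive definite matrix.} \tag{15-3}$$

We also assume that

$$\operatorname{plim}\,(1/T)\mathbf{X}'\mathbf{E} = \mathbf{0} . \tag{15-4}$$

This assumption is what distinguishes the predetermined variables from the endogenous variables. The reduced form is

$$\mathbf{Y} = \mathbf{X}\Pi + \mathbf{V}, \quad \text{where } \mathbf{V} = \mathbf{E}\Gamma^{-1} .$$

Combining the earlier results, we have

$$
\operatorname{plim}\,\frac{1}{T}
\begin{bmatrix} \mathbf{Y}' \\ \mathbf{X}' \\ \mathbf{V}' \end{bmatrix}
[\mathbf{Y} \;\; \mathbf{X} \;\; \mathbf{V}] =
\begin{bmatrix}
\Pi'\mathbf{Q}\Pi + \Omega & \Pi'\mathbf{Q} & \Omega \\
\mathbf{Q}\Pi & \mathbf{Q} & \mathbf{0} \\
\Omega & \mathbf{0}' & \Omega
\end{bmatrix} . \tag{15-5}
$$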

15.3 THE PROBLEM OF IDENTIFICATION

Solving the problem to be considered here, the identification problem, logically precedes estimation. We ask at this point whether there is any way to obtain estimates of the parameters of the model. We have in hand a certain amount of information upon which to base any inference about its underlying structure. If more than one theory is consistent with the same "data," then the theories are said to be observationally equivalent and there is no way of distinguishing them. The structure is said to be unidentified.[8]

[8] A useful survey of this issue is Hsiao (1983).

[FIGURE 15.1 Market Equilibria. Three panels, (a), (b), and (c), each with price P on the vertical axis and quantity Q on the horizontal axis; the curves are labeled S, S1 to S3 and D1 to D3.]

Example 15.4 Observational Equivalence[9]

The observed data consist of the market outcomes shown in Figure 15.1a. We have no knowledge of the conditions of supply and demand beyond our belief that the data represent equilibria. Unfortunately, parts (b) and (c) of Figure 15.1 both show structures, that is, true underlying supply and demand curves, which are consistent with the data in Figure 15.1a. With only the data in Figure 15.1a, we have no way of determining which of theories 15.1b or c is the right one. Thus, the structure underlying the data in Figure 15.1a is unidentified. To suggest where our discussion is headed, suppose that we add to the preceding the known fact that the conditions of supply were unchanged during the period over which the data were drawn. This rules out 15.1c and identifies 15.1b as the correct structure. Note how this scenario relates to Example 15.1 and to the discussion following that example.

[9] This example paraphrases the classic argument of Working (1926).

The identification problem is not one of sampling properties or the size of the sample. To focus ideas, it is even useful to suppose that we have at hand an infinite-sized sample of observations on the variables in the model. Now, with this sample and our prior theory, what information do we have? In the reduced form,

$$\mathbf{y}_t' = \mathbf{x}_t'\Pi + \mathbf{v}_t', \qquad E[\mathbf{v}_t\mathbf{v}_t' \mid \mathbf{x}_t] = \Omega,$$

the predetermined variables are uncorrelated with the disturbances. Thus, we can "observe"

$$
\begin{aligned}
\operatorname{plim}\,(1/T)\mathbf{X}'\mathbf{X} &= \mathbf{Q} \quad \text{[assumed; see (15-3)]}, \\
\operatorname{plim}\,(1/T)\mathbf{X}'\mathbf{Y} &= \operatorname{plim}\,(1/T)\mathbf{X}'(\mathbf{X}\Pi + \mathbf{V}) = \mathbf{Q}\Pi, \\
\operatorname{plim}\,(1/T)\mathbf{Y}'\mathbf{Y} &= \operatorname{plim}\,(1/T)(\Pi'\mathbf{X}' + \mathbf{V}')(\mathbf{X}\Pi + \mathbf{V}) = \Pi'\mathbf{Q}\Pi + \Omega .
\end{aligned}
$$

Therefore, $\Pi$, the matrix of reduced-form coefficients, is observable:

$$\Pi = \left[\operatorname{plim}\,\frac{\mathbf{X}'\mathbf{X}}{T}\right]^{-1}\left[\operatorname{plim}\,\frac{\mathbf{X}'\mathbf{Y}}{T}\right] .$$

This estimator is simply the equation-by-equation least squares regression of $\mathbf{Y}$ on $\mathbf{X}$. Since $\Pi$ is observable, $\Omega$ is also:

$$
\Omega = \operatorname{plim}\,\frac{\mathbf{Y}'\mathbf{Y}}{T} - \operatorname{plim}\left[\frac{\mathbf{Y}'\mathbf{X}}{T}\right]\left[\frac{\mathbf{X}'\mathbf{X}}{T}\right]^{-1}\left[\frac{\mathbf{X}'\mathbf{Y}}{T}\right] .
$$

This result should be recognized as the matrix of least squares residual variances and covariances. Therefore,

$\Pi$ and $\Omega$ can be estimated consistently by least squares regression of $\mathbf{Y}$ on $\mathbf{X}$.

The information in hand, therefore, consists of $\Pi$, $\Omega$, and whatever other nonsample information we have about the structure.[10] Now, can we deduce the structural parameters from the reduced form?

The correspondence between the structural and reduced-form parameters is the relationships

$$\Pi = -B\Gamma^{-1} \quad \text{and} \quad \Omega = E[\mathbf{v}\mathbf{v}'] = (\Gamma^{-1})'\Sigma\Gamma^{-1} .$$

If $\Gamma$ were known, then we could deduce $B$ as $-\Pi\Gamma$ and $\Sigma$ as $\Gamma'\Omega\Gamma$. It would appear, therefore, that our problem boils down to obtaining $\Gamma$, which makes sense. If $\Gamma$ were known, then we could rewrite (15-2), collecting the endogenous variables times their respective coefficients on the left-hand side of a regression, and estimate the remaining unknown coefficients on the predetermined variables by ordinary least squares.[11]

[10] We have not necessarily shown that this is all the information in the sample. In general, we observe the conditional distribution $f(\mathbf{y}_t \mid \mathbf{x}_t)$, which constitutes the likelihood for the reduced form. With normally distributed disturbances, this distribution is a function of $\Pi$, $\Omega$. (See Section 15.6.2.) With other distributions, other or higher moments of the variables might provide additional information. See, for example, Goldberger (1964, p. 311), Hausman (1983, pp. 402-403), and especially Riersøl (1950).
[11] This method is precisely the approach of the LIML estimator. See Section 15.5.5.

The identification question we will pursue can be posed as follows: We can "observe" the reduced form. We must deduce the structure from what we know about the reduced form. If there is more than one structure that can lead to the same reduced form, then we cannot say that we can "estimate the structure." Which structure would that be? Suppose that the "true" structure is $[\Gamma, B, \Sigma]$. Now consider a different structure, $\mathbf{y}'\tilde{\Gamma} + \mathbf{x}'\tilde{B} = \tilde{\boldsymbol{\varepsilon}}'$, that is obtained by postmultiplying the first structure by some nonsingular matrix $\mathbf{F}$. Thus, $\tilde{\Gamma} = \Gamma\mathbf{F}$, $\tilde{B} = B\mathbf{F}$, $\tilde{\boldsymbol{\varepsilon}}' = \boldsymbol{\varepsilon}'\mathbf{F}$. The reduced form that corresponds to this new structure is, unfortunately, the same as the one that corresponds to the old one:

$$\tilde{\Pi} = -\tilde{B}\tilde{\Gamma}^{-1} = -B\mathbf{F}\mathbf{F}^{-1}\Gamma^{-1} = \Pi,$$

and, in the same fashion, $\tilde{\Omega} = \Omega$. The false structure looks just like the true one, at least in terms of the information we have. Statistically, there is no way we can tell them apart. The structures are observationally equivalent.

Since $\mathbf{F}$ was chosen arbitrarily, we conclude that any nonsingular transformation of the original structure has the same reduced form. Any reason for optimism that we might have had should be abandoned. As the model stands, there is no means by which the structural parameters can be deduced from the reduced form. The practical implication is that if the only information that we have is the reduced-form parameters, then the structural model is not estimable. So how were we able to identify the models in the earlier examples? The answer is by bringing to bear our nonsample information, namely our theoretical restrictions. Consider the following examples.
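Example 15.5 Identification

Consider a market in which $q$ is quantity of $Q$, $p$ is price, and $z$ is the price of $Z$, a related good. We assume that $z$ enters both the supply and demand equations. For example, $Z$ might be a crop that is purchased by consumers and that will be grown by farmers instead of $Q$ if its price rises enough relative to $p$. Thus, we would expect $\alpha_2 > 0$ and $\beta_2 < 0$. So,

$$
\begin{aligned}
q_d &= \alpha_0 + \alpha_1 p + \alpha_2 z + \varepsilon_d && \text{(demand)}, \\
q_s &= \beta_0 + \beta_1 p + \beta_2 z + \varepsilon_s && \text{(supply)}, \\
q_d &= q_s = q && \text{(equilibrium)}.
\end{aligned}
$$

The reduced form is

$$
\begin{aligned}
q &= \frac{\alpha_1\beta_0 - \alpha_0\beta_1}{\alpha_1 - \beta_1} + \frac{\alpha_1\beta_2 - \alpha_2\beta_1}{\alpha_1 - \beta_1}\, z + \frac{\alpha_1\varepsilon_s - \beta_1\varepsilon_d}{\alpha_1 - \beta_1} = \pi_{11} + \pi_{21} z + \nu_q, \\
p &= \frac{\beta_0 - \alpha_0}{\alpha_1 - \beta_1} + \frac{\beta_2 - \alpha_2}{\alpha_1 - \beta_1}\, z + \frac{\varepsilon_s - \varepsilon_d}{\alpha_1 - \beta_1} = \pi_{12} + \pi_{22} z + \nu_p .
\end{aligned}
$$

With only four reduced-form coefficients and six structural parameters, it is obvious that there will not be a complete solution for all six structural parameters in terms of the four reduced parameters. Suppose, though, that it is known that $\beta_2 = 0$ (farmers do not substitute the alternative crop for this one). Then the solution for $\beta_1$ is $\pi_{21}/\pi_{22}$. After a bit of manipulation, we also obtain $\beta_0 = \pi_{11} - \pi_{12}\pi_{21}/\pi_{22}$. The restriction identifies the supply parameters. But this step is as far as we can go.

Now, suppose that income $x$, rather than $z$, appears in the demand equation. The revised model is

$$
\begin{aligned}
q &= \alpha_0 + \alpha_1 p + \alpha_2 x + \varepsilon_1, \\
q &= \beta_0 + \beta_1 p + \beta_2 z + \varepsilon_2 .
\end{aligned}
$$

The structure is now

$$
[q \;\; p]\begin{bmatrix} 1 & 1 \\ -\alpha_1 & -\beta_1 \end{bmatrix}
+ [1 \;\; x \;\; z]\begin{bmatrix} -\alpha_0 & -\beta_0 \\ -\alpha_2 & 0 \\ 0 & -\beta_2 \end{bmatrix}
= [\varepsilon_1 \;\; \varepsilon_2] .
$$

The reduced form is

$$
[q \;\; p] = [1 \;\; x \;\; z]
\begin{bmatrix}
(\alpha_1\beta_0 - \alpha_0\beta_1)/\Delta & (\beta_0 - \alpha_0)/\Delta \\
-\alpha_2\beta_1/\Delta & -\alpha_2/\Delta \\
\alpha_1\beta_2/\Delta & \beta_2/\Delta
\end{bmatrix}
+ [\nu_1 \;\; \nu_2],
$$

where $\Delta = (\alpha_1 - \beta_1)$. Every false structure has the same reduced form. But in the coefficient matrix,

$$
\tilde{B} = B\mathbf{F} = \begin{bmatrix}
-\alpha_0 f_{11} - \beta_0 f_{21} & -\alpha_0 f_{12} - \beta_0 f_{22} \\
-\alpha_2 f_{11} & -\alpha_2 f_{12} \\
-\beta_2 f_{21} & -\beta_2 f_{22}
\end{bmatrix},
$$

if $f_{12}$ is not zero, then the imposter will have income appearing in the supply equation, which our theory has ruled out. Likewise, if $f_{21}$ is not zero, then $z$ will appear in the demand equation, which is also ruled out by our theory. Thus, although all false structures have the same reduced form as the true one, the only one that is consistent with our theory (i.e., is admissible) and has coefficients of 1 on $q$ in both equations (examine $\Gamma\mathbf{F}$) is $\mathbf{F} = \mathbf{I}$. This transformation just produces the original structure.

The unique solutions for the structural parameters in terms of the reduced-form parameters are

$$
\begin{aligned}
\alpha_0 &= \pi_{11} - \pi_{12}\left(\frac{\pi_{31}}{\pi_{32}}\right), &
\beta_0 &= \pi_{11} - \pi_{12}\left(\frac{\pi_{21}}{\pi_{22}}\right), \\
\alpha_1 &= \frac{\pi_{31}}{\pi_{32}}, &
\beta_1 &= \frac{\pi_{21}}{\pi_{22}}, \\
\alpha_2 &= \pi_{22}\left(\frac{\pi_{21}}{\pi_{22}} - \frac{\pi_{31}}{\pi_{32}}\right), &
\beta_2 &= \pi_{32}\left(\frac{\pi_{31}}{\pi_{32}} - \frac{\pi_{21}}{\pi_{22}}\right) .
\end{aligned}
$$

The preceding discussion has considered two equivalent methods of establishing identifiability. If it is possible to deduce the structural parameters from the known reduced form parameters, then the model is identified. Alternatively, if it can be shown that no false structure is admissible, that is, satisfies the theoretical restrictions, then the model is identified.[12]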
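Both routes to identification can be traced numerically for the revised model of Example 15.5. The sketch below, which uses exact rational arithmetic, first confirms that a transformed ("false") structure has the same reduced form as the true one but violates the exclusion restrictions, and then recovers the structural parameters from the reduced form. The parameter values, the matrix `F`, and the helper names are hypothetical; the matrices follow the convention $\mathbf{y}'\Gamma + \mathbf{x}'B = \boldsymbol{\varepsilon}'$ with $\mathbf{y}' = [q, p]$ and $\mathbf{x}' = [1, x, z]$.

```python
from fractions import Fraction as Fr

def mat_mul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("singular matrix")
    return [[d / det, -b / det], [-c / det, a / det]]

def reduced_form(Gamma, B):
    """Reduced form coefficient matrix, Pi = -B Gamma^{-1}."""
    return [[-v for v in row] for row in mat_mul(B, inv2(Gamma))]

# Hypothetical "true" structure.
a0, a1, a2 = Fr(10), Fr(-2), Fr(3)   # demand: q = a0 + a1*p + a2*x + e1
b0, b1, b2 = Fr(1), Fr(1), Fr(-4)    # supply: q = b0 + b1*p + b2*z + e2
Gamma = [[Fr(1), Fr(1)], [-a1, -b1]]          # rows q, p; columns demand, supply
B = [[-a0, -b0], [-a2, Fr(0)], [Fr(0), -b2]]  # rows 1, x, z

# An arbitrary nonsingular transformation gives a false structure.
F = [[Fr(2), Fr(1)], [Fr(1), Fr(3)]]
Gamma_false, B_false = mat_mul(Gamma, F), mat_mul(B, F)

Pi = reduced_form(Gamma, B)
Pi_false = reduced_form(Gamma_false, B_false)
print("same reduced form:", Pi == Pi_false)
print("income in false supply equation:", B_false[1][1] != 0)
print("z in false demand equation:", B_false[2][0] != 0)

# Deduce the structure from Pi using the exclusion restrictions.
(p11, p12), (p21, p22), (p31, p32) = Pi
a1_hat, b1_hat = p31 / p32, p21 / p22
recovered = (p11 - p12 * a1_hat, a1_hat, p22 * (b1_hat - a1_hat),
             p11 - p12 * b1_hat, b1_hat, p32 * (a1_hat - b1_hat))
print("recovered structure:", recovered == (a0, a1, a2, b0, b1, b2))
```

The false structure is indistinguishable from the true one through $\Pi$ alone; it is only the zeros that theory places in $B$ that rule it out.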
    15 3 1 THE RANK AND ORDER CONDITIONS FOR IDENTIFICATION

    It is useful to summarize what we have determined thus far The unknown structural parameters consist of an M M nonsingular matrix B a K M parameter matrix an M M symmetric positive de nite matrix The known reduced form parameters are a K M reduced form coef cients matrix an M M reduced form covariance matrix Simply counting parameters in the structure and reduced forms yields an excess of l M 2 KM 1 M M 1 KM 1 M M 1 M 2 2 2 which is as might be expected from the earlier results the number of unknown elements in Without further information identi cation is clearly impossible The additional information comes in several forms 1 Normalizations In each equation one variable has a coef cient of 1 This normalization is a necessary scaling of the equation that is logically equivalent to putting one variable on the left hand side of a regression For purposes of identi cation and some estimation methods the choice among the endogenous variables is arbitrary But at the time the model is formulated each equation will usually have some natural dependent variable The normalization does not identify the dependent variable in any formal or causal sense For example in a model of supply and demand both the demand
    12 For

    other interpretations see Amemiya 1985 p 230 and Gabrielsen 1978 Some deeper theoretical results on identi cation of parameters in econometric models are given by Bekker and Wansbeek 2001

    Greene 50240

    book

    June 19 2002

    10 10

    390

    CHAPTER 15 Simultaneous Equations Models

    equation, $Q = f(P, x)$, and the inverse demand equation, $P = g(Q, x)$, are appropriate specifications of the relationship between price and quantity. We note, though, the following: With the normalizations, there are $M(M-1)$, not $M^2$, undetermined values in $\Gamma$ and this many indeterminacies in the model to be resolved through nonsample information.

    2. Identities. In some models, variable definitions or equilibrium conditions imply that all the coefficients in a particular equation are known. In the preceding market example, there are three equations, but the third is the equilibrium condition $Q_d = Q_s$. Klein's Model I (Example 15.3) contains six equations, including two accounting identities and the equilibrium condition. There is no question of identification with respect to identities. They may be carried as additional equations in the model, as we do with Klein's Model I in several later examples, or built into the model a priori, as is typical in models of supply and demand.

    The substantive nonsample information that will be used in identifying the model will consist of the following:

    3. Exclusions. The omission of variables from an equation places zeros in $B$ and $\Gamma$. In Example 15.5, the exclusion of income from the supply equation served to identify its parameters.

    4. Linear restrictions. Restrictions on the structural parameters may also serve to rule out false structures. For example, a long-standing problem in the estimation of production models using time-series data is the inability to disentangle the effects of economies of scale from those of technological change. In some treatments, the solution is to assume that there are constant returns to scale, thereby identifying the effects due to technological change.

    5. Restrictions on the disturbance covariance matrix. In the identification of a model, these are similar to restrictions on the slope parameters. For example, if the previous market model were to apply to a microeconomic setting, then it would probably be reasonable to assume that the structural disturbances in these supply and demand equations are uncorrelated. Section 15.3.3 shows a case in which a covariance restriction identifies an otherwise unidentified model.

    To formalize the identification criteria, we require a notation for a single equation. The coefficients of the jth equation are contained in the jth columns of $\Gamma$ and $B$. The jth equation is

    $$ y'\Gamma_j + x'B_j = \varepsilon_j. \tag{15-6} $$

    (For convenience, we have dropped the observation subscript.) In this equation, we know that (1) one of the elements in $\Gamma_j$ is one and (2) some variables that appear elsewhere in the model are excluded from this equation. Table 15.1 defines the notation used to incorporate these restrictions in (15-6). Equation j may be written

    $$ y_j = Y_j'\gamma_j + Y_j^{*\prime}\gamma_j^* + x_j'\beta_j + x_j^{*\prime}\beta_j^* + \varepsilon_j. $$


    TABLE 15.1  Components of Equation j (Dependent Variable = $y_j$)

                  Endogenous Variables             Exogenous Variables
    Included      $Y_j$ = $M_j$ variables          $x_j$ = $K_j$ variables
    Excluded      $Y_j^*$ = $M_j^*$ variables      $x_j^*$ = $K_j^*$ variables

    The number of equations is $M_j + M_j^* + 1 = M$. The number of exogenous variables is $K_j + K_j^* = K$. The coefficient on $y_j$ in equation j is 1. *s will always be associated with excluded variables.

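    The exclusions imply that $\gamma_j^* = 0$ and $\beta_j^* = 0$. Thus,

    $$ \Gamma_j' = [\,1 \;\; -\gamma_j' \;\; 0'\,] \quad \text{and} \quad B_j' = [\,-\beta_j' \;\; 0'\,]. $$

    (Note the sign convention.) For this equation, we partition the reduced-form coefficient matrix in the same fashion:

    $$ [\,y_j \;\; Y_j' \;\; Y_j^{*\prime}\,] = [\,x_j' \;\; x_j^{*\prime}\,] \begin{bmatrix} \pi_j & \Pi_j & \bar\Pi_j \\ \pi_j^* & \Pi_j^* & \bar\Pi_j^* \end{bmatrix} + [\,v_j \;\; V_j' \;\; V_j^{*\prime}\,], \tag{15-7} $$

    where the three column blocks of the partitioned matrix have 1, $M_j$, and $M_j^*$ columns and the two row blocks have $K_j$ and $K_j^*$ rows.

    The reduced-form coefficient matrix is $\Pi = -B\Gamma^{-1}$, which implies that $\Pi\Gamma = -B$. The jth column of this matrix equation applies to the jth equation,

    $$ \Pi\Gamma_j = -B_j. $$

    Inserting the parts from Table 15.1 yields

    $$ \begin{bmatrix} \pi_j & \Pi_j & \bar\Pi_j \\ \pi_j^* & \Pi_j^* & \bar\Pi_j^* \end{bmatrix} \begin{bmatrix} 1 \\ -\gamma_j \\ 0 \end{bmatrix} = \begin{bmatrix} \beta_j \\ 0 \end{bmatrix}. $$

    Now extract the two subequations,

    $$ \pi_j - \Pi_j\gamma_j = \beta_j \quad (K_j \text{ equations}), \tag{15-8} $$

    $$ \pi_j^* - \Pi_j^*\gamma_j = 0 \quad (K_j^* \text{ equations}). \tag{15-9} $$

    The solution for $B$ in terms of $\Gamma$ that we observed at the beginning of this discussion is in (15-8). Equation (15-9) may be written

    $$ \Pi_j^*\gamma_j = \pi_j^*. \tag{15-10} $$

    This system is $K_j^*$ equations in $M_j$ unknowns. If they can be solved for $\gamma_j$, then (15-8) gives the solution for $\beta_j$ and the equation is identified. For there to be a solution, there must be at least as many equations as unknowns, which leads to the following condition.

    DEFINITION 15.1 Order Condition for Identification of Equation j

    $$ K_j^* \ge M_j. \tag{15-11} $$

    The number of exogenous variables excluded from equation j must be at least as large as the number of endogenous variables included in equation j.

    The order condition is only a counting rule. It is a necessary but not sufficient condition for identification. It ensures that (15-10) has at least one solution, but it does not ensure that it has only one solution. The sufficient condition for uniqueness follows.

    DEFINITION 15.2 Rank Condition for Identification

    $$ \operatorname{rank}[\pi_j^*, \Pi_j^*] = \operatorname{rank}[\Pi_j^*] = M_j. $$

    This condition imposes a restriction on a submatrix of the reduced-form coefficient matrix.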

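    The rank condition ensures that there is exactly one solution for the structural parameters given the reduced-form parameters. Our alternative approach to the identification problem was to use the prior restrictions on $[\Gamma, B]$ to eliminate all false structures. An equivalent condition based on this approach is simpler to apply and has more intuitive appeal. We first rearrange the structural coefficients in the matrix

    $$ A = \begin{bmatrix} \Gamma \\ B \end{bmatrix} = \begin{bmatrix} 1 & A_1 \\ -\gamma_j & A_2 \\ 0 & A_3 \\ -\beta_j & A_4 \\ 0 & A_5 \end{bmatrix} = [\,a_j \;\; A_j\,]. \tag{15-12} $$

    The jth column in a false structure $[\Gamma F, BF]$ (i.e., the imposter for our equation j) would be $[\Gamma f_j, B f_j]$, where $f_j$ is the jth column of $F$. This new jth equation is to be built up as a linear combination of the old one and the other equations in the model. Thus, partitioning as previously,

    $$ \tilde a_j = \begin{bmatrix} 1 & A_1 \\ -\gamma_j & A_2 \\ 0 & A_3 \\ -\beta_j & A_4 \\ 0 & A_5 \end{bmatrix} \begin{bmatrix} f^0 \\ f^1 \end{bmatrix} = \begin{bmatrix} 1 \\ -\tilde\gamma_j \\ 0 \\ -\tilde\beta_j \\ 0 \end{bmatrix}. $$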


    If this hybrid is to have the same variables as the original, then it must have nonzero elements in the same places, which can be ensured by taking $f^0 = 1$, and zeros in the same positions as the original $a_j$. Extracting the third and fifth blocks of rows, if $\tilde a_j$ is to be admissible, then it must meet the requirement

    $$ \begin{bmatrix} A_3 \\ A_5 \end{bmatrix} f^1 = 0. $$

    This equality is not possible for a nonzero $f^1$ if the $(M_j^* + K_j^*) \times (M - 1)$ matrix in brackets has full column rank, so we have the equivalent rank condition,

    $$ \operatorname{rank}\begin{bmatrix} A_3 \\ A_5 \end{bmatrix} = M - 1. $$

    The corresponding order condition is that the matrix in brackets must have at least as many rows as columns. Thus, $M_j^* + K_j^* \ge M - 1$. But since $M = M_j + M_j^* + 1$, this condition is the same as the order condition in (15-11). The equivalence of the two rank conditions is pursued in the exercises. The preceding provides a simple method for checking the rank and order conditions. We need only arrange the structural parameters in a tableau and examine the relevant submatrices one at a time. $A_3$ and $A_5$ are the structural coefficients in the other equations on the variables that are excluded from equation j. One rule of thumb is sometimes useful in checking the rank and order conditions of a model: If every equation has its own predetermined variable, the entire model is identified. The proof is simple and is left as an exercise. For a final example, we consider a somewhat larger model.
    Example 15.6 Identification of Klein's Model I

    The structural coefficients in the six equations of Klein's Model I, transposed and multiplied by $-1$ for convenience, are listed in Table 15.2. Identification of the consumption function requires that the matrix $[A_3', A_5']$ have rank 5. The columns of this matrix are contained in boxes in the table. None of the columns indicated by arrows can be formed as linear combinations of the other columns, so the rank condition is satisfied. Verification of the rank and order conditions for the other two equations is left as an exercise.

    TABLE 15.2  Klein's Model I, Structural Coefficients

    $$
    \begin{array}{l|rrrrrr|rrrrrrrr}
     & \multicolumn{6}{c|}{\Gamma'} & \multicolumn{8}{c}{B'} \\
     & C & I & W^p & X & P & K & 1 & W^g & G & T & A & P_{-1} & K_{-1} & X_{-1} \\ \hline
    C   & -1 & 0  & \alpha_3 & 0        & \alpha_1 & 0  & \alpha_0 & \alpha_3 & 0 & 0  & 0        & \alpha_2 & 0       & 0 \\
    I   & 0  & -1 & 0        & 0        & \beta_1  & 0  & \beta_0  & 0        & 0 & 0  & 0        & \beta_2  & \beta_3 & 0 \\
    W^p & 0  & 0  & -1       & \gamma_1 & 0        & 0  & \gamma_0 & 0        & 0 & 0  & \gamma_3 & 0        & 0       & \gamma_2 \\
    X   & 1  & 1  & 0        & -1       & 0        & 0  & 0        & 0        & 1 & 0  & 0        & 0        & 0       & 0 \\
    P   & 0  & 0  & -1       & 1        & -1       & 0  & 0        & 0        & 0 & -1 & 0        & 0        & 0       & 0 \\
    K   & 0  & 1  & 0        & 0        & 0        & -1 & 0        & 0        & 0 & 0  & 0        & 0        & 1       & 0
    \end{array}
    $$

    (For the consumption function, $A_3'$ consists of the columns for $I$, $X$, and $K$, and $A_5'$ of the columns for $G$, $T$, $A$, $K_{-1}$, and $X_{-1}$, in the rows of the other five equations. The boxes and arrows of the printed table are not reproduced here.)

    It is unusual for a model to pass the order but not the rank condition. Generally, either the conditions are obvious or the model is so large and has so many predetermined variables that the conditions are met trivially. In practice, it is simple to check both conditions for a small model. For a large model, frequently only the order condition is verified. We distinguish three cases:

    1. Underidentified. $K_j^* < M_j$ or rank condition fails.
    2. Exactly identified. $K_j^* = M_j$ and rank condition is met.
    3. Overidentified. $K_j^* > M_j$ and rank condition is met.
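The tableau check described above can be sketched numerically. The following is a minimal illustration, not a general-purpose tool: the function name `identification_status` is hypothetical, only exclusion (zero) restrictions are examined, and the coefficient values assigned to Klein's Model I are made up for the purpose of the rank calculation (any generic nonzero values would serve).

```python
import numpy as np

def identification_status(A, j):
    """Classify equation j from the tableau A = [Gamma; B].

    Rows of A are variables (endogenous first, then exogenous) and
    columns are equations. Only exclusion restrictions are considered.
    """
    A = np.asarray(A, dtype=float)
    M = A.shape[1]                               # number of equations
    excluded = A[:, j] == 0.0                    # variables excluded from equation j
    sub = np.delete(A, j, axis=1)[excluded, :]   # stacked [A3; A5]
    rank = np.linalg.matrix_rank(sub) if sub.size else 0
    if rank < M - 1:                             # order or rank condition fails
        return "underidentified"
    return "exactly identified" if sub.shape[0] == M - 1 else "overidentified"

# Hypothetical coefficient values (illustration only, not estimates).
a0, a1, a2, a3 = 16.0, 0.2, 0.1, 0.8
b0, b1, b2, b3 = 20.0, 0.5, 0.3, -0.15
g0, g1, g2, g3 = 1.5, 0.4, 0.15, 0.13

# Rows: equations C, I, Wp, X, P, K (transposed and multiplied by -1).
# Columns: C, I, Wp, X, P, K | 1, Wg, G, T, A, P(-1), K(-1), X(-1).
klein = np.array([
    [-1,  0, a3,  0, a1,  0,  a0, a3, 0,  0,  0, a2,  0,  0],
    [ 0, -1,  0,  0, b1,  0,  b0,  0, 0,  0,  0, b2, b3,  0],
    [ 0,  0, -1, g1,  0,  0,  g0,  0, 0,  0, g3,  0,  0, g2],
    [ 1,  1,  0, -1,  0,  0,   0,  0, 1,  0,  0,  0,  0,  0],
    [ 0,  0, -1,  1, -1,  0,   0,  0, 0, -1,  0,  0,  0,  0],
    [ 0,  1,  0,  0,  0, -1,   0,  0, 0,  0,  0,  0,  1,  0],
])
A_klein = -klein.T                               # back to the [Gamma; B] orientation

# The three behavioral equations: consumption, investment, private wages.
status = [identification_status(A_klein, j) for j in range(3)]
print(status)
```

With these generic values, each of the three behavioral equations is classified as overidentified. Because the rank depends on the numerical values of the unknown coefficients, a check of this kind indicates what holds generically, not what holds for every parameter point.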
    15.3.2 IDENTIFICATION THROUGH OTHER NONSAMPLE INFORMATION

    The rank and order conditions given in the preceding section apply to identification of an equation through exclusion restrictions. Intuition might suggest that other types of nonsample information should be equally useful in securing identification. To take a specific example, suppose that in Example 15.5, it is known that $\beta_2$ equals 2, not 0. The second equation could then be written as

    $$ q_s - 2z = q_s^* = \beta_0 + \beta_1 p + \beta_2^* z + \varepsilon_s. $$

    But we know that $\beta_2^* = 0$, so the supply equation is identified by this restriction. As this example suggests, a linear restriction on the parameters within an equation is, for identification purposes, essentially the same as an exclusion.[13] By an appropriate manipulation, that is, by solving out the restriction, we can turn the restriction into one more exclusion. The order condition that emerges is

    $$ n_j \ge M - 1, $$

    where $n_j$ is the total number of restrictions. Since $M - 1 = M_j + M_j^*$ and $n_j$ is the number of exclusions plus $r_j$, the number of additional restrictions, this condition is equivalent to

    $$ r_j + K_j^* + M_j^* \ge M_j + M_j^* \quad \text{or} \quad r_j + K_j^* \ge M_j. $$

    This result is the same as (15-11) save for the addition of the number of restrictions, which is the result suggested previously.
    [13] The analysis is more complicated if the restrictions are across equations, that is, involve the parameters in more than one equation. Kelly (1975) contains a number of results and examples.

    15.3.3 IDENTIFICATION THROUGH COVARIANCE RESTRICTIONS: THE FULLY RECURSIVE MODEL

    The observant reader will have noticed that no mention of $\Sigma$ is made in the preceding discussion. To this point, all the information provided by $\Omega$ is used in the estimation of $\Sigma$; for given $\Gamma$, the relationship between $\Omega$ and $\Sigma$ is one-to-one. Recall that $\Sigma = \Gamma'\Omega\Gamma$. But if restrictions are placed on $\Sigma$, then there is more information in $\Omega$ than is needed for estimation of $\Sigma$. The excess information can be used instead to help infer the elements in $\Gamma$. A useful case is that of zero covariances across the disturbances.[14] Once again, it is most convenient to consider this case in terms of a false structure. If the structure is $[\Gamma, B, \Sigma]$, then a false structure would have parameters

    $$ [\tilde\Gamma, \tilde B, \tilde\Sigma] = [\Gamma F, BF, F'\Sigma F]. $$

    If any of the elements in $\Sigma$ are zero, then the false structure must preserve those restrictions to be admissible. For example, suppose that we specify that $\sigma_{12} = 0$. Then it must also be true that $\tilde\sigma_{12} = f_1'\Sigma f_2 = 0$, where $f_1$ and $f_2$ are columns of $F$. As such, there is a restriction on $F$ that may identify the model. The fully recursive model is an important special case of the preceding result. A triangular system is

    $$
    \begin{aligned}
    y_1 &= \beta_1'x + \varepsilon_1, \\
    y_2 &= \gamma_{12}y_1 + \beta_2'x + \varepsilon_2, \\
    &\;\;\vdots \\
    y_M &= \gamma_{1M}y_1 + \gamma_{2M}y_2 + \cdots + \gamma_{M-1,M}\,y_{M-1} + \beta_M'x + \varepsilon_M.
    \end{aligned}
    $$

    We place no restrictions on $B$. The first equation is identified, since it is already in reduced form. But for any of the others, linear combinations of it and the ones above it involve the same variables. Thus, we conclude that without some identifying restrictions, only the parameters of the first equation in a triangular system are identified. But suppose that $\Sigma$ is diagonal. Then the entire model is identified, as we now prove. As usual, we attempt to find a false structure that satisfies the restrictions of the model. The jth column of $F$, $f_j$, is the coefficients in a linear combination of the equations that will be an imposter for equation j. Many $f_j$'s are already precluded.

    1. $f_1$ must be the first column of an identity matrix. The first equation is identified and normalized on $y_1$.
    2. In all remaining columns of $F$, all elements below the diagonal must be zero, since an equation can only involve the ys in it or in the equations above it.

    Without further restrictions, any upper triangular $F$ is an admissible transformation. But with a diagonal $\Sigma$, we have more information. Consider the second column. Since $\tilde\Sigma$ must be diagonal, $f_1'\Sigma f_2 = 0$. But given $f_1$ in 1 above,

    $$ f_1'\Sigma f_2 = \sigma_{11}f_{12} = 0, $$

    so $f_{12} = 0$. The second column of $F$ is now complete and is equal to the second column of $I$. Continuing in the same manner, we find that

    $$ f_1'\Sigma f_3 = 0 \quad \text{and} \quad f_2'\Sigma f_3 = 0 $$

    will suffice to establish that $f_3$ is the third column of $I$. In this fashion, it can be shown that the only admissible $F$ is $F = I$, which was to be shown. With $\Gamma$ upper triangular, $M(M-1)/2$ unknown parameters remained. That is exactly the number of restrictions placed on $\Sigma$ when it was assumed to be diagonal.

    [14] More general cases are discussed in Hausman (1983) and Judge et al. (1985).


    15.4 METHODS OF ESTIMATION

    It is possible to estimate the reduced-form parameters, $\Pi$ and $\Omega$, consistently by ordinary least squares. But except for forecasting $y$ given $x$, these are generally not the parameters of interest; $\Gamma$, $B$, and $\Sigma$ are. The ordinary least squares (OLS) estimators of the structural parameters are inconsistent, ostensibly because the included endogenous variables in each equation are correlated with the disturbances. Still, it is at least of passing interest to examine what is estimated by ordinary least squares, particularly in view of its widespread use (despite its inconsistency). Since the proof of identification was based on solving for $\Gamma$, $B$, and $\Sigma$ from $\Pi$ and $\Omega$, one way to proceed is to apply our finding to the sample estimates, $P$ and $W$. This indirect least squares approach is feasible but inefficient. Worse, there will usually be more than one possible estimator and no obvious means of choosing among them. There are two approaches for direct estimation, both based on the principle of instrumental variables. It is possible to estimate each equation separately using a limited information estimator. But the same principle that suggests that joint estimation brings efficiency gains in the seemingly unrelated regressions setting of the previous chapter is at work here, so we shall also consider full information or system methods of estimation.

    15.5 SINGLE EQUATION: LIMITED INFORMATION ESTIMATION METHODS

    Estimation of the system one equation at a time has the benefit of computational simplicity. But because these methods neglect information contained in the other equations, they are labeled limited information methods.

    15.5.1 ORDINARY LEAST SQUARES

    For all T observations, the nonzero terms in the jth equation are

    $$ y_j = Y_j\gamma_j + X_j\beta_j + \varepsilon_j = Z_j\delta_j + \varepsilon_j. $$

    The M reduced-form equations are $Y = X\Pi + V$. For the included endogenous variables $Y_j$, the reduced forms are the $M_j$ appropriate columns of $\Pi$ and $V$, written

    $$ Y_j = X\Pi_j + V_j. \tag{15-13} $$

    [Note that $\Pi_j$ is the middle part of $\Pi$ shown in (15-7).] Likewise, $V_j$ is $M_j$ columns of $V = E\Gamma^{-1}$. This least squares estimator is

    $$ d_j = [Z_j'Z_j]^{-1}Z_j'y_j = \delta_j + \begin{bmatrix} Y_j'Y_j & Y_j'X_j \\ X_j'Y_j & X_j'X_j \end{bmatrix}^{-1} \begin{bmatrix} Y_j'\varepsilon_j \\ X_j'\varepsilon_j \end{bmatrix}. $$

    None of the terms in the inverse matrix converge to 0. Although $\operatorname{plim}(1/T)X_j'\varepsilon_j = 0$, $\operatorname{plim}(1/T)Y_j'\varepsilon_j$ is nonzero, which means that both parts of $d_j$ are inconsistent. (This is the simultaneous equations bias of least squares.) Although we can say with certainty that $d_j$ is inconsistent, we cannot state how serious this problem is. OLS does have the virtue of computational simplicity, although with modern software, this virtue is extremely modest. For better or worse, OLS is a very commonly used estimator in this context. We will return to this issue later in a comparison of several estimators.

    An intuitively appealing form of simultaneous equations model is the triangular system that we examined in Section 15.3.3,

    $$
    \begin{aligned}
    (1)\quad y_1 &= x'\beta_1 + \varepsilon_1, \\
    (2)\quad y_2 &= x'\beta_2 + \gamma_{12}y_1 + \varepsilon_2, \\
    (3)\quad y_3 &= x'\beta_3 + \gamma_{13}y_1 + \gamma_{23}y_2 + \varepsilon_3,
    \end{aligned}
    $$

    and so on. If $\Gamma$ is triangular and $\Sigma$ is diagonal, so that the disturbances are uncorrelated, then the system is a fully recursive model. (No restrictions are placed on $B$.) It is easy to see that in this case, the entire system may be estimated consistently (and, as we shall show later, efficiently) by ordinary least squares. The first equation is a classical regression model. In the second equation, $\operatorname{Cov}(y_1, \varepsilon_2) = \operatorname{Cov}(x'\beta_1 + \varepsilon_1, \varepsilon_2) = 0$, so it too may be estimated by ordinary least squares. Proceeding in the same fashion to (3), it is clear that $y_1$ and $\varepsilon_3$ are uncorrelated. Likewise, if we substitute (1) in (2) and then the result for $y_2$ in (3), then we find that $y_2$ is also uncorrelated with $\varepsilon_3$. Continuing in this way, we find that in every equation the full set of right-hand variables is uncorrelated with the respective disturbance. The result is that the fully recursive model may be consistently estimated using equation-by-equation ordinary least squares. In the more general case, in which $\Sigma$ is not diagonal, the preceding argument does not apply.
    15.5.2 ESTIMATION BY INSTRUMENTAL VARIABLES

    In the next several sections, we will discuss various methods of consistent and efficient estimation. As will be evident quite soon, there is a surprisingly long menu of choices. It is a useful result that all of the methods in general use can be placed under the umbrella of instrumental variable (IV) estimators. Returning to the structural form, we first consider direct estimation of the jth equation,

    $$ y_j = Y_j\gamma_j + X_j\beta_j + \varepsilon_j = Z_j\delta_j + \varepsilon_j. \tag{15-14} $$

    As we saw previously, the OLS estimator of $\delta_j$ is inconsistent because of the correlation of $Z_j$ and $\varepsilon_j$. A general method of obtaining consistent estimates is the method of instrumental variables. (See Section 5.4.) Let $W_j$ be a $T \times (M_j + K_j)$ matrix that satisfies the requirements for an IV estimator,

    $$ \operatorname{plim}(1/T)W_j'Z_j = \Sigma_{wz} = \text{a finite nonsingular matrix}, \tag{15-15a} $$

    $$ \operatorname{plim}(1/T)W_j'\varepsilon_j = 0, \tag{15-15b} $$

    $$ \operatorname{plim}(1/T)W_j'W_j = \Sigma_{ww} = \text{a positive definite matrix}. \tag{15-15c} $$

    Then the IV estimator,

    $$ \hat\delta_{j,IV} = [W_j'Z_j]^{-1}W_j'y_j, $$

    will be consistent and have asymptotic covariance matrix

    $$ \text{Asy.Var}[\hat\delta_{j,IV}] = \frac{\sigma_{jj}}{T}\operatorname{plim}\left[\frac{1}{T}W_j'Z_j\right]^{-1}\left[\frac{1}{T}W_j'W_j\right]\left[\frac{1}{T}Z_j'W_j\right]^{-1} = \frac{\sigma_{jj}}{T}\left[\Sigma_{wz}^{-1}\Sigma_{ww}\Sigma_{zw}^{-1}\right]. \tag{15-16} $$

    A consistent estimator of $\sigma_{jj}$ is

    $$ \hat\sigma_{jj} = \frac{(y_j - Z_j\hat\delta_{j,IV})'(y_j - Z_j\hat\delta_{j,IV})}{T}, \tag{15-17} $$

    which is the familiar sum of squares of the estimated disturbances. A degrees of freedom correction for the denominator, $T - M_j - K_j$, is sometimes suggested. Asymptotically, the correction is immaterial. Whether it is beneficial in a small sample remains to be settled. The resulting estimator is not unbiased in any event, as it would be in the classical regression model. In the interest of simplicity (only), we shall omit the degrees of freedom correction in what follows. Current practice in most applications is to make the correction.

    The various estimators that have been developed for simultaneous-equations models are all IV estimators. They differ in the choice of instruments and in whether the equations are estimated one at a time or jointly. We divide them into two classes, limited information or full information, on this basis.
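The IV formulas above translate almost line for line into matrix code. The sketch below is illustrative only: the function name `iv_estimator` and the `df_correction` switch are hypothetical, and the covariance matrix is the sample analog of (15-16), with $\sigma_{jj}$ estimated as in (15-17).

```python
import numpy as np

def iv_estimator(W, Z, y, df_correction=False):
    """IV estimator delta = (W'Z)^{-1} W'y with estimated asymptotic covariance.

    W and Z are T x (M_j + K_j) arrays and y is a length-T vector.
    """
    W, Z, y = (np.asarray(a, dtype=float) for a in (W, Z, y))
    T, n_coef = Z.shape
    WZ = W.T @ Z
    delta = np.linalg.solve(WZ, W.T @ y)
    resid = y - Z @ delta                        # residuals use the original Z
    denom = T - n_coef if df_correction else T   # optional degrees of freedom correction
    sigma_jj = float(resid @ resid) / denom
    WZ_inv = np.linalg.inv(WZ)
    # sigma_jj * (W'Z)^{-1} (W'W) (Z'W)^{-1}; note (Z'W)^{-1} = [(W'Z)^{-1}]'
    cov = sigma_jj * WZ_inv @ (W.T @ W) @ WZ_inv.T
    return delta, sigma_jj, cov
```

Setting `W = Z` reproduces ordinary least squares and its conventional covariance estimator, which offers a quick sanity check on an implementation of this kind.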
    15.5.3 TWO-STAGE LEAST SQUARES

    The method of two-stage least squares is the most common method used for estimating simultaneous-equations models. We developed the full set of results for this estimator in Section 5.4. By merely changing notation slightly, the results of Section 5.4 are exactly the derivation of the estimator we will describe here. Thus, you might want to review this section before continuing.

    The two-stage least squares (2SLS) method consists of using as the instruments for $Y_j$ the predicted values in a regression of $Y_j$ on all the x's in the system:

    $$ \hat Y_j = X[(X'X)^{-1}X'Y_j] = XP_j. \tag{15-18} $$

    It can be shown that absent heteroscedasticity or autocorrelation, this produces the most efficient IV estimator that can be formed using only the columns of $X$. Note the emulation of $E[Y_j] = X\Pi_j$ in the result. The 2SLS estimator is, thus,

    $$ \hat\delta_{j,2SLS} = \begin{bmatrix} \hat Y_j'Y_j & \hat Y_j'X_j \\ X_j'Y_j & X_j'X_j \end{bmatrix}^{-1} \begin{bmatrix} \hat Y_j'y_j \\ X_j'y_j \end{bmatrix}. \tag{15-19} $$

    Before proceeding, it is important to emphasize the role of the identification condition in this result. In the matrix $[\hat Y_j, X_j]$, which has $M_j + K_j$ columns, all columns are linear functions of the $K$ columns of $X$. There exist, at most, $K$ linearly independent combinations of the columns of $X$. If the equation is not identified, then $M_j + K_j$ is greater than $K$, and $[\hat Y_j, X_j]$ will not have full column rank. In this case, the 2SLS estimator cannot be computed. If, however, the order condition but not the rank condition is met, then although the 2SLS estimator can be computed, it is not a consistent estimator.

    There are a few useful simplifications. First, since $X(X'X)^{-1}X' = (I - M)$ is idempotent, $\hat Y_j'Y_j = \hat Y_j'\hat Y_j$. Second, $X_j'X(X'X)^{-1}X' = X_j'$ implies that $X_j'Y_j = X_j'\hat Y_j$. Thus, (15-19) can also be written

    $$ \hat\delta_{j,2SLS} = \begin{bmatrix} \hat Y_j'\hat Y_j & \hat Y_j'X_j \\ X_j'\hat Y_j & X_j'X_j \end{bmatrix}^{-1} \begin{bmatrix} \hat Y_j'y_j \\ X_j'y_j \end{bmatrix}. \tag{15-20} $$

    The 2SLS estimator is obtained by ordinary least squares regression of $y_j$ on $\hat Y_j$ and $X_j$. Thus, the name stems from the two regressions in the procedure:

    1. Stage 1. Obtain the least squares predictions from regression of $Y_j$ on $X$.
    2. Stage 2. Estimate $\delta_j$ by least squares regression of $y_j$ on $\hat Y_j$ and $X_j$.
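The two stages can be sketched directly, and a small simulation shows the simultaneous equations bias of OLS alongside the 2SLS estimate. Everything about the data is an assumption made for illustration: a hypothetical market with demand $q = -p + x + \varepsilon_d$ and supply $q = p + z + \varepsilon_s$, standard normal $x$, $z$, and disturbances, and a fixed random seed. The function name `two_stage_least_squares` is likewise hypothetical.

```python
import numpy as np

def two_stage_least_squares(y, Y, Xj, X):
    """2SLS for y = Y gamma + Xj beta + eps, using all exogenous variables X."""
    # Stage 1: least squares predictions of the included endogenous variables.
    P = np.linalg.lstsq(X, Y, rcond=None)[0]
    Y_hat = X @ P
    # Stage 2: least squares regression of y on [Y_hat, Xj].
    Z_hat = np.column_stack([Y_hat, Xj])
    delta = np.linalg.lstsq(Z_hat, y, rcond=None)[0]
    # The disturbance variance is computed with the original data, not Z_hat.
    resid = y - np.column_stack([Y, Xj]) @ delta
    sigma_jj = resid @ resid / len(y)
    cov = sigma_jj * np.linalg.inv(Z_hat.T @ Z_hat)
    return delta, cov

# Simulated market (hypothetical parameter values).
rng = np.random.default_rng(12345)
T = 20_000
x, z = rng.normal(size=T), rng.normal(size=T)
e_d, e_s = rng.normal(size=T), rng.normal(size=T)
p = (x - z + e_d - e_s) / 2.0        # equilibrium price from equating demand and supply
q = p + z + e_s                      # equilibrium quantity (read off the supply curve)

ones = np.ones(T)
X = np.column_stack([ones, x, z])    # all exogenous variables in the system
Xj = np.column_stack([ones, x])      # exogenous variables included in the demand equation
Y = p[:, None]                       # included endogenous variable

d_2sls, cov_2sls = two_stage_least_squares(q, Y, Xj, X)
d_ols = np.linalg.lstsq(np.column_stack([Y, Xj]), q, rcond=None)[0]
print("price coefficient, OLS :", d_ols[0])
print("price coefficient, 2SLS:", d_2sls[0])
```

In this design the true price coefficient in the demand equation is $-1$, while the probability limit of the OLS coefficient works out to $-1/3$, so the OLS figure should sit far from $-1$ and the 2SLS figure close to it.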

    A direct proof of the consistency of the 2SLS estimator requires only that we establish that it is a valid IV estimator. For (15-15a), we require

    $$ \operatorname{plim}\begin{bmatrix} \hat Y_j'Y_j/T & \hat Y_j'X_j/T \\ X_j'Y_j/T & X_j'X_j/T \end{bmatrix} = \operatorname{plim}\begin{bmatrix} P_j'X'(X\Pi_j + V_j)/T & P_j'X'X_j/T \\ X_j'(X\Pi_j + V_j)/T & X_j'X_j/T \end{bmatrix} $$

    to be a finite nonsingular matrix. We have used (15-13) for $Y_j$. The matrix is a continuous function of $P_j$, which has $\operatorname{plim} P_j = \Pi_j$. The Slutsky theorem thus allows us to substitute $\Pi_j$ for $P_j$ in the probability limit. That the parts converge to a finite matrix follows from (15-3) and (15-5). It will be nonsingular if $\Pi_j$ has full column rank, which, in turn, will be true if the equation is identified.[15] For (15-15b), we require that

    $$ \operatorname{plim}\frac{1}{T}\begin{bmatrix} \hat Y_j'\varepsilon_j \\ X_j'\varepsilon_j \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. $$

    The second part is assumed in (15-4). For the first, by direct substitution,

    $$ \operatorname{plim}\frac{1}{T}\hat Y_j'\varepsilon_j = \operatorname{plim}\left(\frac{Y_j'X}{T}\right)\left(\frac{X'X}{T}\right)^{-1}\left(\frac{X'\varepsilon_j}{T}\right). $$

    The third part on the right converges to zero, whereas the other two converge to finite matrices, which confirms the result. Since $\hat\delta_{j,2SLS}$ is an IV estimator, we can just invoke Theorem 5.3 for the asymptotic distribution. A proof of asymptotic efficiency requires the establishment of the benchmark, which we shall do in the discussion of the MLE.

    As a final shortcut that is useful for programming purposes, we note that if $X_j$ is regressed on $X$, then a perfect fit is obtained, so $\hat X_j = X_j$. Using the idempotent matrix $(I - M)$, (15-20) becomes

    $$ \hat\delta_{j,2SLS} = \begin{bmatrix} Y_j'(I-M)Y_j & Y_j'(I-M)X_j \\ X_j'(I-M)Y_j & X_j'(I-M)X_j \end{bmatrix}^{-1} \begin{bmatrix} Y_j'(I-M)y_j \\ X_j'(I-M)y_j \end{bmatrix}. $$

    Thus,

    $$ \hat\delta_{j,2SLS} = [\hat Z_j'\hat Z_j]^{-1}\hat Z_j'y_j = [(Z_j'X)(X'X)^{-1}(X'Z_j)]^{-1}(Z_j'X)(X'X)^{-1}X'y_j, \tag{15-21} $$

    where all columns of $\hat Z_j$ are obtained as predictions in a regression of the corresponding column of $Z_j$ on $X$. This equation also results in a useful simplification of the estimated asymptotic covariance matrix,

    $$ \text{Est.Asy.Var}[\hat\delta_{j,2SLS}] = \hat\sigma_{jj}[\hat Z_j'\hat Z_j]^{-1}. $$

    It is important to note that $\sigma_{jj}$ is estimated by

    $$ \hat\sigma_{jj} = \frac{(y_j - Z_j\hat\delta_j)'(y_j - Z_j\hat\delta_j)}{T}, $$

    using the original data, not $\hat Z_j$.

    [15] Schmidt (1976, pp. 150-151) provides a proof of this result.

    15.5.4 GMM ESTIMATION

    The GMM estimator in Section 10.4 is, with a minor change of notation, precisely the set of procedures we have been using here. Using this method, however, will allow us to generalize the covariance structure for the disturbances. We assume that

    $$ y_{jt} = z_{jt}'\delta_j + \varepsilon_{jt}, $$

    where $z_{jt} = [Y_{jt}, x_{jt}]$ (we use the capital $Y_{jt}$ to denote the $M_j$ included endogenous variables). Thus far, we have assumed that $\varepsilon_{jt}$ in the jth equation is neither heteroscedastic nor autocorrelated. There is no need to impose those assumptions at this point. Autocorrelation in the context of a simultaneous equations model is a substantial complication, however. For the present, we will consider the heteroscedastic case only.

    The assumptions of the model provide the orthogonality conditions,

    $$ E[x_t\varepsilon_{jt}] = E[x_t(y_{jt} - z_{jt}'\delta_j)] = 0. $$

    If $x_t$ is taken to be the full set of exogenous variables in the model, then we obtain the criterion for the GMM estimator,

    $$ q = \left[\frac{e(z_t, \delta_j)'X}{T}\right]W_{jj}^{-1}\left[\frac{X'e(z_t, \delta_j)}{T}\right] = \bar m(\delta_j)'\,W_{jj}^{-1}\,\bar m(\delta_j), $$

    where

    $$ \bar m(\delta_j) = \frac{1}{T}\sum_{t=1}^{T} x_t(y_{jt} - z_{jt}'\delta_j) \quad \text{and} \quad W_{jj}^{-1} = \text{the GMM weighting matrix}. $$

    Once again, this is precisely the estimator defined in Section 10.4 [see (10-17)]. If the disturbances are assumed to be homoscedastic and nonautocorrelated, then the optimal weighting matrix will be an estimator of the inverse of

    $$
    \begin{aligned}
    W_{jj} &= \text{Asy.Var}[\sqrt{T}\,\bar m(\delta_j)] \\
    &= \operatorname{plim}\frac{1}{T}\sum_{t=1}^{T} x_t x_t'(y_{jt} - z_{jt}'\delta_j)^2 \\
    &= \operatorname{plim}\frac{1}{T}\sum_{t=1}^{T} \sigma_{jj}\,x_t x_t' \\
    &= \operatorname{plim}\frac{\sigma_{jj}(X'X)}{T}.
    \end{aligned}
    $$

    The constant $\sigma_{jj}$ is irrelevant to the solution. If we use $(X'X)^{-1}$ as the weighting matrix, then the GMM estimator that minimizes $q$ is the 2SLS estimator.

    The extension that we can obtain here is to allow for heteroscedasticity of unknown form. There is no need to rederive the earlier result. If the disturbances are heteroscedastic, then

    $$ W_{jj} = \operatorname{plim}\frac{1}{T}\sum_{t=1}^{T}\omega_{jj,t}\,x_t x_t' = \operatorname{plim}\frac{X'\Omega_{jj}X}{T}. $$

    The weighting matrix can be estimated with White's consistent estimator, see (10-23), if a consistent estimator of $\delta_j$ is in hand with which to compute the residuals. One is, since 2SLS ignoring the heteroscedasticity is consistent, albeit inefficient. The conclusion then is that under these assumptions, there is a way to improve on 2SLS by adding another step. The name 3SLS is reserved for the systems estimator of this sort. When choosing between 2.5-stage least squares and Davidson and MacKinnon's suggested heteroscedastic 2SLS, or H2SLS, we chose to opt for the latter. The estimator is based on the initial two-stage least squares procedure. Thus,

    $$ \hat\delta_{j,H2SLS} = [Z_j'X(S_{0,jj})^{-1}X'Z_j]^{-1}[Z_j'X(S_{0,jj})^{-1}X'y_j], $$

    where

    $$ S_{0,jj} = \sum_{t=1}^{T} x_t x_t'(y_{jt} - z_{jt}'\hat\delta_{j,2SLS})^2. $$

    The asymptotic covariance matrix is estimated with

    $$ \text{Est.Asy.Var}[\hat\delta_{j,H2SLS}] = [Z_j'X(S_{0,jj})^{-1}X'Z_j]^{-1}. $$

    Extensions of this estimator were suggested by Cragg (1983) and Cumby, Huizinga, and Obstfeld (1983).
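The H2SLS steps can be sketched as follows: compute 2SLS, build the White-type matrix from the 2SLS residuals, then reweight. The function name `h2sls` and the simulated data (one endogenous regressor, a constant, three exogenous variables, disturbance scale depending on the first exogenous variable) are hypothetical and serve only to exercise the formulas.

```python
import numpy as np

def h2sls(y, Z, X):
    """Heteroscedastic 2SLS: GMM with a White-type weighting matrix.

    Z holds the right-hand-side variables of the equation and X holds all
    exogenous variables in the system (the instruments).
    """
    ZX = Z.T @ X
    # Step 1: ordinary 2SLS, written as in (15-21).
    XtX_inv = np.linalg.inv(X.T @ X)
    d_2sls = np.linalg.solve(ZX @ XtX_inv @ ZX.T, ZX @ XtX_inv @ (X.T @ y))
    # Step 2: S0 = sum over t of x_t x_t' e_t^2, using the 2SLS residuals.
    e = y - Z @ d_2sls
    S0_inv = np.linalg.inv((X * (e ** 2)[:, None]).T @ X)
    # Step 3: reweighted estimator and its estimated asymptotic covariance.
    A = ZX @ S0_inv @ ZX.T
    d_h2sls = np.linalg.solve(A, ZX @ S0_inv @ (X.T @ y))
    return d_2sls, d_h2sls, np.linalg.inv(A)

# Hypothetical data with heteroscedastic disturbances.
rng = np.random.default_rng(0)
T = 500
X = np.column_stack([np.ones(T), rng.normal(size=(T, 3))])
v = rng.normal(size=T)
eps = (0.5 * v + rng.normal(size=T)) * (1.0 + np.abs(X[:, 1]))
p = X @ np.array([0.0, 1.0, 1.0, -1.0]) + v       # endogenous regressor
Z = np.column_stack([p, X[:, :2]])                # included: p, constant, first exogenous
y = Z @ np.array([0.7, 1.0, 2.0]) + eps
d_2sls, d_h2sls, cov_h2sls = h2sls(y, Z, X)
print(d_2sls, d_h2sls)
```

When the equation is exactly identified, $Z_j'X$ is square and the weighting matrix cancels, so the two estimators coincide; differences arise only with overidentification, as in the simulated example.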

    15.5.5 LIMITED INFORMATION MAXIMUM LIKELIHOOD AND THE k CLASS OF ESTIMATORS

    The limited information maximum likelihood (LIML) estimator is based on a single equation under the assumption of normally distributed disturbances; LIML is efficient among single-equation estimators. A full (lengthy) derivation of the log-likelihood is provided in Theil (1971) and Davidson and MacKinnon (1993). We will proceed to the practical aspects of this estimator and refer the reader to these sources for the background formalities. A result that emerges from the derivation is that the LIML estimator has the same asymptotic distribution as the 2SLS estimator, and the latter does not rely on an assumption of normality. This raises the question why one would use the LIML technique given the availability of the more robust (and computationally simpler) alternative. Small sample results are sparse, but they would favor 2SLS as well. [See Phillips (1983).] The one significant virtue of LIML is its invariance to the normalization of the equation. Consider an example in a system of equations,

    $$ y_1 = y_2\gamma_2 + y_3\gamma_3 + x_1\beta_1 + x_2\beta_2 + \varepsilon_1. $$


    An equivalent equation would be

    $$
    \begin{aligned}
    y_2 &= y_1(1/\gamma_2) + y_3(-\gamma_3/\gamma_2) + x_1(-\beta_1/\gamma_2) + x_2(-\beta_2/\gamma_2) + \varepsilon_1(-1/\gamma_2) \\
    &= y_1\tilde\gamma_1 + y_3\tilde\gamma_3 + x_1\tilde\beta_1 + x_2\tilde\beta_2 + \tilde\varepsilon_1.
    \end{aligned}
    $$

    The parameters of the second equation can be manipulated to produce those of the first. But, as you can easily verify, the 2SLS estimator is not invariant to the normalization of the equation; 2SLS would produce numerically different answers. LIML would give the same numerical solutions to both estimation problems suggested above.

    The LIML, or least variance ratio estimator, can be computed as follows.[16] Let

    $$ W_j^0 = E_j^{0\prime}E_j^0, \tag{15-22} $$

    where $Y_j^0 = [y_j, Y_j]$ and

    $$ E_j^0 = M_jY_j^0 = [I - X_j(X_j'X_j)^{-1}X_j']Y_j^0. \tag{15-23} $$

    Each column of $E_j^0$ is a set of least squares residuals in the regression of the corresponding column of $Y_j^0$ on $X_j$, that is, the exogenous variables that appear in the jth equation. Thus, $W_j^0$ is the matrix of sums of squares and cross products of these residuals. Define

    $$ W_j^1 = E_j^{1\prime}E_j^1 = Y_j^{0\prime}[I - X(X'X)^{-1}X']Y_j^0. \tag{15-24} $$

    That is, $W_j^1$ is defined like $W_j^0$ except that the regressions are on all the x's in the model, not just the ones in the jth equation. Let

    $$ \lambda_1 = \text{smallest characteristic root of } (W_j^1)^{-1}W_j^0. \tag{15-25} $$

    This matrix is asymmetric, but all its roots are real and greater than or equal to 1. Depending on the available software, it may be more convenient to obtain the identical smallest root of the symmetric matrix $D = (W_j^1)^{-1/2}W_j^0(W_j^1)^{-1/2}$. Now partition $W_j^0$ into

    $$ W_j^0 = \begin{bmatrix} w_{jj}^0 & w_j^{0\prime} \\ w_j^0 & W_{jj}^0 \end{bmatrix} $$

    corresponding to $[y_j, Y_j]$, and partition $W_j^1$ likewise. Then, with these parts in hand,

    $$ \hat\gamma_{j,LIML} = [W_{jj}^0 - \lambda_1W_{jj}^1]^{-1}(w_j^0 - \lambda_1w_j^1) \tag{15-26} $$

    and

    $$ \hat\beta_{j,LIML} = [X_j'X_j]^{-1}X_j'(y_j - Y_j\hat\gamma_{j,LIML}). $$

    Note that $\beta_j$ is estimated by a simple least squares regression. [See (3-18).] The asymptotic covariance matrix for the LIML estimator is identical to that for the 2SLS estimator.[17] The implication is that with normally distributed disturbances, 2SLS is fully efficient.

    The k class of estimators is defined by the following form

    $$ \hat\delta_{j,k} = \begin{bmatrix} Y_j'Y_j - k\hat V_j'\hat V_j & Y_j'X_j \\ X_j'Y_j & X_j'X_j \end{bmatrix}^{-1} \begin{bmatrix} Y_j'y_j - k\hat V_j'\hat v_j \\ X_j'y_j \end{bmatrix}. $$

    [Here $\hat V_j = Y_j - \hat Y_j$ and $\hat v_j = y_j - \hat y_j$ denote the least squares residuals from the reduced-form regressions on $X$.] We have already considered three members of the class, OLS with $k = 0$, 2SLS with $k = 1$, and, it can be shown, LIML with $k = \lambda_1$. [This last result follows from (15-26).] There have been many other k-class estimators derived; Davidson and MacKinnon (1993, pp. 649-651) and Mariano (2001) give discussion. It has been shown that all members of the k class for which $k$ converges to 1 at a rate faster than $1/\sqrt{n}$ have the same asymptotic distribution as that of the 2SLS estimator that we examined earlier. These are largely of theoretical interest, given the pervasive use of 2SLS or OLS, save for an important consideration. The large-sample properties of all k-class estimators are the same, but the finite-sample properties are possibly very different. Davidson and MacKinnon (1993) and Mariano (1982, 2001) suggest that some evidence favors LIML when the sample size is small or moderate and the number of overidentifying restrictions is relatively large.

    [16] The least variance ratio estimator is derived in Johnston (1984). The LIML estimator was derived by Anderson and Rubin (1949, 1950).
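The relationship among OLS, 2SLS, and LIML as k-class members can be checked numerically. The sketch below is illustrative: the function names (`residual_maker`, `liml_lambda`, `k_class`) and the simulated overidentified equation are hypothetical, and the explicit $T \times T$ residual-maker matrix is used for clarity, which is only sensible for small samples.

```python
import numpy as np

def residual_maker(X):
    """M = I - X (X'X)^{-1} X'."""
    return np.eye(X.shape[0]) - X @ np.linalg.solve(X.T @ X, X.T)

def liml_lambda(y, Y, Xj, X):
    """Smallest characteristic root of (W1)^{-1} W0, as in (15-25)."""
    Y0 = np.column_stack([y, Y])
    W0 = Y0.T @ residual_maker(Xj) @ Y0      # residuals on included exogenous variables
    W1 = Y0.T @ residual_maker(X) @ Y0       # residuals on all exogenous variables
    roots = np.linalg.eigvals(np.linalg.solve(W1, W0))
    return float(np.min(roots.real))

def k_class(y, Y, Xj, X, k):
    """k-class estimator of [gamma, beta]: k=0 is OLS, k=1 is 2SLS."""
    Z = np.column_stack([Y, Xj])
    M = residual_maker(X)
    # M annihilates Xj, so Z'MZ is nonzero only in the block for Y (V'V).
    A = Z.T @ Z - k * (Z.T @ M @ Z)
    b = Z.T @ y - k * (Z.T @ M @ y)
    return np.linalg.solve(A, b)

# Hypothetical overidentified equation: one endogenous regressor,
# two included and two excluded exogenous variables.
rng = np.random.default_rng(7)
T = 300
X = np.column_stack([np.ones(T), rng.normal(size=(T, 3))])
Xj = X[:, :2]
v = rng.normal(size=T)
eps = 0.5 * v + rng.normal(size=T)
Y = (X @ np.array([0.5, 1.0, 1.0, -1.0]) + v)[:, None]
y = 0.7 * Y[:, 0] + Xj @ np.array([1.0, 2.0]) + eps

lam = liml_lambda(y, Y, Xj, X)
d_ols = k_class(y, Y, Xj, X, 0.0)
d_2sls = k_class(y, Y, Xj, X, 1.0)
d_liml = k_class(y, Y, Xj, X, lam)
print(lam, d_ols, d_2sls, d_liml)
```

Evaluating the k-class formula at $k = \lambda_1$ gives the same $\hat\gamma_j$ as (15-26), which is one way to confirm an implementation of the LIML computation.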
    15 5 6 TWO STAGE LEAST SQUARES IN MODELS THAT ARE NONLINEAR IN VARIABLES

    The analysis of simultaneous equations becomes considerably more complicated when the equations are nonlinear Amemiya presents a general treatment of nonlinear models 18 A case that is broad enough to include many practical applications is the one analyzed by Kelejian 1971 y j 1 j f1 j y x 2 j f2 j y x X j j j 19 which is an extension of 7 4 Ordinary least squares will be inconsistent for the same reasons as before but an IV estimator if one can be devised should have the familiar properties Because of the nonlinearity it may not be possible to solve for the reducedform equations assuming that they exist hi j x E fi j x Kelejian shows that 2SLS based on a Taylor series approximation to hi j using the linear terms higher powers and cross products of the variables in x will be consistent The analysis of 2SLS presented earlier then applies to the Z j consisting of f1 j f2 j X j The alternative approach of using tted values for y appears to be inconsistent See Kelejian 1971 and Goldfeld and Quandt 1968 In a linear model if an equation fails the order condition then it cannot be estimated by 2SLS This statement is not true of Kelejian s approach however since taking higher powers of the regressors creates many more linearly independent instrumental variables If an equation in a linear model fails the rank condition but not the order
    [17] This is proved by showing that both estimators are members of the k class of estimators, all of which have the same asymptotic covariance matrix. Details are given in Theil (1971) and Schmidt (1976).
    [18] Amemiya (1985, pp. 245-265). See as well Wooldridge (2002, ch. 9).
    [19] 2SLS for models that are nonlinear in the parameters is discussed in Chapters 10 and 11 in connection with GMM estimators.


    condition, then the 2SLS estimates can be computed in a finite sample but will fail to exist asymptotically because $X\Pi_j$ will have short rank. Unfortunately, to the extent that Kelejian's approximation never exactly equals the true reduced form unless it happens to be the polynomial in $x$ (unlikely), this built-in control need not be present, even asymptotically. Thus, although the model in Example 15.7 (below) is unidentified, computation of Kelejian's 2SLS estimator appears to be routine.
    Example 15.7 A Nonlinear Model of Industry Structure

    The following model of industry structure and performance was estimated by Strickland and Weiss (1976). Note that the square of the endogenous variable, $C$, appears in the first equation.

    $$A = \alpha_0 + \alpha_1 M + \alpha_2 Cd + \alpha_3 C + \alpha_4 C^2 + \alpha_5 Gr + \alpha_6 D + \varepsilon_1,$$
    $$C = \beta_0 + \beta_1 A + \beta_2 MES + \varepsilon_2,$$
    $$M = \gamma_0 + \gamma_1 K + \gamma_2 Gr + \gamma_3 C + \gamma_4 Gd + \gamma_5 A + \gamma_6 MES + \varepsilon_3.$$

    S   = industry sales            M  = price-cost margin
    A   = advertising / S           D  = durable goods industry (0/1)
    C   = concentration             Gr = industry growth rate
    Cd  = consumer demand / S       K  = capital stock / S
    MES = efficient scale / S       Gd = geographic dispersion

    Since the only restrictions are exclusions, we may check identification by the rule $\operatorname{rank}[A_3', A_5'] = M - 1$ discussed in Section 15.3.1. Identification of the first equation requires

    $$[A_3', A_5'] = \begin{bmatrix} \beta_2 & 0 & 0 \\ \gamma_6 & \gamma_1 & \gamma_4 \end{bmatrix}$$

    to have rank two, which it does unless $\beta_2 = 0$. Thus, the first equation is identified by the presence of the scale variable in the second equation. It is easily seen that the second equation is overidentified. But for the third,

    $$[A_3', A_5'] = \begin{bmatrix} \alpha_4 & \alpha_2 & \alpha_6 \\ 0 & 0 & 0 \end{bmatrix},$$

    which has rank one, not two. The third equation is not identified. It passes the order condition but fails the rank condition. The failure of the third equation is obvious on inspection: there is no variable in the second equation that is not in the third. Nonetheless, it was possible to obtain two-stage least squares estimates because of the nonlinearity of the model and the results discussed above.

    15.6 SYSTEM METHODS OF ESTIMATION

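    We may formulate the full system of equations as

    $$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} = \begin{bmatrix} Z_1 & 0 & \cdots & 0 \\ 0 & Z_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & Z_M \end{bmatrix} \begin{bmatrix} \delta_1 \\ \delta_2 \\ \vdots \\ \delta_M \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_M \end{bmatrix}$$

    or

    $$y = Z\delta + \varepsilon, \tag{15-27}$$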


    where

    $$E[\varepsilon \mid X] = 0 \quad \text{and} \quad E[\varepsilon\varepsilon' \mid X] = \bar\Sigma = \Sigma \otimes I \tag{15-28}$$

    [see (14-3)]. The least squares estimator, $d = [Z'Z]^{-1}Z'y$, is equation-by-equation ordinary least squares and is inconsistent. But even if ordinary least squares were consistent, we know from our results for the seemingly unrelated regressions model in the previous chapter that it would be inefficient compared with an estimator that makes use of the cross-equation correlations of the disturbances. For the first issue, we turn once again to an IV estimator. For the second, as we did in Chapter 14, we use a generalized least squares approach. Thus, assuming that the matrix of instrumental variables, $\bar W$, satisfies the requirements for an IV estimator, a consistent though inefficient estimator would be

    $$\hat\delta_{IV} = [\bar W'Z]^{-1}\bar W'y. \tag{15-29}$$

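    Analogous to the seemingly unrelated regressions model, a more efficient estimator would be based on the generalized least squares principle,

    $$\hat\delta_{IV,GLS} = [\bar W'(\Sigma^{-1} \otimes I)Z]^{-1}\bar W'(\Sigma^{-1} \otimes I)y \tag{15-30}$$

    or, where $W_j$ is the set of instrumental variables for the jth equation,

    $$\hat\delta_{IV,GLS} = \begin{bmatrix} \sigma^{11}W_1'Z_1 & \sigma^{12}W_1'Z_2 & \cdots & \sigma^{1M}W_1'Z_M \\ \sigma^{21}W_2'Z_1 & \sigma^{22}W_2'Z_2 & \cdots & \sigma^{2M}W_2'Z_M \\ \vdots & \vdots & & \vdots \\ \sigma^{M1}W_M'Z_1 & \sigma^{M2}W_M'Z_2 & \cdots & \sigma^{MM}W_M'Z_M \end{bmatrix}^{-1} \begin{bmatrix} \sum_{j=1}^{M}\sigma^{1j}W_1'y_j \\ \sum_{j=1}^{M}\sigma^{2j}W_2'y_j \\ \vdots \\ \sum_{j=1}^{M}\sigma^{Mj}W_M'y_j \end{bmatrix}.$$

    Three techniques are generally used for joint estimation of the entire system of equations: three-stage least squares, GMM, and full-information maximum likelihood.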
    15.6.1 THREE-STAGE LEAST SQUARES

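    Consider the IV estimator formed from

    $$\bar W = \hat Z = \operatorname{diag}[X(X'X)^{-1}X'Z_1, \ldots, X(X'X)^{-1}X'Z_M] = \begin{bmatrix} \hat Z_1 & 0 & \cdots & 0 \\ 0 & \hat Z_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \hat Z_M \end{bmatrix}.$$

    The IV estimator, $\hat\delta_{IV} = [\hat Z'Z]^{-1}\hat Z'y$, is simply equation-by-equation 2SLS. We have already established the consistency of 2SLS. By analogy to the seemingly unrelated regressions model of Chapter 14, however, we would expect this estimator to be less efficient than a GLS estimator. A natural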


    candidate would be

    $$\hat\delta_{3SLS} = [\hat Z'(\Sigma^{-1} \otimes I)Z]^{-1}\hat Z'(\Sigma^{-1} \otimes I)y.$$

    For this estimator to be a valid IV estimator, we must establish that

    $$\operatorname{plim}\frac{1}{T}\hat Z'(\Sigma^{-1} \otimes I)\varepsilon = 0,$$

    which is $M$ sets of equations, each one of the form

    $$\operatorname{plim}\frac{1}{T}\sum_{j=1}^{M}\sigma^{ij}\hat Z_i'\varepsilon_j = 0.$$

    Each is the sum of vectors all of which converge to zero, as we saw in the development of the 2SLS estimator. The second requirement, that

    $$\operatorname{plim}\frac{1}{T}\hat Z'(\Sigma^{-1} \otimes I)Z \neq 0,$$

    and that the matrix be nonsingular, can be established along the lines of its counterpart for 2SLS. Identification of every equation by the rank condition is sufficient. [But see Mariano (2001) on the subject of "weak instruments."] Once again using the idempotency of $I - M$, we may also interpret this estimator as a GLS estimator of the form

    $$\hat\delta_{3SLS} = [\hat Z'(\Sigma^{-1} \otimes I)\hat Z]^{-1}\hat Z'(\Sigma^{-1} \otimes I)y. \tag{15-31}$$

    The appropriate asymptotic covariance matrix for the estimator is

    $$\text{Asy. Var}[\hat\delta_{3SLS}] = [\bar Z'(\Sigma^{-1} \otimes I)\bar Z]^{-1}, \tag{15-32}$$

    where $\bar Z = \operatorname{diag}[X\Pi_j, X_j]$. This matrix would be estimated with the bracketed inverse matrix in (15-31). Using sample data, we find that $\bar Z$ may be estimated with $\hat Z$. The remaining difficulty is to obtain an estimate of $\Sigma$. In estimation of the multivariate regression model, for efficient estimation (that remains to be shown), any consistent estimator of $\Sigma$ will do. The designers of the 3SLS method, Zellner and Theil (1962), suggest the natural choice arising out of the two-stage least squares estimates. The three-stage least squares (3SLS) estimator is thus defined as follows:

    1. Estimate $\Pi$ by ordinary least squares and compute $\hat Y_j$ for each equation.
    2. Compute $\hat\delta_{j,2SLS}$ for each equation; then

    $$\hat\sigma_{ij} = \frac{(y_i - Z_i\hat\delta_i)'(y_j - Z_j\hat\delta_j)}{T}. \tag{15-33}$$

    3. Compute the GLS estimator according to (15-31) and an estimate of the asymptotic covariance matrix according to (15-32) using $\hat Z$ and $\hat\Sigma$.

    It is also possible to iterate the 3SLS computation. Unlike the seemingly unrelated regressions estimator, however, this method does not provide the maximum likelihood estimator, nor does it improve the asymptotic efficiency.[20]

    [20] A Jacobian term needed to maximize the log-likelihood is not treated by the 3SLS estimator. See Dhrymes (1973).
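    The three steps that define the 3SLS estimator can be sketched in a few lines of NumPy. This is only an illustrative sketch, not a production routine: the function name `three_sls` and its argument layout (a list of dependent variables, a list of right-hand-side matrices $Z_j$, and the full matrix of exogenous variables $X$) are assumptions made for the example. The returned covariance matrix is the bracketed inverse in (15-31).

```python
import numpy as np

def three_sls(y_list, Z_list, X):
    """Hypothetical helper: 3SLS for M equations.

    y_list[j] is (T,), Z_list[j] is (T, k_j) holding [Y_j, X_j],
    X is (T, K) holding all exogenous variables in the system.
    """
    T = X.shape[0]
    M = len(y_list)
    # Step 1: fitted values Z_hat_j = X (X'X)^{-1} X' Z_j (no T x T matrix is formed).
    Zhat = [X @ np.linalg.solve(X.T @ X, X.T @ Z) for Z in Z_list]
    # Step 2: equation-by-equation 2SLS, then sigma_hat_ij from the 2SLS residuals.
    d2sls = [np.linalg.solve(Zh.T @ Z, Zh.T @ y)
             for Zh, Z, y in zip(Zhat, Z_list, y_list)]
    resid = np.column_stack([y - Z @ d for y, Z, d in zip(y_list, Z_list, d2sls)])
    sigma_inv = np.linalg.inv(resid.T @ resid / T)
    # Step 3: GLS on the stacked system, built block by block.
    k = [Z.shape[1] for Z in Z_list]
    offs = np.concatenate(([0], np.cumsum(k)))
    A = np.zeros((offs[-1], offs[-1]))
    b = np.zeros(offs[-1])
    for i in range(M):
        for j in range(M):
            A[offs[i]:offs[i + 1], offs[j]:offs[j + 1]] = sigma_inv[i, j] * (Zhat[i].T @ Zhat[j])
            b[offs[i]:offs[i + 1]] += sigma_inv[i, j] * (Zhat[i].T @ y_list[j])
    cov = np.linalg.inv(A)  # bracketed inverse in (15-31)
    return cov @ b, cov, d2sls
```

    The second return value is the estimate of the asymptotic covariance matrix; the third returns the first-round 2SLS estimates so that the two sets of results can be compared.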


    By showing that the 3SLS estimator satisfies the requirements for an IV estimator, we have established its consistency. The question of asymptotic efficiency remains. It can be shown that among all IV estimators that use only the sample information embodied in the system, 3SLS is asymptotically efficient.[21] For normally distributed disturbances, it can also be shown that 3SLS has the same asymptotic distribution as the full-information maximum likelihood estimator, which is asymptotically efficient among all estimators. A direct proof based on the information matrix is possible, but we shall take a much simpler route by exploiting a handy result due to Hausman in the next section.
    15.6.2 FULL-INFORMATION MAXIMUM LIKELIHOOD

    Because of their simplicity and asymptotic efficiency, 2SLS and 3SLS are used almost exclusively (when ordinary least squares is not used) for the estimation of simultaneous-equations models. Nonetheless, it is occasionally useful to obtain maximum likelihood estimates directly. The full-information maximum likelihood (FIML) estimator is based on the entire system of equations. With normally distributed disturbances, FIML is efficient among all estimators. The FIML estimator treats all equations and all parameters jointly. To formulate the appropriate log-likelihood function, we begin with the reduced form,

    $$Y = X\Pi + V,$$

    where each row of $V$ is assumed to be multivariate normally distributed, with $E[v_t \mid X] = 0$ and covariance matrix $E[v_t v_t' \mid X] = \Omega$. The log-likelihood for this model is precisely that of the seemingly unrelated regressions model of Chapter 14. For the moment, we can ignore the relationship between the structural and reduced-form parameters. Thus, from (14-20),

    $$\ln L = -\frac{T}{2}\left[M\ln(2\pi) + \ln|\Omega| + \operatorname{tr}(\Omega^{-1}W)\right],$$

    where

    $$W_{ij} = \frac{1}{T}(y_i - X\pi_i^0)'(y_j - X\pi_j^0)$$

    and $\pi_j^0$ is the jth column of $\Pi$.

    This function is to be maximized subject to all the restrictions imposed by the structure. Make the substitutions $\Pi = -B\Gamma^{-1}$ and $\Omega = (\Gamma^{-1})'\Sigma\Gamma^{-1}$ so that $\Omega^{-1} = \Gamma\Sigma^{-1}\Gamma'$. Thus,

    $$\ln L = -\frac{T}{2}\left[M\ln(2\pi) + \ln\left|(\Gamma^{-1})'\Sigma\Gamma^{-1}\right| + \operatorname{tr}\left\{\frac{1}{T}\left[\Gamma\Sigma^{-1}\Gamma'(Y + XB\Gamma^{-1})'(Y + XB\Gamma^{-1})\right]\right\}\right],$$

    which can be simplified. First,

    $$-\frac{T}{2}\ln\left|(\Gamma^{-1})'\Sigma\Gamma^{-1}\right| = -\frac{T}{2}\ln|\Sigma| + T\ln|\Gamma|.$$

    [21] See Schmidt (1976) for a proof of its efficiency relative to 2SLS.


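    Second, $\Gamma'(Y + XB\Gamma^{-1})' = \Gamma'Y' + B'X'$. By permuting $\Gamma$ from the beginning to the end of the trace and collecting terms,

    $$\operatorname{tr}(\Omega^{-1}W) = \operatorname{tr}\left[\frac{\Sigma^{-1}(Y\Gamma + XB)'(Y\Gamma + XB)}{T}\right].$$

    Therefore, the log-likelihood is

    $$\ln L = -\frac{T}{2}\left[M\ln(2\pi) - 2\ln|\Gamma| + \operatorname{tr}(\Sigma^{-1}S) + \ln|\Sigma|\right],$$

    where

    $$s_{ij} = \frac{1}{T}(Y\Gamma_i + XB_i)'(Y\Gamma_j + XB_j).$$

    [In terms of nonzero parameters, $s_{ij}$ is $\hat\sigma_{ij}$ of (15-32).] In maximizing $\ln L$, it is necessary to impose all the additional restrictions on the structure. The trace may be written in the form

    $$\operatorname{tr}(\Sigma^{-1}S) = \frac{\sum_{i=1}^{M}\sum_{j=1}^{M}\sigma^{ij}(y_i - Y_i\gamma_i - X_i\beta_i)'(y_j - Y_j\gamma_j - X_j\beta_j)}{T}. \tag{15-34}$$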
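    The equivalence of the reduced-form and structural expressions for $\ln L$ can be checked numerically. The sketch below is illustrative only; the function names `loglik_reduced` and `loglik_structural` are hypothetical, the structure is written as $Y\Gamma + XB = E$ so that $\Pi = -B\Gamma^{-1}$, and $|\Gamma|$ is taken as the absolute value of the determinant. No exclusion restrictions are imposed here, so this evaluates the function rather than maximizing it.

```python
import numpy as np

def loglik_reduced(Y, X, Gamma, B, Sigma):
    """ln L written in terms of the reduced form Y = X Pi + V."""
    T, M = Y.shape
    G_inv = np.linalg.inv(Gamma)
    Pi = -B @ G_inv                      # Pi = -B Gamma^{-1}
    Omega = G_inv.T @ Sigma @ G_inv      # Omega = (Gamma^{-1})' Sigma Gamma^{-1}
    V = Y - X @ Pi
    W = V.T @ V / T
    _, logdet_omega = np.linalg.slogdet(Omega)
    trace_term = np.trace(np.linalg.solve(Omega, W))
    return -0.5 * T * (M * np.log(2 * np.pi) + logdet_omega + trace_term)

def loglik_structural(Y, X, Gamma, B, Sigma):
    """ln L written in terms of the structural parameters."""
    T, M = Y.shape
    E = Y @ Gamma + X @ B                # structural disturbances
    S = E.T @ E / T
    _, logabsdet_gamma = np.linalg.slogdet(Gamma)
    _, logdet_sigma = np.linalg.slogdet(Sigma)
    trace_term = np.trace(np.linalg.solve(Sigma, S))
    return -0.5 * T * (M * np.log(2 * np.pi) - 2 * logabsdet_gamma
                       + trace_term + logdet_sigma)
```

    For any conformable data and any nonsingular $\Gamma$ and positive definite $\Sigma$, the two functions should return the same value up to rounding error, which is the content of the simplification above.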

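    Maximizing $\ln L$ subject to the exclusions in (15-34) and any other restrictions, if necessary, produces the FIML estimator. This has all the desirable asymptotic properties of maximum likelihood estimators and, therefore, is asymptotically efficient among estimators of the simultaneous-equations model. The asymptotic covariance matrix for the FIML estimator is the same as that for the 3SLS estimator. A useful interpretation of the FIML estimator is provided by Dhrymes (1973, p. 360) and Hausman (1975, 1983). They show that the FIML estimator of $\delta$ is a fixed point in the equation

    $$\hat\delta_{FIML} = [\hat Z(\hat\delta)'(\hat\Sigma^{-1} \otimes I)Z]^{-1}\hat Z(\hat\delta)'(\hat\Sigma^{-1} \otimes I)y = [\tilde Z'Z]^{-1}\tilde Z'y,$$

    where

    $$\tilde Z' = \hat Z(\hat\delta)'(\hat\Sigma^{-1} \otimes I) = \begin{bmatrix} \hat\sigma^{11}\hat Z_1' & \hat\sigma^{12}\hat Z_1' & \cdots & \hat\sigma^{1M}\hat Z_1' \\ \hat\sigma^{12}\hat Z_2' & \hat\sigma^{22}\hat Z_2' & \cdots & \hat\sigma^{2M}\hat Z_2' \\ \vdots & \vdots & & \vdots \\ \hat\sigma^{1M}\hat Z_M' & \hat\sigma^{2M}\hat Z_M' & \cdots & \hat\sigma^{MM}\hat Z_M' \end{bmatrix}$$

    and $\hat Z_j = [X\hat\Pi_j, X_j]$. $\hat\Pi$ is computed from the structural estimates: $\hat\Pi_j$ is $M_j$ columns of $-\hat B\hat\Gamma^{-1}$,

    $$\hat\sigma_{ij} = \frac{1}{T}(y_i - Z_i\hat\delta_i)'(y_j - Z_j\hat\delta_j), \quad \text{and} \quad \hat\sigma^{ij} = (\hat\Sigma^{-1})_{ij}.$$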


    This result implies that the FIML estimator is also an IV estimator. The asymptotic covariance matrix for the FIML estimator follows directly from its form as an IV estimator. Since this matrix is the same as that of the 3SLS estimator, we conclude that with normally distributed disturbances, 3SLS has the same asymptotic distribution as maximum likelihood. The practical usefulness of this important result has not gone unnoticed by practitioners. The 3SLS estimator is far easier to compute than the FIML estimator. The benefit in computational cost comes at no cost in asymptotic efficiency. As always, the small-sample properties remain ambiguous, but by and large, where a systems estimator is used, 3SLS dominates FIML nonetheless.[22] (One reservation arises from the fact that the 3SLS estimator is robust to nonnormality whereas, because of the term $\ln|\Gamma|$ in the log-likelihood, the FIML estimator is not. In fact, the 3SLS and FIML estimators are usually quite different numerically.)
    15.6.3 GMM ESTIMATION

    The GMM estimator for a system of equations is described in Section 14.4.3. As in the single-equation case, a minor change in notation produces the estimators of this chapter. As before, we will consider the case of unknown heteroscedasticity only. The extension to autocorrelation is quite complicated. [See Cumby, Huizinga, and Obstfeld (1983).] The orthogonality conditions defined in (14-46) are

    $$E[x_t\varepsilon_{jt}] = E[x_t(y_{jt} - z_{jt}'\delta_j)] = 0.$$

    If we consider all the equations jointly, then we obtain the criterion for estimation of all the model's parameters,

    $$q = \sum_{j=1}^{M}\sum_{l=1}^{M}\left[\frac{e(z_t, \delta_j)'X}{T}\right][W]^{jl}\left[\frac{X'e(z_t, \delta_l)}{T}\right] = \sum_{j=1}^{M}\sum_{l=1}^{M}\bar m(\delta_j)'[W]^{jl}\bar m(\delta_l),$$

    where

    $$\bar m(\delta_j) = \frac{1}{T}\sum_{t=1}^{T}x_t(y_{jt} - z_{jt}'\delta_j)$$

    and $[W]^{jl}$ is block $jl$ of the weighting matrix, $W^{-1}$.

    As before, we consider the optimal weighting matrix obtained as the asymptotic covariance matrix of the empirical moments, $\bar m(\delta_j)$. These moments are stacked in a single vector $\bar m(\delta)$. Then, the $jl$th block of $\text{Asy. Var}[\sqrt{T}\,\bar m(\delta)]$ is

    $$\Phi_{jl} = \operatorname{plim}\left\{\frac{1}{T}\sum_{t=1}^{T}\left[x_t x_t'(y_{jt} - z_{jt}'\delta_j)(y_{lt} - z_{lt}'\delta_l)\right]\right\} = \operatorname{plim}\left(\frac{1}{T}\sum_{t=1}^{T}\omega_{jl,t}\,x_t x_t'\right).$$



    [22] PC-GIVE(8), SAS, and TSP(4.2) are three computer programs that are widely used. A survey is given in Silk (1996).


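    If the disturbances are homoscedastic, then $\Phi_{jl} = \sigma_{jl}[\operatorname{plim}(X'X/T)]$ is produced. Otherwise, we obtain a matrix of the form $\Phi_{jl} = \operatorname{plim}[X'\Omega_{jl}X/T]$. Collecting terms, then, the criterion function for GMM estimation is

    $$q = \begin{bmatrix} [X'(y_1 - Z_1\delta_1)]/T \\ [X'(y_2 - Z_2\delta_2)]/T \\ \vdots \\ [X'(y_M - Z_M\delta_M)]/T \end{bmatrix}' \begin{bmatrix} \Phi_{11} & \Phi_{12} & \cdots & \Phi_{1M} \\ \Phi_{21} & \Phi_{22} & \cdots & \Phi_{2M} \\ \vdots & \vdots & & \vdots \\ \Phi_{M1} & \Phi_{M2} & \cdots & \Phi_{MM} \end{bmatrix}^{-1} \begin{bmatrix} [X'(y_1 - Z_1\delta_1)]/T \\ [X'(y_2 - Z_2\delta_2)]/T \\ \vdots \\ [X'(y_M - Z_M\delta_M)]/T \end{bmatrix}.$$

    For implementation, $\Phi_{jl}$ can be estimated with

    $$\hat\Phi_{jl} = \frac{1}{T}\sum_{t=1}^{T}x_t x_t'(y_{jt} - z_{jt}'d_j)(y_{lt} - z_{lt}'d_l),$$

    where $d_j$ is a consistent estimator of $\delta_j$. The two-stage least squares estimator is a natural choice. For the diagonal blocks, this choice is the White estimator as usual. For the off-diagonal blocks, it is a simple extension. With this result in hand, the first-order conditions for GMM estimation are

    $$\frac{\partial\hat q}{\partial\delta_j} = -2\sum_{l=1}^{M}\left(\frac{Z_j'X}{T}\right)\hat\Phi^{jl}\left[\frac{X'(y_l - Z_l\delta_l)}{T}\right] = 0,$$

    where $\hat\Phi^{jl}$ is the $jl$th block in the inverse of the estimate of the center matrix in $q$. The solution is

    $$\begin{bmatrix} \hat\delta_{1,GMM} \\ \hat\delta_{2,GMM} \\ \vdots \\ \hat\delta_{M,GMM} \end{bmatrix} = \begin{bmatrix} Z_1'X\hat\Phi^{11}X'Z_1 & Z_1'X\hat\Phi^{12}X'Z_2 & \cdots & Z_1'X\hat\Phi^{1M}X'Z_M \\ Z_2'X\hat\Phi^{21}X'Z_1 & Z_2'X\hat\Phi^{22}X'Z_2 & \cdots & Z_2'X\hat\Phi^{2M}X'Z_M \\ \vdots & \vdots & & \vdots \\ Z_M'X\hat\Phi^{M1}X'Z_1 & Z_M'X\hat\Phi^{M2}X'Z_2 & \cdots & Z_M'X\hat\Phi^{MM}X'Z_M \end{bmatrix}^{-1} \begin{bmatrix} \sum_{j=1}^{M}Z_1'X\hat\Phi^{1j}X'y_j \\ \sum_{j=1}^{M}Z_2'X\hat\Phi^{2j}X'y_j \\ \vdots \\ \sum_{j=1}^{M}Z_M'X\hat\Phi^{Mj}X'y_j \end{bmatrix}.$$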

    The asymptotic covariance matrix for the estimator would be estimated with $T$ times the large inverse matrix in brackets. Several of the estimators we have already considered are special cases:

    - If $\hat\Phi_{jj} = \hat\sigma_{jj}(X'X/T)$ and $\hat\Phi_{jl} = 0$ for $j \neq l$, then $\hat\delta_j$ is 2SLS.
    - If $\hat\Phi_{jl} = 0$ for $j \neq l$, then $\hat\delta_j$ is H2SLS, the single-equation GMM estimator.
    - If $\hat\Phi_{jl} = \hat\sigma_{jl}(X'X/T)$, then $\hat\delta_j$ is 3SLS.

    As before, the GMM estimator brings efficiency gains in the presence of heteroscedasticity. If the disturbances are homoscedastic, then it is asymptotically the same as 3SLS, although in a finite sample it will differ numerically because $S_{jl}$ will not be identical to $\hat\sigma_{jl}(X'X)$.

    15.6.4 RECURSIVE SYSTEMS AND EXACTLY IDENTIFIED EQUATIONS

    Finally, there are two special cases worth noting. First, for the fully recursive model,

    1. $\Gamma$ is upper triangular, with ones on the diagonal. Therefore, $|\Gamma| = 1$ and $\ln|\Gamma| = 0$.
    2. $\Sigma$ is diagonal, so $\ln|\Sigma| = \sum_{j=1}^{M}\ln\sigma_{jj}$ and the trace in the exponent becomes

    $$\operatorname{tr}(\Sigma^{-1}S) = \sum_{j=1}^{M}\frac{1}{\sigma_{jj}}\,\frac{1}{T}(y_j - Y_j\gamma_j - X_j\beta_j)'(y_j - Y_j\gamma_j - X_j\beta_j).$$

    The log-likelihood reduces to $\ln L = \sum_{j=1}^{M}\ln L_j$, where

    $$\ln L_j = -\frac{T}{2}\left[\ln(2\pi) + \ln\sigma_{jj}\right] - \frac{1}{2\sigma_{jj}}(y_j - Y_j\gamma_j - X_j\beta_j)'(y_j - Y_j\gamma_j - X_j\beta_j).$$

    Therefore, the FIML estimator for this model is just equation-by-equation least squares. We found earlier that ordinary least squares was consistent in this setting. We now find that it is asymptotically efficient as well.

    The second interesting special case occurs when every equation is exactly identified. In this case, $K_j^* = M_j$ in every equation. It is straightforward to show that in this case, 2SLS = 3SLS = LIML = FIML, and $\hat\delta_j = [X'Z_j]^{-1}X'y_j$.
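    The exactly identified case is easy to verify numerically for 2SLS: when $Z_j$ has as many columns as $X$, the 2SLS formula collapses to the simple IV expression $[X'Z_j]^{-1}X'y_j$. The sketch below is illustrative; the helper names `two_sls` and `simple_iv` and the simulated data are assumptions for the example, not part of the text.

```python
import numpy as np

def two_sls(y, Z, X):
    """2SLS: regress y on the fitted values Z_hat = X (X'X)^{-1} X' Z."""
    Z_hat = X @ np.linalg.solve(X.T @ X, X.T @ Z)
    return np.linalg.solve(Z_hat.T @ Z, Z_hat.T @ y)

def simple_iv(y, Z, X):
    """Simple IV estimator [X'Z]^{-1} X'y; requires X'Z to be square."""
    return np.linalg.solve(X.T @ Z, X.T @ y)

# Simulated exactly identified equation: one endogenous regressor y2,
# two included exogenous variables, one excluded exogenous variable.
rng = np.random.default_rng(0)
T = 200
X = rng.normal(size=(T, 3))
u = rng.normal(size=T)
y2 = X[:, 2] + 0.5 * X[:, 0] + 0.8 * u + rng.normal(size=T)
y1 = 0.7 * y2 + X[:, 0] - X[:, 1] + u
Z = np.column_stack([y2, X[:, 0], X[:, 1]])   # K = 3 columns, same as X

print(two_sls(y1, Z, X))
print(simple_iv(y1, Z, X))
```

    The two printed vectors should agree to rounding error. The sketch covers only the 2SLS part of the equality; it does not compute LIML, 3SLS, or FIML.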

    15.7 COMPARISON OF METHODS: KLEIN'S MODEL I

    The preceding has described a large number of estimators for simultaneous-equations models. As an example, Table 15.3 presents limited- and full-information estimates for Klein's Model I based on the original data for 1921 and 1941. The H3SLS estimates for the system were computed in two pairs, (C, I) and (C, Wp), because there were insufficient observations to fit the system as a whole. The first of these are reported for the C equation.[23] It might seem, in light of the entire discussion, that one of the structural estimators described previously should always be preferred to ordinary least squares, which, alone among the estimators considered here, is inconsistent. Unfortunately, the issue is not so clear. First, it is often found that the OLS estimator is surprisingly close to the structural estimator. It can be shown that, at least in some cases, OLS has a smaller variance about its mean than does 2SLS about its mean, leading to the possibility that OLS might be more precise in a mean-squared-error sense.[24] But this result must be tempered by the finding that the OLS standard errors are, in all likelihood, not useful for inference purposes.[25] Nonetheless, OLS is a frequently used estimator. Obviously, this discussion
    [23] The asymptotic covariance matrix for the LIML estimator will differ from that for the 2SLS estimator in a finite sample because the estimator of $\sigma_{jj}$ that multiplies the inverse matrix will differ and because, in computing the matrix to be inverted, the value of "k" (see the equation after (15-26)) is one for 2SLS and the smallest root in (15-25) for LIML. Asymptotically, k equals one and the estimators of $\sigma_{jj}$ are equivalent.
    [24] See Goldberger (1964, pp. 359-360).
    [25] Cragg (1967).


    TABLE 15.3 Estimates of Klein's Model I (Estimated Asymptotic Standard Errors in Parentheses)

    Limited-Information Estimates

    Estimator    Eq.   Constant         Coefficient 1     Coefficient 2     Coefficient 3
    2SLS         C     16.6  (1.32)      0.017 (0.118)     0.216 (0.107)     0.810 (0.040)
                 I     20.3  (7.54)      0.150 (0.173)     0.616 (0.162)    -0.158 (0.036)
                 Wp     1.50 (1.15)      0.439 (0.036)     0.147 (0.039)     0.130 (0.029)
    LIML         C     17.1  (1.84)     -0.222 (0.202)     0.396 (0.174)     0.823 (0.055)
                 I     22.6  (9.24)      0.075 (0.219)     0.680 (0.203)    -0.168 (0.044)
                 Wp     1.53 (2.40)      0.434 (0.137)     0.151 (0.135)     0.132 (0.065)
    GMM (H2SLS)  C     14.3  (0.897)     0.090 (0.062)     0.143 (0.065)     0.864 (0.029)
                 I     23.5  (6.40)      0.146 (0.120)     0.591 (0.129)    -0.171 (0.031)
                 Wp     3.06 (0.64)      0.455 (0.028)     0.106 (0.030)     0.130 (0.022)
    OLS          C     16.2  (1.30)      0.193 (0.091)     0.090 (0.091)     0.796 (0.040)
                 I     10.1  (5.47)      0.480 (0.097)     0.333 (0.101)    -0.112 (0.027)
                 Wp     1.50 (1.27)      0.439 (0.032)     0.146 (0.037)     0.130 (0.032)

    Full-Information Estimates

    Estimator    Eq.   Constant         Coefficient 1     Coefficient 2     Coefficient 3
    3SLS         C     16.4  (1.30)      0.125 (0.108)     0.163 (0.100)     0.790 (0.033)
                 I     28.2  (6.79)     -0.013 (0.162)     0.756 (0.153)    -0.195 (0.038)
                 Wp     1.80 (1.12)      0.400 (0.032)     0.181 (0.034)     0.150 (0.028)
    FIML         C     18.3  (2.49)     -0.232 (0.312)     0.388 (0.217)     0.802 (0.036)
                 I     27.3  (7.94)     -0.801 (0.491)     1.052 (0.353)    -0.146 (0.30)
                 Wp     5.79 (1.80)      0.234 (0.049)     0.285 (0.045)     0.235 (0.035)
    GMM (H3SLS)  C     15.7  (0.951)     0.068 (0.091)     0.167 (0.080)     0.829 (0.033)
                 I     20.6  (4.89)      0.213 (0.087)     0.520 (0.099)    -0.157 (0.025)
                 Wp     2.09 (0.510)     0.446 (0.019)     0.131 (0.021)     0.112 (0.021)
    I3SLS        C     16.6  (1.22)      0.165 (0.096)     0.177 (0.090)     0.766 (0.035)
                 I     42.9  (10.6)     -0.356 (0.260)     1.01  (0.249)    -0.260 (0.051)
                 Wp     2.62 (1.20)      0.375 (0.031)     0.194 (0.032)     0.168 (0.029)

    is relevant only to finite samples. Asymptotically, 2SLS must dominate OLS, and in a correctly specified model, any full-information estimator must dominate any limited-information one. The finite-sample properties are of crucial importance. Most of what we know is asymptotic properties, but most applications are based on rather small or moderately sized samples. The large difference between the inconsistent OLS and the other estimates suggests the bias discussed earlier. On the other hand, the incorrect sign on the LIML and FIML estimate of the coefficient on $P$ and the even larger difference of the coefficient on $P_{-1}$ in the C equation are striking. Assuming that the equation is properly specified, these anomalies would likewise be attributed to finite-sample variation, because LIML and 2SLS are asymptotically equivalent. The GMM estimator is also striking. The estimated standard errors are noticeably smaller for all the coefficients. It should be noted, however, that this estimator is based on a presumption of heteroscedasticity when in this time series, there is little evidence of its presence. The results are broadly suggestive,


    but the appearance of having achieved something for nothing is deceiving. Our earlier results on the efficiency of 2SLS are intact. If there is heteroscedasticity, then 2SLS is no longer fully efficient, but, then again, neither is H2SLS. The latter is more efficient than the former in the presence of heteroscedasticity, but it is equivalent to 2SLS in its absence. Intuition would suggest that systems methods (3SLS, GMM, and FIML) are to be preferred to single-equation methods (2SLS and LIML). Indeed, since the advantage is so transparent, why would one ever choose a single-equation estimator? The proper analogy is to the use of single-equation OLS versus GLS in the SURE model of Chapter 14. An obvious practical consideration is the computational simplicity of the single-equation methods. But the current state of available software has all but eliminated this advantage. Although the systems methods are asymptotically better, they have two problems. First, any specification error in the structure of the model will be propagated throughout the system by 3SLS or FIML. The limited-information estimators will, by and large, confine a problem to the particular equation in which it appears. Second, in the same fashion as the SURE model, the finite-sample variation of the estimated covariance matrix is transmitted throughout the system. Thus, the finite-sample variance of 3SLS may well be as large as or larger than that of 2SLS. Although they are only single estimates, the results for Klein's Model I give a striking example. The upshot would appear to be that the advantage of the systems estimators in finite samples may be more modest than the asymptotic results would suggest. Monte Carlo studies of the issue have tended to reach the same conclusion.[26]

    15.8 SPECIFICATION TESTS

    In a strident criticism of structural estimation, Liu (1960) argued that all simultaneous-equations models of the economy were truly unidentified and that only reduced forms could be estimated. Although his criticisms may have been exaggerated (and never gained wide acceptance), modelers have been interested in testing the restrictions that overidentify an econometric model. The first procedure for testing the overidentifying restrictions in a model was developed by Anderson and Rubin (1950). Their likelihood ratio test statistic is a by-product of LIML estimation:

    $$LR = \chi^2[K_j^* - M_j] = T(\lambda_j - 1),$$

    where $\lambda_j$ is the root used to find the LIML estimator. [See (15-27).] The statistic has a limiting chi-squared distribution with degrees of freedom equal to the number of overidentifying restrictions. A large value is taken as evidence that there are exogenous variables in the model that have been inappropriately omitted from the equation being examined. If the equation is exactly identified, then $K_j^* - M_j = 0$, but at the same time, the root will be 1. An alternative based on the Lagrange multiplier principle was
    [26] See Cragg (1967) and the many related studies listed by Judge et al. (1985, pp. 646-653).


    proposed by Hausman (1983, p. 433). Operationally, the test requires only the calculation of $TR^2$, where the $R^2$ is the uncentered $R^2$ in the regression of $\hat\varepsilon_j = y_j - Z_j\hat\delta_j$ on all the predetermined variables in the model. The estimated parameters may be computed using 2SLS, LIML, or any other efficient limited-information estimator. The statistic has a limiting chi-squared distribution with $K_j^* - M_j$ degrees of freedom under the assumed specification of the model.

    Another specification error occurs if the variables assumed to be exogenous in the system are, in fact, correlated with the structural disturbances. Since all the asymptotic properties claimed earlier rest on this assumption, this specification error would be quite serious. Several authors have studied this issue.[27] The specification test devised by Hausman that we used in Section 5.5 in the errors-in-variables model provides a method of testing for exogeneity in a simultaneous-equations model. Suppose that the variable $x^e$ is in question. The test is based on the existence of two estimators, say $\hat\delta$ and $\hat\delta^*$, such that

    under $H_0$ ($x^e$ is exogenous), both $\hat\delta$ and $\hat\delta^*$ are consistent and $\hat\delta^*$ is asymptotically efficient;
    under $H_1$ ($x^e$ is endogenous), $\hat\delta$ is consistent, but $\hat\delta^*$ is inconsistent.

    Hausman bases his version of the test on $\hat\delta$ being the 2SLS estimator and $\hat\delta^*$ being the 3SLS estimator. A shortcoming of the procedure is that it requires an arbitrary choice of some equation that does not contain $x^e$ for the test. For instance, consider the exogeneity of $X_{-1}$ in the third equation of Klein's Model I. To apply this test, we must use one of the other two equations.

    A single-equation version of the test has been devised by Spencer and Berk (1981). We suppose that $x^e$ appears in equation $j$, so that

    $$y_j = Y_j\gamma_j + X_j\beta_j + x^e\theta + \varepsilon_j = [Y_j, X_j, x^e]\delta_j + \varepsilon_j.$$

    Then $\hat\delta^*$ is the 2SLS estimator, treating $x^e$ as an exogenous variable in the system, whereas $\hat\delta$ is the IV estimator based on regressing $y_j$ on $\hat Y_j, X_j, \hat x^e$, where the least squares fitted values are based on all the remaining exogenous variables, excluding $x^e$. The test statistic is then

    $$w = (\hat\delta^* - \hat\delta)'\left\{\text{Est. Var}[\hat\delta] - \text{Est. Var}[\hat\delta^*]\right\}^{-1}(\hat\delta^* - \hat\delta), \tag{15-35}$$

    which is the Wald statistic based on the difference of the two estimators. The statistic has one degree of freedom. (The extension to a set of variables is direct.)
    Example 15.8 Testing Overidentifying Restrictions

    For Klein's Model I, the test statistics and critical values for the chi-squared distribution for the overidentifying restrictions for the three equations are given in Table 15.4. There are 20 observations used to estimate the model and eight predetermined variables. The overidentifying restrictions for the wage equation are rejected by both single-equation tests. There are two possibilities. The equation may well be misspecified. Or, as Liu suggests, in a
    [27] Wu (1973), Durbin (1954), Hausman (1978), Nakamura and Nakamura (1981), and Dhrymes (1994).


    TABLE 15.4 Test Statistics and Critical Values

                                                              Chi-Squared Critical Values
    Equation       lambda    LR      TR^2     K*_j - M_j             chi^2[2]    chi^2[3]
    Consumption    1.499     9.98     8.77    2               5%     5.99        7.82
    Investment     1.086     1.72     1.81    3               1%     9.21       11.34
    Wages          2.466    29.3     12.49    3

    dynamic model, if there is autocorrelation of the disturbances, then the treatment of lagged endogenous variables as if they were exogenous is a specification error. The results above suggest a specification problem in the third equation of Klein's Model I. To pursue that finding, we now apply the preceding to test the exogeneity of $X_{-1}$. The two estimated parameter vectors are

    $\hat\delta^* = [1.5003, 0.43886, 0.14667, 0.13040]$ (i.e., 2SLS)

    and

    $\hat\delta = [1.2524, 0.42277, 0.167614, 0.13062]$.

    Using the Wald criterion, the chi-squared statistic is 1.3977. Thus, the hypothesis (such as it is) is not rejected.
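    The LR column of Table 15.4 can be reproduced from the reported LIML roots using $LR = T(\lambda_j - 1)$ with $T = 20$. The short sketch below does only that arithmetic and compares each statistic with the tabulated critical values; the dictionary and function names are illustrative choices, and the numbers are copied from the table.

```python
# Values copied from Table 15.4; T = 20 observations as stated in the example.
T = 20
roots = {"Consumption": 1.499, "Investment": 1.086, "Wages": 2.466}
dof = {"Consumption": 2, "Investment": 3, "Wages": 3}   # K*_j - M_j
crit_5pct = {2: 5.99, 3: 7.82}
crit_1pct = {2: 9.21, 3: 11.34}

def lr_statistic(root, n_obs):
    """Anderson-Rubin likelihood ratio statistic, LR = T (lambda - 1)."""
    return n_obs * (root - 1.0)

for eq, lam in roots.items():
    lr = lr_statistic(lam, T)
    d = dof[eq]
    print(f"{eq:12s} LR = {lr:6.2f}  df = {d}  "
          f"exceeds 5% value: {lr > crit_5pct[d]}  "
          f"exceeds 1% value: {lr > crit_1pct[d]}")
```

    The computed statistics match the LR column of the table to the reported precision.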

    15.9 PROPERTIES OF DYNAMIC MODELS

    In models with lagged endogenous variables, the entire previous time path of the exogenous variables and disturbances, not just their current values, determines the current value of the endogenous variables. The intrinsic dynamic properties of the autoregressive model, such as stability and the existence of an equilibrium value, are embodied in their autoregressive parameters. In this section, we are interested in long- and short-run multipliers, stability properties, and simulated time paths of the dependent variables.
    15.9.1 DYNAMIC MODELS AND THEIR MULTIPLIERS

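    The structural form of a dynamic model is

    $$y_t'\Gamma + x_t'B + y_{t-1}'\Phi = \varepsilon_t'. \tag{15-36}$$

    If the model contains additional lags, then we can add additional equations to the system of the form $y_{t-1}' = y_{t-1}'$. For example, a model with two periods of lags would be written

    $$[y_t' \;\; y_{t-1}']\begin{bmatrix} \Gamma & 0 \\ 0 & I \end{bmatrix} + x_t'[B \;\; 0] + [y_{t-1}' \;\; y_{t-2}']\begin{bmatrix} \Phi_1 & -I \\ \Phi_2 & 0 \end{bmatrix} = [\varepsilon_t' \;\; 0'],$$

    which can be treated as a model with only a single lag; this is in the form of (15-36). The reduced form is

    $$y_t' = x_t'\Pi + y_{t-1}'\Delta + v_t',$$

    where $\Pi = -B\Gamma^{-1}$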


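    and $\Delta = -\Phi\Gamma^{-1}$. From the reduced form,

    $$\frac{\partial y_{t,m}}{\partial x_{t,k}} = \Pi_{km}.$$

    The short-run effects are the coefficients on the current $x$s, so $\Pi$ is the matrix of impact multipliers. By substituting for $y_{t-1}$ in (15-36), we obtain

    $$y_t' = x_t'\Pi + x_{t-1}'\Pi\Delta + y_{t-2}'\Delta^2 + (v_t' + v_{t-1}'\Delta).$$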

    (This manipulation can easily be done with the lag operator; see Section 19.2.2. It is just as convenient to proceed in this fashion for the present.) Continuing this method for the full $t$ periods, we obtain

    $$y_t' = \sum_{s=0}^{t-1}\left[x_{t-s}'\Pi\Delta^s\right] + y_0'\Delta^t + \sum_{s=0}^{t-1}\left[v_{t-s}'\Delta^s\right]. \tag{15-37}$$

    This shows how the initial conditions $y_0$ and the subsequent time path of the exogenous variables and disturbances completely determine the current values of the endogenous variables. The coefficient matrices in the bracketed sum are the dynamic multipliers,

    $$\frac{\partial y_{t,m}}{\partial x_{t-s,k}} = (\Pi\Delta^s)_{km}.$$

    The cumulated multipliers are obtained by adding the matrices of dynamic multipliers. If we let $s$ go to infinity in (15-37), then we obtain the final form of the model,[28]

    $$y_t' = \sum_{s=0}^{\infty}\left[x_{t-s}'\Pi\Delta^s\right] + \sum_{s=0}^{\infty}\left[v_{t-s}'\Delta^s\right].$$

    Assume for the present that $\lim_{t\to\infty}\Delta^t = 0$. (This says that $\Delta$ is nilpotent.) Then the matrix of cumulated multipliers in the final form is

    $$\Pi[I + \Delta + \Delta^2 + \cdots] = \Pi[I - \Delta]^{-1}.$$

    These coefficient matrices are the long-run or equilibrium multipliers. We can also obtain the cumulated multipliers for $s$ periods as

    $$\text{cumulated multipliers} = \Pi[I - \Delta]^{-1}[I - \Delta^s].$$
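    The impact, dynamic, cumulated, and equilibrium multipliers are all simple matrix functions of $\Pi$ and $\Delta$. The following sketch is illustrative only: the function names are hypothetical, and the closed form for the cumulated multipliers is read here as the sum of the first $s$ dynamic multiplier matrices, $\Pi\Delta^0 + \cdots + \Pi\Delta^{s-1}$.

```python
import numpy as np

def dynamic_multipliers(Pi, Delta, s):
    """Effect on y_t of a unit change in x_{t-s}: Pi Delta^s (s = 0 gives impact multipliers)."""
    return Pi @ np.linalg.matrix_power(Delta, s)

def cumulated_multipliers(Pi, Delta, s):
    """Pi (I - Delta)^{-1} (I - Delta^s), the sum of Pi Delta^r for r = 0, ..., s-1."""
    I = np.eye(Delta.shape[0])
    return Pi @ np.linalg.solve(I - Delta, I - np.linalg.matrix_power(Delta, s))

def equilibrium_multipliers(Pi, Delta):
    """Long-run multipliers Pi (I - Delta)^{-1}; meaningful only if Delta^t -> 0."""
    I = np.eye(Delta.shape[0])
    return Pi @ np.linalg.inv(I - Delta)

# Small made-up example: K = 3 exogenous variables, M = 2 endogenous variables.
Pi = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])
Delta = np.array([[0.5, 0.1], [0.2, 0.3]])
print(dynamic_multipliers(Pi, Delta, 0))      # impact multipliers (equals Pi)
print(cumulated_multipliers(Pi, Delta, 5))
print(equilibrium_multipliers(Pi, Delta))
```

    As $s$ grows, the cumulated multipliers approach the equilibrium multipliers whenever the powers of $\Delta$ die out.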



    Suppose that the values of $x$ were permanently fixed at $\bar x$. Then the final form shows that if there are no disturbances, the equilibrium value of $y_t$ would be

    $$\bar y' = \sum_{s=0}^{\infty}\left[\bar x'\Pi\Delta^s\right] = \bar x'\Pi\sum_{s=0}^{\infty}\Delta^s = \bar x'\Pi[I - \Delta]^{-1}. \tag{15-38}$$

    [28] In some treatments, (15-37) is labeled the final form instead. Both forms eliminate the lagged values of the dependent variables from the current value. The dependence of the first form on the initial values may make it simpler to interpret than the second form.


    Therefore, the equilibrium multipliers are

    $$\frac{\partial\bar y_m}{\partial\bar x_k} = \left[\Pi(I - \Delta)^{-1}\right]_{km}.$$

    Some examples are shown below for Klein's Model I.
    15.9.2 STABILITY

    It remains to be shown that the matrix of multipliers in the final form converges. For the analysis to proceed, it is necessary for the matrix $\Delta^t$ to converge to a zero matrix. Although $\Delta$ is not a symmetric matrix, it will still have a spectral decomposition of the form

    $$\Delta = C\Lambda C^{-1}, \tag{15-39}$$

    where $\Lambda$ is a diagonal matrix containing the characteristic roots of $\Delta$ and each column of $C$ is a right characteristic vector,

    $$\Delta c_m = \lambda_m c_m. \tag{15-40}$$

    Since $\Delta$ is not symmetric, the elements of $\Lambda$ (and $C$) may be complex. Nonetheless, (A-105) continues to hold:

    $$\Delta^2 = C\Lambda C^{-1}C\Lambda C^{-1} = C\Lambda^2 C^{-1} \tag{15-41}$$

    and

    $$\Delta^t = C\Lambda^t C^{-1}.$$

    It is apparent that whether or not $\Delta^t$ vanishes as $t \to \infty$ depends on its characteristic roots. The condition is $|\lambda_m| < 1$. For the case of a complex root, $|\lambda_m| = |a + bi| = \sqrt{a^2 + b^2}$. For a given model, the stability may be established by examining the largest or dominant root.

    With many endogenous variables in the model but only a few lagged variables, $\Delta$ is a large but sparse matrix. Finding the characteristic roots of large, asymmetric matrices is a rather complex computation problem (although there exists specialized software for doing so). There is a way to make the problem a bit more compact. In the context of an example, in Klein's Model I, $\Delta$ is $6 \times 6$, but with three rows of zeros, it has only rank three and three nonzero roots. (See Table 15.5 in Example 15.9 following.) The following partitioning is useful. Let $y_{t1}$ be the set of endogenous variables that appear in both current and lagged form, and let $y_{t2}$ be those that appear only in current form. Then the model may be written

    $$[y_{t1}' \;\; y_{t2}'] = x_t'[\Pi_1 \;\; \Pi_2] + [y_{t-1,1}' \;\; y_{t-1,2}']\begin{bmatrix} \Delta_1 & \Delta_2 \\ 0 & 0 \end{bmatrix} + [v_{t1}' \;\; v_{t2}']. \tag{15-42}$$

    The characteristic roots of $\Delta$ are defined by the characteristic polynomial, $|\Delta - \lambda I| = 0$. For the partitioned model, this result is

    $$\begin{vmatrix} \Delta_1 - \lambda I & \Delta_2 \\ 0 & -\lambda I \end{vmatrix} = 0.$$


We may use (A-72) to obtain

$$|\Delta - \lambda I| = (-\lambda)^{M_2} \, |\Delta_1 - \lambda I| = 0,$$

where $M_2$ is the number of variables in $y_2$. Consequently, we need only concern ourselves with the submatrix of $\Delta$ that defines explicit autoregressions. The part of the reduced form defined by $y_{t2}' = x_t' \Pi_2 + y_{t-1,1}' \Delta_2$ is not directly relevant.
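A small numerical illustration of this reduction may help. The sketch below assumes NumPy; the blocks `delta1` and `delta2` are hypothetical values, not estimates from the text. It builds a matrix with rows of zeros for the variables that never appear lagged and confirms that its nonzero roots are the roots of the upper-left block.

```python
import numpy as np

# Hypothetical blocks: three variables appear lagged, two appear only in current form.
delta1 = np.array([[0.5, 0.2, 0.0],
                   [-0.3, 0.4, 0.1],
                   [0.1, 0.0, 0.6]])
delta2 = np.array([[0.3, -0.1],
                   [0.0, 0.2],
                   [0.4, 0.1]])
M1, M2 = delta2.shape

# Full Delta: rows for variables that never appear lagged are zero.
delta = np.block([[delta1, delta2],
                  [np.zeros((M2, M1)), np.zeros((M2, M2))]])

full_roots = np.linalg.eigvals(delta)
sub_roots = np.linalg.eigvals(delta1)

# Discard the M2 roots that are zero up to rounding error.
nonzero_roots = full_roots[np.abs(full_roots) > 1e-8]

# Every root of the submatrix should reappear among the nonzero roots.
matched = all(np.min(np.abs(nonzero_roots - r)) < 1e-8 for r in sub_roots)
print(len(nonzero_roots), matched)
```

Only the M1 x M1 block needs to be examined for stability, which matters when the full matrix is large and sparse.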
15.9.3 ADJUSTMENT TO EQUILIBRIUM

The adjustment of a dynamic model to an equilibrium involves the following conceptual experiment. We assume that the exogenous variables $x_t$ have been fixed at a level $\bar{x}$ for a long enough time that the endogenous variables have fully adjusted to their equilibrium $\bar{y}$ [defined in (15-38)]. In some arbitrarily chosen period, labeled period 0, an exogenous one-time shock hits the system, so that in period $t = 0$, $x_t = x_0 \neq \bar{x}$. Thereafter, $x_t$ returns to its former value $\bar{x}$, and $x_t = \bar{x}$ for all $t > 0$. We know from the expression for the final form that, if disturbed, $y_t$ will ultimately return to the equilibrium. That situation is ensured by the stability condition. Here we consider the time path of the adjustment. Since our only concern at this point is with the exogenous shock, we will ignore the disturbances in the analysis.

At time 0, $y_0' = x_0' \Pi + y_{-1}' \Delta$. But prior to time 0, the system was in equilibrium, so $y_0' = x_0' \Pi + \bar{y}' \Delta$. The initial displacement due to the shock to $\bar{x}$ is

$$y_0' - \bar{y}' = x_0' \Pi - \bar{y}' (I - \Delta).$$

Substituting $\bar{x}' \Pi = \bar{y}' (I - \Delta)$ produces

$$y_0' - \bar{y}' = (x_0' - \bar{x}') \Pi. \tag{15-43}$$

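As might be expected, the initial displacement is determined entirely by the exogenous shock occurring in that period. Since $x_t = \bar{x}$ after period 0, (15-37) implies that

$$\begin{aligned}
y_t' &= \sum_{s=0}^{t-1} \bar{x}' \Pi \Delta^s + y_0' \Delta^t \\
     &= \bar{x}' \Pi (I - \Delta^t)(I - \Delta)^{-1} + y_0' \Delta^t \\
     &= \bar{y}' - \bar{y}' \Delta^t + y_0' \Delta^t \\
     &= \bar{y}' + (y_0' - \bar{y}') \Delta^t.
\end{aligned}$$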



Thus, the entire time path is a function of the initial displacement. By inserting (15-43), we see that

$$y_t' = \bar{y}' + (x_0' - \bar{x}') \Pi \Delta^t. \tag{15-44}$$
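The closed form in (15-44) can be compared with a direct iteration of the reduced form. The following is a rough sketch, assuming NumPy; the matrices `Pi` and `Delta` and the vectors `x_bar` and `x0` are hypothetical values chosen so that the system is stable, and disturbances are ignored as in the text.

```python
import numpy as np

# Hypothetical reduced form (illustration only): y_t' = x_t' Pi + y_{t-1}' Delta.
Pi = np.array([[1.0, 0.5],
               [0.2, -0.4]])
Delta = np.array([[0.6, 0.1],
                  [-0.2, 0.5]])          # roots have modulus below one
x_bar = np.array([1.0, 2.0])             # long-run level of the exogenous variables
x0 = np.array([2.0, 2.0])                # one-time shock in period 0

# Equilibrium from (15-38): y_bar' = x_bar' Pi (I - Delta)^{-1}.
y_bar = x_bar @ Pi @ np.linalg.inv(np.eye(2) - Delta)

# Iterate the reduced form, starting from equilibrium before the shock.
T = 30
path = np.empty((T + 1, 2))
y_prev = y_bar
for t in range(T + 1):
    x_t = x0 if t == 0 else x_bar
    y_prev = x_t @ Pi + y_prev @ Delta
    path[t] = y_prev

# Closed form (15-44): y_t' = y_bar' + (x0 - x_bar)' Pi Delta^t.
closed_form = np.array([
    y_bar + (x0 - x_bar) @ Pi @ np.linalg.matrix_power(Delta, t)
    for t in range(T + 1)
])
print(np.max(np.abs(path - closed_form)), path[-1] - y_bar)
```

The two paths agree period by period, and the deviation from equilibrium shrinks toward zero as t grows.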

Since $\lim_{t \to \infty} \Delta^t = 0$, the path back to the equilibrium subsequent to the exogenous shock $(x_0 - \bar{x})$ is defined. The stability condition imposed on $\Delta$ ensures that if the system is disturbed at some point by a one-time shock, then barring further shocks or disturbances, it will return to its equilibrium. Since $y_0$, $\bar{x}$, $x_0$, and $\Pi$ are fixed for all time, the shape of the path is completely determined by the behavior of $\Delta^t$, which we now examine.

In the preceding section, in (15-39) to (15-42), we used the characteristic roots of $\Delta$ to infer the (lack of) stability of the model. The spectral decomposition of $\Delta^t$ given in (15-41) may be written

$$\Delta^t = \sum_{m=1}^{M} \lambda_m^t c_m d_m',$$

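where $c_m$ is the mth column of $C$ and $d_m'$ is the mth row of $C^{-1}$.[29] Inserting this result in (15-44) gives

$$(y_t - \bar{y})' = \left[(x_0 - \bar{x})' \Pi\right] \sum_{m=1}^{M} \lambda_m^t c_m d_m' = \sum_{m=1}^{M} \lambda_m^t \left[(x_0 - \bar{x})' \Pi c_m d_m'\right] = \sum_{m=1}^{M} \lambda_m^t g_m'.$$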

Note that this equation may involve fewer than M terms, since some of the roots may be zero. [For Klein's Model I, M = 6, but there are only three nonzero roots.] Since $g_m$ depends only on the initial conditions and the parameters of the model, the behavior of the time path of $(y_t - \bar{y})$ is completely determined by $\lambda_m^t$. In each period, the deviation from the equilibrium is a sum of M terms of powers of $\lambda_m$ times a constant. (Each variable has its own set of constants.) The terms in the sum behave as follows:

$\lambda_m$ real $> 0$: $\lambda_m^t$ adds a damped exponential term,
$\lambda_m$ real $< 0$: $\lambda_m^t$ adds a damped sawtooth term,
$\lambda_m$ complex: $\lambda_m^t$ adds a damped sinusoidal term.

If we write the complex root $\lambda_m = a + bi$ in polar form, then $\lambda = A[\cos B + i \sin B]$, where $A = [a^2 + b^2]^{1/2}$ and $B = \arccos(a/A)$ (in radians), and the sinusoidal components each have amplitude $A^t$ and period $2\pi/B$.[30]
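Example 15.9 Dynamic Model

The 2SLS estimates of the structure and reduced form of Klein's Model I are given in Table 15.5. (Only the nonzero rows of $\hat{\Phi}$ and $\hat{\Delta}$ are shown.) For the 2SLS estimates of Klein's Model I, the relevant submatrix of $\hat{\Delta}$ is

$$\hat{\Delta}_1 = \begin{bmatrix} 0.172 & -0.051 & -0.008 \\ 1.511 & 0.848 & 0.743 \\ -0.287 & -0.161 & 0.818 \end{bmatrix},$$

where the rows correspond to $X_{-1}$, $P_{-1}$, and $K_{-1}$ and the columns to $X$, $P$, and $K$.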



[29] See Section A.6.9.

[30] Goldberger (1964, p. 378).


TABLE 15.5  2SLS Estimates of Coefficient Matrices in Klein's Model I

                                          Equation
Matrix           Variable    C         I         W^p       X         P         K
$\hat{\Gamma}$   C           1         0         0        -1         0         0
                 I           0         1         0        -1         0        -1
                 W^p        -0.810     0         1         0         1         0
                 X           0         0        -0.439     1        -1         0
                 P          -0.017    -0.15      0         0         1         0
                 K           0         0         0         0         0         1
$\hat{B}$        1         -16.555   -20.278    -1.5       0         0         0
                 W^g        -0.810     0         0         0         0         0
                 T           0         0         0         0         1         0
                 G           0         0         0        -1         0         0
                 A           0         0        -0.13      0         0         0
$\hat{\Phi}$     X_{-1}      0         0        -0.147     0         0         0
                 P_{-1}     -0.216    -0.616     0         0         0         0
                 K_{-1}      0         0.158     0         0         0        -1
$\hat{\Pi}$      1          42.80     25.83     31.63     68.63     37.00     25.83
                 W^g         1.35      0.124     0.646     1.47      0.825     0.125
                 T          -0.128    -0.176    -0.133    -0.303    -1.17     -0.176
                 G           0.663     0.153     0.797     1.82      1.02      0.153
                 A           0.159    -0.007     0.197     0.152    -0.045    -0.007
$\hat{\Delta}$   X_{-1}      0.179    -0.008     0.222     0.172    -0.051    -0.008
                 P_{-1}      0.767     0.743     0.663     1.511     0.848     0.743
                 K_{-1}     -0.105    -0.182    -0.125    -0.287    -0.161     0.818

The characteristic roots of this matrix are 0.2995 and the complex pair 0.7692 ± 0.3494i = 0.8448[cos 0.4263 ± i sin 0.4263]. The moduli of the complex roots are 0.8448, so we conclude that the model is stable. The period for the oscillations is 2π/0.4263 = 14.73 periods (years). (See Figure 15.2.)
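These figures can be reproduced approximately from the rounded coefficients. The sketch below assumes NumPy and uses the 3 x 3 submatrix of the estimated lag coefficients from this example, with signs as implied by the identities of Klein's Model I; because the inputs are rounded to three decimals, the computed roots agree with the reported ones only to about that precision.

```python
import numpy as np

# Relevant submatrix of the estimated Delta in Example 15.9
# (rows X_{-1}, P_{-1}, K_{-1}; columns X, P, K), rounded to three decimals.
delta1 = np.array([[ 0.172, -0.051, -0.008],
                   [ 1.511,  0.848,  0.743],
                   [-0.287, -0.161,  0.818]])

roots = np.linalg.eigvals(delta1)
moduli = np.abs(roots)
dominant = moduli.max()            # modulus of the dominant root
stable = dominant < 1.0            # stability requires all moduli below one

# Polar form of the complex root a + bi: A = |root|, B = arccos(a / A).
complex_root = roots[np.argmax(roots.imag)]
A = abs(complex_root)
B = np.arccos(complex_root.real / A)
period = 2 * np.pi / B             # length of one oscillation, in periods

print(np.round(roots, 4), round(dominant, 4), round(period, 2), stable)
```

Checking only the dominant modulus is enough for the stability conclusion; the period calculation uses the same polar-form quantities A and B defined earlier in the section.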

For a particular variable or group of variables, the various multipliers are submatrices of the multiplier matrices. The dynamic multipliers based on the estimates in Table 15.5 for the effects of the policy variables T and G on output, X, are plotted in Figure 15.2 for current and 20 lagged values. A plot of the period multipliers against the lag length is called the impulse response function. The damped sine wave pattern is characteristic of a dynamic system with complex roots. When the roots are real and positive, the impulse response function is a monotonically declining function instead. This model has the interesting feature that the long-run multipliers of both policy variables for investment are zero. This is intrinsic to the model. The estimated long-run balanced-budget multiplier for equal increases in spending and taxes is 2.10 + (−1.48) = 0.62.


FIGURE 15.2 Impulse Response Function. [Plot of the dynamic multipliers of Taxes and Spending on output against Lag, 0 to 20; vertical axis from −2 to 2.]

15.10 SUMMARY AND CONCLUSIONS

The models surveyed in this chapter involve most of the issues that arise in analysis of linear equations in econometrics. Before one embarks on the process of estimation, it is necessary to establish that the sample data actually contain sufficient information to provide estimates of the parameters in question. This is the question of identification. Identification involves both the statistical properties of estimators and the role of theory in the specification of the model. Once identification is established, there are numerous methods of estimation. We considered a number of single-equation techniques, including least squares, instrumental variables, GMM, and maximum likelihood. Fully efficient use of the sample data will require joint estimation of all the equations in the system. Once again, there are several techniques; these are extensions of the single-equation methods, including three-stage least squares, GMM, and full information maximum likelihood. In both frameworks, this is one of those benign situations in which the computationally simplest estimator is generally the most efficient one.

In the final section of this chapter, we examined the special properties of dynamic models. An important consideration in this analysis was the stability of the equations. Modern macroeconometrics involves many models in which one or more roots of the dynamic system equal one, so that these models, in the simple autoregressive form, are unstable. In terms of the analysis in Section 15.9.3, in such a model, a shock to the system is permanent; the effects do not die out. We will examine a model of monetary policy with these characteristics in Example 19.6.8.


Key Terms and Concepts

Admissible; Behavioral equation; Causality; Complete system; Completeness condition; Consistent estimates; Cumulative multiplier; Dominant root; Dynamic model; Dynamic multiplier; Econometric model; Endogenous; Equilibrium condition; Equilibrium multipliers; Exactly identified model; Exclusion restrictions; Exogenous; FIML; Final form; Full information; Fully recursive model; GMM estimation; Granger causality; Identification; Impact multiplier; Impulse response function; Indirect least squares; Initial conditions; Instrumental variable estimator; Interdependent; Jointly dependent; k class; Least variance ratio; Limited information; LIML; Nonlinear system; Nonsample information; Nonstructural; Normalization; Observationally equivalent; Order condition; Overidentification; Predetermined variable; Problem of identification; Rank condition; Recursive model; Reduced form; Reduced-form disturbance; Restrictions; Simultaneous-equations bias; Specification test; Stability; Structural disturbance; Structural equation; System methods of estimation; Three-stage least squares; Triangular system; Two-stage least squares; Weakly exogenous.

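Exercises

1. Consider the following two-equation model:

$$y_1 = \gamma_1 y_2 + \beta_{11} x_1 + \beta_{21} x_2 + \beta_{31} x_3 + \varepsilon_1,$$
$$y_2 = \gamma_2 y_1 + \beta_{12} x_1 + \beta_{22} x_2 + \beta_{32} x_3 + \varepsilon_2.$$

a. Verify that as stated, neither equation is identified.
b. Establish whether or not the following restrictions are sufficient to identify (or partially identify) the model:
(1) $\beta_{21} = \beta_{32} = 0$,
(2) $\beta_{12} = \beta_{22} = 0$,
(3) $\gamma_1 = 0$,
(4) $\gamma_1 = \gamma_2$ and $\beta_{32} = 0$,
(5) $\sigma_{12} = 0$ and $\beta_{31} = 0$,
(6) $\gamma_1 = 0$ and $\sigma_{12} = 0$,
(7) $\beta_{21} + \beta_{22} = 1$,
(8) $\sigma_{12} = 0$, $\beta_{21} = \beta_{22} = \beta_{31} = \beta_{32} = 0$,
(9) $\sigma_{12} = 0$, $\beta_{11} = \beta_{21} = \beta_{22} = \beta_{31} = \beta_{32} = 0$.

2. Verify the rank and order conditions for identification of the second and third behavioral equations in Klein's Model I.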


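3. Check the identifiability of the parameters of the following model:

$$[\,y_1 \; y_2 \; y_3 \; y_4\,] \begin{bmatrix} 1 & \gamma_{12} & 0 & 0 \\ \gamma_{21} & 1 & \gamma_{23} & \gamma_{24} \\ 0 & \gamma_{32} & 1 & \gamma_{34} \\ \gamma_{41} & \gamma_{42} & 0 & 1 \end{bmatrix} + [\,x_1 \; x_2 \; x_3 \; x_4 \; x_5\,] \begin{bmatrix} 0 & \beta_{12} & \beta_{13} & \beta_{14} \\ \beta_{21} & 1 & 0 & \beta_{24} \\ \beta_{31} & \beta_{32} & \beta_{33} & 0 \\ 0 & 0 & \beta_{43} & \beta_{44} \\ 0 & \beta_{52} & 0 & 0 \end{bmatrix} = [\,\varepsilon_1 \; \varepsilon_2 \; \varepsilon_3 \; \varepsilon_4\,].$$

4. Obtain the reduced form for the model in Exercise 1 under each of the assumptions made in part a and in parts b1 and b9.

5. The following model is specified:

$$y_1 = \gamma_1 y_2 + \beta_{11} x_1 + \varepsilon_1,$$
$$y_2 = \gamma_2 y_1 + \beta_{22} x_2 + \beta_{32} x_3 + \varepsilon_2.$$

All variables are measured as deviations from their means. The sample of 25 observations produces the following matrix of sums of squares and cross products:

        y1    y2    x1    x2    x3
y1      20     6     4     3     5
y2       6    10     3     6     7
x1       4     3     5     2     3
x2       3     6     2    10     8
x3       5     7     3     8    15

a. Estimate the two equations by OLS.
b. Estimate the parameters of the two equations by 2SLS. Also estimate the asymptotic covariance matrix of the 2SLS estimates.
c. Obtain the LIML estimates of the parameters of the first equation.
d. Estimate the two equations by 3SLS.
e. Estimate the reduced-form coefficient matrix by OLS and indirectly by using your structural estimates from part b.

6. For the model

$$y_1 = \gamma_1 y_2 + \beta_{11} x_1 + \beta_{21} x_2 + \varepsilon_1,$$
$$y_2 = \gamma_2 y_1 + \beta_{32} x_3 + \beta_{42} x_4 + \varepsilon_2,$$

show that there are two restrictions on the reduced-form coefficients. Describe a procedure for estimating the model while incorporating the restrictions.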


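7. An updated version of Klein's Model I was estimated. The relevant submatrix of $\Delta$ is

$$\Delta_1 = \begin{bmatrix} -0.1899 & -0.9471 & -0.8991 \\ 0 & 0.9287 & 0 \\ -0.0656 & -0.0791 & 0.0952 \end{bmatrix}.$$

Is the model stable?

8. Prove that $\operatorname{plim} \, Y_j' \varepsilon_j / T = \omega_j - \Omega_{jj} \gamma_j$.

9. Prove that an underidentified equation cannot be estimated by 2SLS.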


    16

    ESTIMATION FRAMEWORKS IN ECONOMETRICS

16.1 INTRODUCTION

This chapter begins our treatment of methods of estimation. Contemporary econometrics offers the practitioner a remarkable variety of estimation methods, ranging from tightly parameterized likelihood-based techniques at one end to thinly stated nonparametric methods that assume little more than mere association between variables at the other, and a rich variety in between. Even the experienced researcher could be forgiven for wondering how they should choose from this long menu. It is certainly beyond our scope to answer this question here, but a few principles can be suggested. Recent research has leaned, when possible, toward methods that require few (or fewer) possibly unwarranted or improper assumptions. This explains the ascendance of the GMM estimator in situations where strong likelihood-based parameterizations can be avoided and robust estimation can be done in the presence of heteroscedasticity and serial correlation. (It is intriguing to observe that this is occurring at a time when advances in computation have helped bring about increased acceptance of very heavily parameterized Bayesian methods.)

As a general proposition, the progression from full to semi- to non-parametric estimation relaxes strong assumptions, but at the cost of weakening the conclusions that can be drawn from the data. As much as anywhere else, this is clear in the analysis of discrete choice models, which provide one of the most active literatures in the field. (A sampler appears in Chapter 21.) A formal probit or logit model allows estimation of probabilities, marginal effects, and a host of ancillary results, but at the cost of imposing the normal or logistic distribution on the data. Semiparametric and nonparametric estimators allow one to relax the restriction, but often provide, in return, only ranges of probabilities, if that, and in many cases preclude estimation of probabilities or useful marginal effects. One does have the virtue of robustness in the conclusions, however. [See, e.g., the symposium in Angrist (2001) for a spirited discussion on these points.]

Estimation properties are another arena in which the different approaches can be compared. Within a class of estimators, one can define "the best" (most efficient) means of using the data. (See Example 16.2 below for an application.) Sometimes comparisons can be made across classes as well. For example, when they are estimating the same parameters (this remains to be established), the best parametric estimator will generally outperform the best semiparametric estimator. That is the value of the information, of course. The other side of the comparison, however, is that the semiparametric estimator will carry the day if the parametric model is misspecified in a fashion to which the semiparametric estimator is robust (and the parametric model is not).


Schools of thought have entered this conversation for a long time. Proponents of Bayesian estimation often took an almost theological viewpoint in their criticism of their classical colleagues. [See, for example, Poirier (1995).] Contemporary practitioners are usually more pragmatic than this. Bayesian estimation has gained currency as a set of techniques that can, in very many cases, provide both elegant and tractable solutions to problems that have heretofore been out of reach. Thus, for example, the simulation-based estimation advocated in the many papers of Chib and Greenberg (e.g., 1996) has provided solutions to a variety of computationally challenging problems.[1] Arguments as to the methodological virtue of one approach or the other have received much less attention than before.

Chapters 2 through 9 of this book have focused on the classical regression model and a particular estimator, least squares (linear and nonlinear). In this and the next two chapters, we will examine several general estimation strategies that are used in a wide variety of situations. This chapter will survey a few methods in the three broad areas we have listed, including Bayesian methods. Chapter 17 presents the method of maximum likelihood, the broad platform for parametric, classical estimation in econometrics. Chapter 18 discusses the generalized method of moments, which has emerged as the centerpiece of semiparametric estimation. Sections 16.2.4 and 17.8 will examine two specific estimation frameworks, one Bayesian and one classical, that are based on simulation methods. This is a recently developed body of techniques that has been made feasible by advances in estimation technology and that has made quite straightforward many estimators which were previously only scarcely used because of the sheer difficulty of the computations.

The list of techniques presented here is far from complete. We have chosen a set that constitutes the mainstream of econometrics. Certainly there are others that might be considered. [See, for example, Mittelhammer, Judge, and Miller (2000) for a lengthy catalog.] Virtually all of them are the subject of excellent monographs. In this chapter we will present several applications, some from the literature, some home grown, to demonstrate the range of techniques that are current in econometric practice.

We begin in Section 16.2 with parametric approaches, primarily maximum likelihood. Since this is the subject of much of the remainder of this book, this section is brief. Section 16.2 also presents Bayesian estimation, which in its traditional form is as heavily parameterized as maximum likelihood estimation. This section focuses mostly on the linear model. A few applications of Bayesian techniques to other models are presented as well. We will also return to what is currently the standard toolkit in Bayesian estimation, Markov Chain Monte Carlo methods, in Section 16.2.4. Section 16.2.3 presents an emerging technique in the classical tradition, latent class modeling, which makes interesting use of a fundamental result based on Bayes Theorem. Section 16.3 is on semiparametric estimation. GMM estimation is the subject of all of Chapter 18, so it is only introduced here. The technique of least absolute deviations is presented here as well. A range of applications from the recent literature is also surveyed. Section 16.4 describes nonparametric estimation. The fundamental tool, the kernel density estimator, is developed, then applied to a problem in regression analysis. Two applications are presented here as well.

Being focused on application, this chapter will say very little about the statistical theory for these techniques, such as their asymptotic properties. (The results are developed at length in the literature, of course.) We will turn to the subject of the properties of estimators briefly at the end of the chapter, in Section 16.5, then in greater detail in Chapters 17 and 18.

[1] The penetration of Bayesian econometrics could be overstated. It is fairly well represented in the current journals such as the Journal of Econometrics, Journal of Applied Econometrics, Journal of Business and Economic Statistics, and so on. On the other hand, in the six major general treatments of econometrics published in 2000, four (Hayashi, Ruud, Patterson, Davidson) do not mention Bayesian methods at all, a buffet of 32 essays (Baltagi) devotes only one to the subject, and the one that displays any preference (Mittelhammer et al.) devotes nearly 10 percent (70) of its pages to Bayesian estimation, but all to the broad metatheory or the linear regression model and none to the more elaborate applications that form the received applications in the many journals in the field.

16.2 PARAMETRIC ESTIMATION AND INFERENCE

Parametric estimation departs from a full statement of the density or probability model that provides the data generating mechanism for a random variable of interest. For the sorts of applications we have considered thus far, we might say that the joint density of a scalar random variable, y, and a random vector, x, of interest can be specified by

$$f(y, x) = g(y \mid x, \beta) \times h(x \mid \theta) \tag{16-1}$$

with unknown parameters $\beta$ and $\theta$. To continue the application that has occupied us since Chapter 2, consider the linear regression model with normally distributed disturbances. The assumption produces a full statement of the conditional density that is the population from which an observation is drawn:

$$y_i \mid x_i \sim N[x_i' \beta, \sigma^2].$$

All that remains for a full definition of the population is knowledge of the specific values taken by the unknown but fixed parameters. With those in hand, the conditional probability distribution for $y_i$ is completely defined: mean, variance, probabilities of certain events, and so on. (The marginal density for the conditioning variables is usually not of particular interest.) Thus, the signature features of this modeling platform are specification of both the density and the features (parameters) of that density.

The parameter space for the parametric model is the set of allowable values of the parameters which satisfy some prior specification of the model. For example, in the regression model specified previously, the K regression slopes may take any real value, but the variance must be a positive number. Therefore, the parameter space for that model is $[\beta, \sigma^2] \in \mathbb{R}^K \times \mathbb{R}_+$. Estimation in this context consists of specifying a criterion for ranking the points in the parameter space, then choosing that point (a point estimate) or a set of points (an interval estimate) that optimizes that criterion, that is, has the best ranking. Thus, for example, we chose linear least squares as one estimation criterion for the linear model. Inference in this setting is a process by which some regions of the (already specified) parameter space are deemed not to contain the unknown parameters, though, in more practical terms, we typically define a criterion and then state that, by that criterion, certain regions are unlikely to contain the true parameters.


16.2.1 CLASSICAL LIKELIHOOD-BASED ESTIMATION

The most common (by far) class of parametric estimators used in econometrics is the class of maximum likelihood estimators. The underlying philosophy of this class of estimators is the idea of "sample information." When the density of a sample of observations is completely specified, apart from the unknown parameters, then the joint density of those observations (assuming they are independent) is the likelihood function,

$$f(y_1, y_2, \ldots, x_1, x_2, \ldots) = \prod_{i=1}^{n} f(y_i, x_i \mid \beta, \theta). \tag{16-2}$$

This function contains all the information available in the sample about the population from which those observations were drawn. The strategy by which that information is used in estimation constitutes the estimator. The maximum likelihood estimator [Fisher (1925)] is that function of the data which (as its name implies) maximizes the likelihood function (or, because it is usually more convenient, the log of the likelihood function). The motivation for this approach is most easily visualized in the setting of a discrete random variable. In this case, the likelihood function gives the joint probability for the observed sample observations, and the maximum likelihood estimator is the function of the sample information which makes the observed data most probable (at least by that criterion). Though the analogy is most intuitively appealing for a discrete variable, it carries over to continuous variables as well. Since this estimator is the subject of Chapter 17, which is quite lengthy, we will defer any formal discussion until then, and consider instead two applications to illustrate the techniques and underpinnings.
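The discrete case can be made concrete with a small sketch. The example below is an illustration only: it assumes a hypothetical sample of ten Bernoulli outcomes, evaluates the log of the joint probability of that sample on a grid of candidate success probabilities, and picks the value that makes the observed data most probable.

```python
import math

# Hypothetical sample of ten Bernoulli draws (1 = success); not from the text.
y = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

def log_likelihood(p):
    """Log of the joint probability of the observed sample when P(y = 1) = p."""
    return sum(math.log(p) if yi == 1 else math.log(1.0 - p) for yi in y)

# Candidate parameter values 0.01, 0.02, ..., 0.99.
grid = [k / 100 for k in range(1, 100)]

# The maximum likelihood estimate is the grid point with the largest log-likelihood.
p_hat = max(grid, key=log_likelihood)
sample_mean = sum(y) / len(y)
print(p_hat, sample_mean)
```

For this model the maximizer coincides with the sample proportion of successes, which is the familiar analytical result; a grid search is used here only to keep the illustration transparent.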
Example 16.1 The Linear Regression Model

Least squares weighs negative and positive deviations equally and gives disproportionate weight to large deviations in the calculation. This property can be an advantage or a disadvantage, depending on the data-generating process. For normally distributed disturbances, this method is precisely the one needed to use the data most efficiently. If the data are generated by a normal distribution, then the log of the likelihood function is

$$\ln L = -\frac{n}{2} \ln 2\pi - \frac{n}{2} \ln \sigma^2 - \frac{1}{2\sigma^2} (y - X\beta)'(y - X\beta).$$

You can easily show that least squares is the estimator of choice for this model. Maximizing the function means minimizing the exponent, which is done by least squares for $\beta$ and $e'e/n$ for $\sigma^2$.

If the appropriate distribution is deemed to be something other than normal, perhaps on the basis of an observation that the tails of the disturbance distribution are too thick (see Example 5.1 and Section 17.6.3), then there are three ways one might proceed. First, as we have observed, the consistency of least squares is robust to this failure of the specification, so long as the conditional mean of the disturbances is still zero. Some correction to the standard errors is necessary for proper inferences. (See Section 10.3.) Second, one might want to proceed to an estimator with better finite sample properties. The least absolute deviations estimator discussed in Section 16.3.2 is a candidate. Finally, one might consider some other distribution which accommodates the observed discrepancy. For example, Ruud (2000) examines in some detail a linear regression model with disturbances distributed according to the t distribution with v degrees of freedom. As long as v is finite, this random variable will have a larger variance than the normal. Which way should one proceed? The third approach is the least appealing. Surely if the normal distribution is inappropriate, then it would be difficult to come up with a plausible mechanism whereby the t distribution would not be. The LAD estimator might well be preferable if the sample were small. If not, then least squares would probably remain the estimator of choice, with some allowance for the fact that standard inference tools would probably be misleading. Current practice is generally to adopt the first strategy.
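The claim that least squares maximizes the normal log-likelihood can be checked numerically. The sketch below assumes NumPy and uses simulated data with hypothetical parameter values; it evaluates the log-likelihood at the least squares coefficients with variance estimate e'e/n and at a set of randomly perturbed points.

```python
import numpy as np

rng = np.random.default_rng(42)
n, K = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta_true = np.array([1.0, 0.5, -2.0])           # hypothetical values for the simulation
y = X @ beta_true + rng.normal(scale=1.5, size=n)

def log_likelihood(beta, sigma2):
    """Normal log-likelihood of the linear regression model."""
    resid = y - X @ beta
    return (-0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(sigma2)
            - resid @ resid / (2 * sigma2))

# Least squares coefficients and the ML variance estimator (divides by n, not n - K).
b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b
sigma2_hat = e @ e / n

best = log_likelihood(b, sigma2_hat)

# Log-likelihood at nearby points: perturbed coefficients and rescaled variance.
perturbed = [log_likelihood(b + 0.05 * rng.normal(size=K), sigma2_hat * scale)
             for scale in (0.9, 1.0, 1.1) for _ in range(20)]
print(best, max(perturbed))
```

None of the perturbed points should attain a higher value than the least squares solution, which is what the analytical argument in the example implies.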
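Example 16.2 The Stochastic Frontier Model

The stochastic frontier model, discussed in detail in Section 17.6.3, is a regression-like model with a disturbance that is asymmetric and distinctly nonnormal. (See Figure 17.3.) The conditional density for the dependent variable in this model is

$$f(y \mid x, \beta, \sigma, \lambda) = \frac{\sqrt{2}}{\sigma \sqrt{\pi}} \exp\left[ -\frac{(y - \alpha - x'\beta)^2}{2\sigma^2} \right] \Phi\left( -\frac{\lambda (y - \alpha - x'\beta)}{\sigma} \right).$$

This produces a log-likelihood function for the model,

$$\ln L = -n \ln \sigma + \frac{n}{2} \ln \frac{2}{\pi} - \frac{1}{2} \sum_{i=1}^{n} \left( \frac{\varepsilon_i}{\sigma} \right)^2 + \sum_{i=1}^{n} \ln \Phi\left( -\frac{\lambda \varepsilon_i}{\sigma} \right),$$

where $\varepsilon_i = y_i - \alpha - x_i'\beta$.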

There are at least two fully parametric estimators for this model. The maximum likelihood estimator is discussed in Section 17.6.3. Greene (1997b) presents the following method of moments estimator: For the regression slopes, excluding the constant term, use least squares. For the parameters $\alpha$, $\sigma$, and $\lambda$, based on the second and third moments of the least squares residuals and the least squares constant, solve

$$m_2 = \sigma_v^2 + [1 - 2/\pi] \sigma_u^2,$$
$$m_3 = (2/\pi)^{1/2} [1 - 4/\pi] \sigma_u^3,$$
$$a = \alpha - (2/\pi)^{1/2} \sigma_u,$$

where $\lambda = \sigma_u / \sigma_v$ and $\sigma^2 = \sigma_u^2 + \sigma_v^2$.

Both estimators are fully parametric. The maximum likelihood estimator is so for the reasons discussed earlier. The method of moments estimators (see Section 18.2) are appropriate only for this distribution. Which is preferable? As we will see in Chapter 17, both estimators are consistent and asymptotically normally distributed. By virtue of the Cramér–Rao theorem, the maximum likelihood estimator has a smaller asymptotic variance. Neither has any small sample optimality properties. Thus, the only virtue of the method of moments estimator is that one can compute it with any standard regression/statistics computer package and a hand calculator, whereas the maximum likelihood estimator requires specialized software (only somewhat; it is reasonably common).
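The method of moments calculation can be sketched on simulated data. The code below assumes NumPy, a disturbance of the form v − u with u half-normal (the negatively skewed case), a single regressor, and hypothetical parameter values; the moment equations are the ones implied by that disturbance, so the relation used for the constant term is an assumption of this sketch rather than a quotation from the text.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
sigma_u, sigma_v, alpha, beta = 0.8, 0.4, 1.0, 0.5    # hypothetical true values

x = rng.normal(size=n)
v = rng.normal(scale=sigma_v, size=n)                 # symmetric noise
u = np.abs(rng.normal(scale=sigma_u, size=n))         # one-sided (half-normal) term
y = alpha + beta * x + v - u

# Step 1: least squares for the slope and the (biased) constant a.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b = coef
e = y - X @ coef

# Step 2: second and third moments of the least squares residuals.
m2, m3 = np.mean(e**2), np.mean(e**3)

# Step 3: solve the moment equations for sigma_u, sigma_v^2, and alpha.
c = np.sqrt(2 / np.pi)
sigma_u_hat = np.cbrt(m3 / (c * (1 - 4 / np.pi)))     # m3 is negative, so this is positive
sigma_v2_hat = m2 - (1 - 2 / np.pi) * sigma_u_hat**2
alpha_hat = a + c * sigma_u_hat                       # undo the shift E[-u] = -c * sigma_u

lambda_hat = sigma_u_hat / np.sqrt(sigma_v2_hat)
print(b, sigma_u_hat, np.sqrt(sigma_v2_hat), alpha_hat, lambda_hat)
```

In small samples the third moment of the residuals can have the wrong sign, in which case the cube-root step gives a negative value and the estimator breaks down; the large simulated sample here avoids that case.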

16.2.2 BAYESIAN ESTIMATION

Parametric formulations present a bit of a methodological dilemma. They would seem to straitjacket the researcher into a fixed and immutable specification of the model. But in any analysis, there is uncertainty as to the magnitudes and even, on occasion, the signs of coefficients. It is rare that the presentation of a set of empirical results has not been preceded by at least some exploratory analysis. Proponents of the Bayesian methodology argue that the process of "estimation" is not one of deducing the values of fixed parameters, but rather one of continually updating and sharpening our subjective beliefs about the state of the world.

The centerpiece of the Bayesian methodology is Bayes theorem: for events A and B, the conditional probability of event A given that B has occurred is

$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}.$$


Paraphrased for our applications here, we would write

$$P(\text{parameters} \mid \text{data}) = \frac{P(\text{data} \mid \text{parameters}) \, P(\text{parameters})}{P(\text{data})}.$$

In this setting, the data are viewed as constants whose distributions do not involve the parameters of interest. For the purpose of the study, we treat the data as only a fixed set of additional information to be used in updating our beliefs about the parameters. [Note the similarity to the way that the joint density for our parametric model is specified in (16-1).] Thus, we write

$$P(\text{parameters} \mid \text{data}) \propto P(\text{data} \mid \text{parameters}) \, P(\text{parameters}) = \text{Likelihood function} \times \text{Prior density}.$$

The symbol $\propto$ means "is proportional to." In the preceding equation, we have dropped the marginal density of the data, so what remains is not a proper density until it is scaled by what will be an inessential proportionality constant. The first term on the right is the joint distribution of the observed random variables y, given the parameters. As we shall analyze it here, this distribution is the normal distribution we have used in our previous analysis [see (16-1)]. The second term is the prior beliefs of the analyst. The left-hand side is the posterior density of the parameters, given the current body of data, or our revised beliefs about the distribution of the parameters after "seeing" the data. The posterior is a mixture of the prior information and the "current information," that is, the data. Once obtained, this posterior density is available to be the prior density function when the next body of data or other usable information becomes available. The principle involved, which appears nowhere in the classical analysis, is one of continual accretion of knowledge about the parameters.

Traditional Bayesian estimation is heavily parameterized. The prior density and the likelihood function are crucial elements of the analysis, and both must be fully specified for estimation to proceed. The Bayesian "estimator" is the mean of the posterior density of the parameters, a quantity that is usually obtained either by integration (when closed forms exist), approximation of integrals by numerical techniques, or by Monte Carlo methods, which are discussed in Section 16.2.4.
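The updating principle can be illustrated with a deliberately simple case. The sketch below is not the regression model analyzed in this chapter: it assumes a single success probability, a flat prior on a grid of values, and two hypothetical batches of binary data, and it shows that using the first posterior as the prior for the second batch gives the same result as processing all the data at once.

```python
# Grid of candidate parameter values (midpoints of 1,000 equal cells on (0, 1)).
theta = [(k + 0.5) / 1000 for k in range(1000)]
flat_prior = [1.0 / len(theta)] * len(theta)

def update(prior, successes, failures):
    """Posterior on the grid: likelihood times prior, rescaled to sum to one."""
    unnormalized = [t**successes * (1 - t)**failures * p
                    for t, p in zip(theta, prior)]
    total = sum(unnormalized)          # plays the role of the dropped P(data)
    return [u / total for u in unnormalized]

# Hypothetical data: 7 successes in 10 trials, then 12 successes in 20 trials.
post_first = update(flat_prior, successes=7, failures=3)
post_second = update(post_first, successes=12, failures=8)   # old posterior is the new prior
post_pooled = update(flat_prior, successes=19, failures=11)  # all data in one step

# The Bayesian point estimate is the mean of the posterior.
posterior_mean = sum(t * p for t, p in zip(theta, post_second))
max_gap = max(abs(p - q) for p, q in zip(post_second, post_pooled))
print(posterior_mean, max_gap)
```

The rescaling step inside `update` is the proportionality constant described in the text: it does not depend on the parameter and only makes the posterior sum to one.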
16.2.2.a BAYESIAN ANALYSIS OF THE CLASSICAL REGRESSION MODEL

The complexity of the algebra involved in Bayesian analysis is often extremely burdensome. For the linear regression model, however, many fairly straightforward results have been obtained. To provide some of the flavor of the techniques, we present the full derivation only for some simple cases. In the interest of brevity, and to avoid the burden of excessive algebra, we refer the reader to one of the several sources that present the full derivation of the more complex cases.[2]

The classical normal regression model we have analyzed thus far is constructed around the conditional multivariate normal distribution $N[X\beta, \sigma^2 I]$. The interpretation is different here. In the sampling theory setting, this distribution embodies the information about the observed sample data given the assumed distribution and the fixed, albeit unknown, parameters of the model. In the Bayesian setting, this function summarizes the information that a particular realization of the data provides about the assumed distribution of the model parameters. To underscore that idea, we rename this joint density the likelihood for $\beta$ and $\sigma^2$ given the data, so

$$L(\beta, \sigma^2 \mid y, X) = [2\pi\sigma^2]^{-n/2} \, e^{-[1/(2\sigma^2)](y - X\beta)'(y - X\beta)}. \tag{16-3}$$

[2] These sources include Judge et al. (1982, 1985), Maddala (1977a), Mittelhammer et al. (2000), and the canonical reference for econometricians, Zellner (1971). Further topics in Bayesian inference are contained in Zellner (1985). A recent treatment of both Bayesian and sampling theory approaches is Poirier (1995).

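    For purposes of the results below, some reformulation is useful. Let $d = n - K$ (the degrees of freedom parameter), and substitute $y - X\beta = y - Xb - X(\beta - b) = e - X(\beta - b)$ in the exponent. Expanding this produces

    $$-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta) = -\frac{1}{2}\,d s^2\left(\frac{1}{\sigma^2}\right) - \frac{1}{2}(\beta - b)'\left(\frac{1}{\sigma^2}X'X\right)(\beta - b).$$

    After a bit of manipulation (note that $n/2 = d/2 + K/2$), the likelihood may be written

    $$L(\beta, \sigma^2 \mid y, X) = [2\pi]^{-d/2}[\sigma^2]^{-d/2}\, e^{-(d/2)(s^2/\sigma^2)}\,[2\pi]^{-K/2}[\sigma^2]^{-K/2}\, e^{-(1/2)(\beta - b)'[\sigma^2(X'X)^{-1}]^{-1}(\beta - b)}.$$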
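The expansion of the exponent rests on the identity $(y - X\beta)'(y - X\beta) = d s^2 + (\beta - b)'X'X(\beta - b)$, which holds because the least squares residuals are orthogonal to $X$. A minimal numerical check follows, assuming for simplicity a single regressor with no constant term (so $K = 1$) and made-up data; the function names are hypothetical.

```python
def ols_no_constant(x, y):
    """Least squares slope b, residual sum of squares e'e, and x'x for y = x*beta + e."""
    xx = sum(xi * xi for xi in x)
    b = sum(xi * yi for xi, yi in zip(x, y)) / xx
    ee = sum((yi - xi * b) ** 2 for xi, yi in zip(x, y))
    return b, ee, xx

def sum_sq(x, y, beta):
    """(y - x*beta)'(y - x*beta) for a trial value of beta."""
    return sum((yi - xi * beta) ** 2 for xi, yi in zip(x, y))

# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

b, ee, xx = ols_no_constant(x, y)
d = len(y) - 1        # d = n - K with K = 1
s2 = ee / d           # so that d * s2 equals e'e

beta = 0.7            # any trial parameter value
lhs = sum_sq(x, y, beta)
rhs = d * s2 + (beta - b) ** 2 * xx
```

The two sides agree for any trial `beta`, which is what allows the likelihood to be split into a part in $\sigma^2$ alone and a part that is quadratic in $\beta - b$.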



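    This density embodies all that we have to learn about the parameters from the observed data. Since the data are taken to be constants in the joint density, we may multiply this joint density by the (very carefully chosen), inessential (since it does not involve $\beta$ or $\sigma^2$) constant function of the observations,

    $$A = \frac{[(d/2)s^2]^{(d/2)+1}}{\Gamma[(d/2)+1]}\,[2\pi]^{d/2}\,|X'X|^{1/2}.$$

    For convenience, let $v = d/2$. Then, multiplying $L(\beta, \sigma^2 \mid y, X)$ by $A$ gives

    $$L(\beta, \sigma^2 \mid y, X) \propto \frac{[v s^2]^{v+1}}{\Gamma(v+1)}\left(\frac{1}{\sigma^2}\right)^{v} e^{-v s^2(1/\sigma^2)}\,[2\pi]^{-K/2}\,|\sigma^2(X'X)^{-1}|^{-1/2}\, e^{-(1/2)(\beta - b)'[\sigma^2(X'X)^{-1}]^{-1}(\beta - b)}. \tag{16-4}$$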

    The likelihood function is proportional to the product of a gamma density for $z = 1/\sigma^2$ with parameters $\lambda = v s^2$ and $P = v + 1$ [see (B-39); this is an inverted gamma distribution] and a K-variate normal density for $\beta \mid \sigma^2$ with mean vector $b$ and covariance matrix $\sigma^2(X'X)^{-1}$. The reason will be clear shortly.

    The departure point for the Bayesian analysis of the model is the specification of a prior distribution. This distribution gives the analyst's prior beliefs about the parameters of the model. One of two approaches is generally taken. If no prior information is known about the parameters, then we can specify a noninformative prior that reflects that. We do this by specifying a "flat" prior for the parameter in question: [3]

    $$g(\text{parameter}) \propto \text{constant}.$$

    [3] That this "improper" density might not integrate to one is only a minor difficulty. Any constant of integration would ultimately drop out of the final result. See Zellner (1971, pp. 41-53) for a discussion of noninformative priors.


    There are different ways that one might characterize the lack of prior information. The implication of a flat prior is that within the range of valid values for the parameter, all intervals of equal length (hence, in principle, all values) are equally likely. The second possibility, an "informative prior," is treated in the next section. The posterior density is the result of combining the likelihood function with the prior density. Since it pools the full set of information available to the analyst, once the data have been drawn, the posterior density would be interpreted the same way the prior density was before the data were obtained.

    To begin, we analyze the case in which $\sigma^2$ is assumed to be known. This assumption is obviously unrealistic, and we do so only to establish a point of departure. Using Bayes' theorem, we construct the posterior density,

    $$f(\beta \mid y, X, \sigma^2) = \frac{L(\beta \mid \sigma^2, y, X)\,g(\beta \mid \sigma^2)}{f(y)} \propto L(\beta \mid \sigma^2, y, X)\,g(\beta \mid \sigma^2),$$

    assuming that the distribution of $X$ does not depend on $\beta$ or $\sigma^2$. Since $g(\beta \mid \sigma^2) \propto$ a constant, this density is the one in (16-4). For now, write

    $$f(\beta \mid \sigma^2, y, X) \propto h(\sigma^2)\,[2\pi]^{-K/2}\,|\sigma^2(X'X)^{-1}|^{-1/2}\, e^{-(1/2)(\beta - b)'[\sigma^2(X'X)^{-1}]^{-1}(\beta - b)}, \tag{16-5}$$

    where

    $$h(\sigma^2) = \frac{[v s^2]^{v+1}}{\Gamma(v+1)}\left(\frac{1}{\sigma^2}\right)^{v} e^{-v s^2(1/\sigma^2)}. \tag{16-6}$$

    For the present, we treat $h(\sigma^2)$ simply as a constant that involves $\sigma^2$, not as a probability density; (16-5) is conditional on $\sigma^2$. Thus, the posterior density $f(\beta \mid \sigma^2, y, X)$ is proportional to a multivariate normal distribution with mean $b$ and covariance matrix $\sigma^2(X'X)^{-1}$. This result is familiar, but it is interpreted differently in this setting. First, we have combined our prior information about $\beta$ (in this case, no information) and the sample information to obtain a posterior distribution. Thus, on the basis of the sample data in hand, we obtain a distribution for $\beta$ with mean $b$ and covariance matrix $\sigma^2(X'X)^{-1}$. The result is dominated by the sample information, as it should be if there is no prior information. In the absence of any prior information, the mean of the posterior distribution, which is a type of Bayesian point estimate, is the sampling theory estimator.

    To generalize the preceding to an unknown $\sigma^2$, we specify a noninformative prior distribution for $\ln\sigma$ over the entire real line. [4] By the change of variable formula, if $g(\ln\sigma)$ is constant, then $g(\sigma^2)$ is proportional to $1/\sigma^2$. [5] Assuming that $\beta$ and $\sigma^2$ are independent, we now have the noninformative joint prior distribution:

    $$g(\beta, \sigma^2) = g_{\beta}(\beta)\,g_{\sigma^2}(\sigma^2) \propto \frac{1}{\sigma^2}.$$

    [4] See Zellner (1971) for justification of this prior distribution.

    [5] Many treatments of this model use $\sigma$ rather than $\sigma^2$ as the parameter of interest. The end results are identical. We have chosen this parameterization because it makes manipulation of the likelihood function with a gamma prior distribution especially convenient. See Zellner (1971, pp. 44-45) for discussion.
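The change of variable behind the noninformative prior for $\sigma^2$ can be written out in one line. This is a sketch under the stated assumption that the prior for $\theta = \ln\sigma$ is flat; the symbol $\theta$ is introduced here only for the derivation.

```latex
% Flat prior on theta = ln(sigma); note sigma^2 = e^{2 theta}, so theta = (1/2) ln(sigma^2).
\begin{aligned}
g_{\theta}(\theta) &\propto \text{constant}, \qquad \theta = \ln\sigma = \tfrac{1}{2}\ln\sigma^2, \\
g_{\sigma^2}(\sigma^2) &= g_{\theta}\!\left(\tfrac{1}{2}\ln\sigma^2\right)
   \left|\frac{d\theta}{d\sigma^2}\right|
   \propto \frac{1}{2\sigma^2} \propto \frac{1}{\sigma^2}.
\end{aligned}
```

The same argument applied to $\sigma$ itself gives $g(\sigma) \propto 1/\sigma$, which is the form used in treatments that take $\sigma$ as the parameter of interest.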


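    We can obtain the joint posterior distribution for $\beta$ and $\sigma^2$ by using

    $$f(\beta, \sigma^2 \mid y, X) \propto L(\beta \mid \sigma^2, y, X)\,g_{\sigma^2}(\sigma^2) \propto L(\beta \mid \sigma^2, y, X) \times \frac{1}{\sigma^2}. \tag{16-7}$$

    For the same reason as before, we multiply $g_{\sigma^2}(\sigma^2)$ by a well-chosen constant, this time $v s^2\,\Gamma(v+1)/\Gamma(v+2) = v s^2/(v+1)$. Multiplying (16-5) by this constant times $g_{\sigma^2}(\sigma^2)$ and inserting $h(\sigma^2)$ gives the joint posterior for $\beta$ and $\sigma^2$, given $y$ and $X$:

    $$f(\beta, \sigma^2 \mid y, X) \propto \frac{[v s^2]^{v+2}}{\Gamma(v+2)}\left(\frac{1}{\sigma^2}\right)^{v+1} e^{-v s^2(1/\sigma^2)}\,[2\pi]^{-K/2}\,|\sigma^2(X'X)^{-1}|^{-1/2}\, e^{-(1/2)(\beta - b)'[\sigma^2(X'X)^{-1}]^{-1}(\beta - b)}.$$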

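    To obtain the marginal posterior distribution for $\beta$, it is now necessary to integrate $\sigma^2$ out of the joint distribution (and vice versa to obtain the marginal distribution for $\sigma^2$). By collecting the terms, $f(\beta, \sigma^2 \mid y, X)$ can be written as

    $$f(\beta, \sigma^2 \mid y, X) \propto A \times \left(\frac{1}{\sigma^2}\right)^{P-1} e^{-\lambda(1/\sigma^2)},$$

    where

    $$A = \frac{[v s^2]^{v+2}}{\Gamma(v+2)}\,[2\pi]^{-K/2}\,|(X'X)^{-1}|^{-1/2},$$

    $$P = v + 2 + K/2 = (n - K)/2 + 2 + K/2 = (n + 4)/2,$$

    and

    $$\lambda = v s^2 + \tfrac{1}{2}(\beta - b)'X'X(\beta - b),$$

    so the marginal posterior distribution for $\beta$ is

    $$\int_0^\infty f(\beta, \sigma^2 \mid y, X)\,d\sigma^2 \propto A\int_0^\infty \left(\frac{1}{\sigma^2}\right)^{P-1} e^{-\lambda(1/\sigma^2)}\,d\sigma^2.$$

    To do the integration, we have to make a change of variable: $d(1/\sigma^2) = -(1/\sigma^2)^2\,d\sigma^2$, so $d\sigma^2 = -(1/\sigma^2)^{-2}\,d(1/\sigma^2)$. Making the substitution (the sign of the integral changes twice, once for the Jacobian and back again because the integral from $\sigma^2 = 0$ to $\infty$ is the negative of the integral from $1/\sigma^2 = 0$ to $\infty$), we obtain

    $$\int_0^\infty f(\beta, \sigma^2 \mid y, X)\,d\sigma^2 \propto A\int_0^\infty \left(\frac{1}{\sigma^2}\right)^{P-3} e^{-\lambda(1/\sigma^2)}\,d\!\left(\frac{1}{\sigma^2}\right) = A \times \frac{\Gamma(P-2)}{\lambda^{P-2}}.$$

    Reinserting the expressions for $A$, $P$, and $\lambda$ produces

    $$f(\beta \mid y, X) \propto \frac{\dfrac{[v s^2]^{v+2}\,\Gamma(v + K/2)}{\Gamma(v+2)}\,[2\pi]^{-K/2}\,|X'X|^{1/2}}{\left[v s^2 + \tfrac{1}{2}(\beta - b)'X'X(\beta - b)\right]^{v + K/2}}. \tag{16-8}$$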


    This density is proportional to a multivariate t distribution [6] and is a generalization of the familiar univariate distribution we have used at various points. This distribution has a degrees of freedom parameter, $d = n - K$, mean $b$, and covariance matrix $[d/(d-2)]\,s^2(X'X)^{-1}$. Each element of the K-element vector $\beta$ has a marginal distribution that is the univariate t distribution with degrees of freedom $n - K$, mean $b_k$, and variance equal to the kth diagonal element of the covariance matrix given earlier. Once again, this is the same as our sampling theory. The difference is a matter of interpretation. In the current context, the estimated distribution is for $\beta$ and is centered at $b$.
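The t result can be checked by simulation. The sketch below assumes a single regressor with no constant ($K = 1$) and made-up data; it relies on the fact, implied by the joint posterior above, that $1/\sigma^2$ given the data has a gamma distribution with shape $d/2$ and rate $d s^2/2$, and that $\beta$ given $\sigma^2$ is normal with mean $b$ and variance $\sigma^2/x'x$. Function names and the seed are hypothetical choices.

```python
import random

def posterior_summary(x, y):
    """Return (b, s2, d, xx) for the no-constant regression y = x*beta + e."""
    n = len(y)
    xx = sum(xi * xi for xi in x)
    b = sum(xi * yi for xi, yi in zip(x, y)) / xx
    d = n - 1                                   # d = n - K with K = 1
    s2 = sum((yi - xi * b) ** 2 for xi, yi in zip(x, y)) / d
    return b, s2, d, xx

def draw_posterior(b, s2, d, xx, n_draws, seed=0):
    """Draw beta from the joint posterior under the prior proportional to 1/sigma^2."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # 1/sigma^2 | y ~ Gamma(shape d/2, rate d*s2/2); gammavariate takes a scale.
        precision = rng.gammavariate(d / 2.0, 2.0 / (d * s2))
        sigma2 = 1.0 / precision
        # beta | sigma^2, y ~ N(b, sigma^2 / x'x)
        draws.append(rng.gauss(b, (sigma2 / xx) ** 0.5))
    return draws

# Made-up data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
y = [1.3, 1.7, 3.4, 3.9, 5.2, 5.8, 7.4, 7.7, 9.3, 9.9, 11.2, 11.8]

b, s2, d, xx = posterior_summary(x, y)
analytic_var = d / (d - 2.0) * s2 / xx          # variance of the marginal t posterior
draws = draw_posterior(b, s2, d, xx, n_draws=20000)
mc_mean = sum(draws) / len(draws)
mc_var = sum((v - mc_mean) ** 2 for v in draws) / len(draws)
```

The simulated mean and variance should be close to $b$ and $[d/(d-2)]s^2/x'x$, the scalar counterparts of the moments stated above.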
    16.2.2.b POINT ESTIMATION

    The posterior density function embodies the prior and the likelihood and therefore contains all the researcher's information about the parameters. But for purposes of presenting results, the density is somewhat imprecise, and one normally prefers a point or interval estimate. The natural approach would be to use the mean of the posterior distribution as the estimator. For the noninformative prior, we use $b$, the sampling theory estimator.

    One might ask at this point, why bother? These Bayesian point estimates are identical to the sampling theory estimates. All that has changed is our interpretation of the results. This situation is, however, exactly the way it should be. Remember that we entered the analysis with noninformative priors for $\beta$ and $\sigma^2$. Therefore, the only information brought to bear on estimation is the sample data, and it would be peculiar if anything other than the sampling theory estimates emerged at the end. The results do change when our prior brings out-of-sample information into the estimates, as we shall see below.

    The results will also change if we change our motivation for estimating $\beta$. The parameter estimates have been treated thus far as if they were an end in themselves. But in some settings, parameter estimates are obtained so as to enable the analyst to make a decision. Consider, then, a loss function, $H(\hat\beta, \beta)$, which quantifies the cost of basing a decision on an estimate $\hat\beta$ when the parameter is $\beta$. The expected, or average, loss is

    $$E_\beta[H(\hat\beta, \beta)] = \int_\beta H(\hat\beta, \beta)\,f(\beta \mid y, X)\,d\beta, \tag{16-9}$$

    where the weighting function is the marginal posterior density. (The joint density for $\beta$ and $\sigma^2$ would be used if the loss were defined over both.) The Bayesian point estimate is the parameter vector that minimizes the expected loss. If the loss function is a quadratic form in $(\hat\beta - \beta)$, then the mean of the posterior distribution is the "minimum expected loss" (MELO) estimator. The proof is simple. For this case,

    $$E[H(\hat\beta, \beta) \mid y, X] = E\left[\tfrac{1}{2}(\hat\beta - \beta)'W(\hat\beta - \beta) \mid y, X\right].$$

    To minimize this, we can use the result that

    $$\frac{\partial E[H(\hat\beta, \beta) \mid y, X]}{\partial\hat\beta} = E\left[\frac{\partial H(\hat\beta, \beta)}{\partial\hat\beta}\,\Big|\, y, X\right] = E[W(\hat\beta - \beta) \mid y, X].$$

    [6] See, for example, Judge et al. (1985) for details. The expression appears in Zellner (1971, p. 67). Note that the exponent in the denominator is $v + K/2 = n/2$.


    The minimum is found by equating this derivative to 0, whence, since $W$ is irrelevant, $\hat\beta = E[\beta \mid y, X]$. This kind of loss function would state that errors in the positive and negative direction are equally bad, and large errors are much worse than small errors. If the loss function were linear in the absolute error instead, then the MELO estimator would be the median of the posterior distribution. These results are the same in the case of the noninformative prior that we have just examined.
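The link between the loss function and the point estimate can be seen in a small sketch. It assumes a scalar parameter and a hypothetical, deliberately skewed set of equally weighted posterior draws, and it searches a grid of candidate estimates rather than solving the first-order condition analytically.

```python
def expected_loss(candidate, draws, loss):
    """Average loss of `candidate` over equally weighted posterior draws."""
    return sum(loss(candidate, beta) for beta in draws) / len(draws)

def quadratic(estimate, beta):
    """Quadratic loss with W = 1."""
    return 0.5 * (estimate - beta) ** 2

def absolute(estimate, beta):
    """Loss that is linear in the absolute error."""
    return abs(estimate - beta)

# Hypothetical skewed posterior draws for a scalar beta.
draws = [0.1, 0.2, 0.25, 0.4, 0.9, 1.6, 3.0]
step = 0.01
candidates = [i * step for i in range(0, 301)]

best_quadratic = min(candidates, key=lambda c: expected_loss(c, draws, quadratic))
best_absolute = min(candidates, key=lambda c: expected_loss(c, draws, absolute))

posterior_mean = sum(draws) / len(draws)
posterior_median = sorted(draws)[len(draws) // 2]
```

Up to the grid spacing, the quadratic-loss minimizer lands on the mean of the draws and the absolute-loss minimizer lands on their median; for a symmetric posterior such as the t distribution derived earlier the two coincide.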
    16.2.2.c INTERVAL ESTIMATION

    The counterpart to a confidence interval in this setting is an interval of the posterior distribution that contains a specified probability. Clearly, it is desirable to have this interval be as narrow as possible. For a unimodal density, this corresponds to an interval within which the density function is higher than any points outside it, which justifies the term highest posterior density (HPD) interval. For the case we have analyzed, which involves a symmetric distribution, we would form the HPD interval for $\beta$ around the least squares estimate $b$, with terminal values taken from the standard t tables.
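When the posterior is available only as a set of draws, an HPD interval can be approximated as the shortest interval that covers the required share of the sorted draws. The sketch below is one simple way to do this, not a method taken from the text; the gamma-distributed draws are a hypothetical skewed posterior chosen so that the HPD and equal-tailed intervals differ.

```python
import random

def hpd_interval(draws, prob=0.95):
    """Shortest interval containing a fraction `prob` of the sorted draws."""
    ordered = sorted(draws)
    n = len(ordered)
    k = int(round(prob * n))            # number of draws the interval must cover
    widths = [ordered[i + k - 1] - ordered[i] for i in range(n - k + 1)]
    start = min(range(len(widths)), key=widths.__getitem__)
    return ordered[start], ordered[start + k - 1]

def equal_tailed_interval(draws, prob=0.95):
    """Interval that leaves the same number of draws in each tail."""
    ordered = sorted(draws)
    n = len(ordered)
    k = int(round(prob * n))
    start = (n - k) // 2
    return ordered[start], ordered[start + k - 1]

rng = random.Random(1)
# Hypothetical skewed posterior draws, for illustration only.
draws = [rng.gammavariate(2.0, 1.0) for _ in range(5000)]
hpd_lo, hpd_hi = hpd_interval(draws)
eq_lo, eq_hi = equal_tailed_interval(draws)
```

By construction the HPD interval is never wider than the equal-tailed one. For a symmetric unimodal posterior the two essentially coincide, which is why the t tables suffice in the regression case.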
    16.2.2.d ESTIMATION WITH AN INFORMATIVE PRIOR DENSITY

    Once we leave the simple case of noninformative priors, matters become quite complicated, both at a practical level and, methodologically, in terms of just where the prior comes from. The integration of $\sigma^2$ out of the posterior in (16-5) is complicated by itself. It is made much more so if the prior distributions of $\beta$ and $\sigma^2$ are at all involved. Partly to offset these difficulties, researchers usually use what is called a conjugate prior, which is one that has the same form as the conditional density and is therefore amenable to the integration needed to obtain the marginal distributions. [7]

    Suppose that we assume that the prior beliefs about $\beta$ may be summarized in a K-variate normal distribution with mean $\beta_0$ and variance matrix $\Sigma_0$. Once again, it is illuminating to begin with the case in which $\sigma^2$ is assumed to be known. Proceeding in exactly the same fashion as before, we would obtain the following result: The posterior density of $\beta$ conditioned on $\sigma^2$ and the data will be normal with

    $$E[\beta \mid \sigma^2, y, X] = \left\{\Sigma_0^{-1} + [\sigma^2(X'X)^{-1}]^{-1}\right\}^{-1}\left\{\Sigma_0^{-1}\beta_0 + [\sigma^2(X'X)^{-1}]^{-1}b\right\} = F\beta_0 + (I - F)b, \tag{16-10}$$

    where

    $$F = \left\{\Sigma_0^{-1} + [\sigma^2(X'X)^{-1}]^{-1}\right\}^{-1}\Sigma_0^{-1} = \left\{[\text{prior variance}]^{-1} + [\text{conditional variance}]^{-1}\right\}^{-1}[\text{prior variance}]^{-1}.$$

    [7] Our choice of noninformative prior for $\ln\sigma$ led to a convenient prior for $\sigma^2$ in our derivation of the posterior for $\beta$. The idea that the prior can be specified arbitrarily in whatever form is mathematically convenient is very troubling; it is supposed to represent the accumulated prior belief about the parameter. On the other hand, it could be argued that the conjugate prior is the posterior of a previous analysis, which could justify its form. The issue of how priors should be specified is one of the focal points of the methodological debate. "Non-Bayesians" argue that it is disingenuous to claim the methodological high ground and then base the crucial prior density in a model purely on the basis of mathematical convenience. In a small sample, this assumed prior is going to dominate the results, whereas in a large one, the sampling theory estimates will dominate anyway.


    This vector is a matrix weighted average of the prior and the least squares (sample) coefficient estimates, where the weights are the inverses of the prior and the conditional covariance matrices. [8] The smaller the variance of the estimator, the larger its weight, which makes sense. Also, still taking $\sigma^2$ as known, we can write the variance of the posterior normal distribution as

    $$\mathrm{Var}[\beta \mid y, X, \sigma^2] = \left\{\Sigma_0^{-1} + [\sigma^2(X'X)^{-1}]^{-1}\right\}^{-1}. \tag{16-11}$$
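The matrix weighted average in (16-10) and the variance in (16-11) can be sketched for $K = 2$ with hand-written 2x2 matrix helpers. All numbers are hypothetical and were chosen so that the sample precision matrix has a large off-diagonal element; the helper names are not from the text.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(m, v):
    """Product of a 2x2 matrix and a 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

def add(m1, m2):
    """Elementwise sum of two 2x2 matrices."""
    return [[m1[i][j] + m2[i][j] for j in range(2)] for i in range(2)]

def posterior_normal(prior_mean, prior_var, b, cond_var):
    """Posterior mean and variance of beta with sigma^2 known, following (16-10) and (16-11)."""
    p0 = inv2(prior_var)          # prior precision
    p1 = inv2(cond_var)           # sample precision, [sigma^2 (X'X)^-1]^-1
    post_var = inv2(add(p0, p1))
    weighted = [u + v for u, v in zip(matvec(p0, prior_mean), matvec(p1, b))]
    return matvec(post_var, weighted), post_var

# Hypothetical inputs.
prior_mean = [0.0, 0.0]
prior_var = [[1.0, 0.0], [0.0, 1.0]]
b = [1.0, 0.0]
sample_precision = [[1.0, 0.9], [0.9, 1.0]]
cond_var = inv2(sample_precision)

post_mean, post_var = posterior_normal(prior_mean, prior_var, b, cond_var)
```

In this example the second element of the prior mean and of $b$ are both zero, yet the second element of the posterior mean is positive. Because the weights are matrices, individual elements of the posterior mean need not lie between the corresponding elements of the prior mean and $b$.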

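    Notice that the posterior variance combines the prior and conditional variances on the basis of their inverses. [9] We may interpret the noninformative prior as having infinite elements in $\Sigma_0$. This assumption would reduce this case to the earlier one.

    Once again, it is necessary to account for the unknown $\sigma^2$. If our prior over $\sigma^2$ is to be informative as well, then the resulting distribution can be extremely cumbersome. A conjugate prior for $\beta$ and $\sigma^2$ that can be used is

    $$g(\beta, \sigma^2) = g_{\beta\mid\sigma^2}(\beta \mid \sigma^2)\,g_{\sigma^2}(\sigma^2), \tag{16-12}$$

    where $g_{\beta\mid\sigma^2}(\beta \mid \sigma^2)$ is normal, with mean $\beta_0$ and variance $\sigma^2 A$, and

    $$g_{\sigma^2}(\sigma^2) = \frac{[m\sigma_0^2]^{m+1}}{\Gamma(m+1)}\left(\frac{1}{\sigma^2}\right)^{m} e^{-m\sigma_0^2(1/\sigma^2)}. \tag{16-13}$$

    This distribution is an inverted gamma distribution. It implies that $1/\sigma^2$ has a gamma distribution. The prior mean for $\sigma^2$ is $\sigma_0^2$ and the prior variance is $\sigma_0^4/(m-1)$. [10] The product in (16-12) produces what is called a normal-gamma prior, which is the natural conjugate prior for this form of the model. By integrating out $\sigma^2$, we would obtain the prior marginal for $\beta$ alone, which would be a multivariate t distribution. [11] Combining the prior in (16-12) with the likelihood in (16-3) produces the joint posterior distribution for $\beta$ and $\sigma^2$. Finally, the marginal posterior distribution for $\beta$ is obtained by integrating out $\sigma^2$. It has been shown that this posterior distribution is multivariate t with

    $$E[\beta \mid y, X] = \left\{[\bar\sigma^2 A]^{-1} + [\bar\sigma^2(X'X)^{-1}]^{-1}\right\}^{-1}\left\{[\bar\sigma^2 A]^{-1}\beta_0 + [\bar\sigma^2(X'X)^{-1}]^{-1}b\right\} \tag{16-14}$$

    and

    $$\mathrm{Var}[\beta \mid y, X] = \left(\frac{j}{j-2}\right)\left\{[\bar\sigma^2 A]^{-1} + [\bar\sigma^2(X'X)^{-1}]^{-1}\right\}^{-1}, \tag{16-15}$$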

    where $j$ is a degrees of freedom parameter and $\bar\sigma^2$ is the Bayesian estimate of $\sigma^2$. The prior degrees of freedom $m$ is a parameter of the prior distribution for $\sigma^2$ that would have been determined at the outset. (See the following example.) Once again, it is clear that as the amount of data increases, the posterior density, and the estimates thereof, converge to the sampling theory results.

    [8] Note that it will not follow that individual elements of the posterior mean vector lie between those of $\beta_0$ and $b$. See Judge et al. (1985, pp. 109-110) and Chamberlain and Leamer (1976).

    [9] Precisely this estimator was proposed by Theil and Goldberger (1961) as a way of combining a previously obtained estimate of a parameter and a current body of new data. They called their result a "mixed estimator." The term "mixed estimation" takes an entirely different meaning in the current literature, as we will see in Chapter 17.

    [10] You can show this result by using gamma integrals. Note that the density is a function of $1/\sigma^2 = 1/x$ in the formula of (B-39), so to obtain $E[\sigma^2]$, we use the analog of $E[1/x] = \lambda/(P-1)$ and $E[(1/x)^2] = \lambda^2/[(P-1)(P-2)]$. In the density for $(1/\sigma^2)$, the counterparts to $\lambda$ and $P$ are $m\sigma_0^2$ and $m+1$.

    [11] Full details of this lengthy derivation appear in Judge et al. (1985, pp. 106-110) and Zellner (1971).

    TABLE 16.1 Estimates of the MPC

    Years        Estimated MPC   Variance of b   Degrees of Freedom   Estimated σ
    1940-1950    0.6848014       0.061878        9                    24.954
    1950-2000    0.92481         0.000065865     49                   92.244
    Example 16.3 Bayesian Estimate of the Marginal Propensity to Consume

    In Example 3.2, an estimate of the marginal propensity to consume is obtained using 11 observations from 1940 to 1950, with the results shown in the top row of Table 16.1. A classical 95 percent confidence interval for $\beta$ based on these estimates is (0.8780, 1.2818). (The very wide interval probably results from the obviously poor specification of the model.) Based on noninformative priors for $\beta$ and $\sigma^2$, we would estimate the posterior density for $\beta$ to be univariate t with 9 degrees of freedom, with mean 0.6848014 and variance (11/9)0.061878 = 0.075628. An HPD interval for $\beta$ would coincide with the confidence interval. Using the fourth quarter (yearly) values of the 1950-2000 data used in Example 6.3, we obtain the new estimates that appear in the second row of the table.

    We take the first estimate and its estimated distribution as our prior for $\beta$ and obtain a posterior density for $\beta$ based on an informative prior instead. We assume for this exercise that $\sigma^2$ may be taken as known at the sample value of 29.954. Then,

    $$\bar{b} = \left[\frac{1}{0.000065865} + \frac{1}{0.061878}\right]^{-1}\left[\frac{0.92481}{0.000065865} + \frac{0.6848014}{0.061878}\right] = 0.92455.$$

    The weighted average is overwhelmingly dominated by the far more precise sample estimate from the larger sample. The posterior variance is the inverse in brackets, which is 0.000071164. This is close to the variance of the latter estimate. An HPD interval can be formed in the familiar fashion. It will be slightly narrower than the confidence interval, since the variance of the posterior distribution is slightly smaller than the variance of the sampling estimator. This reduction is the value of the prior information. (As we see here, the prior is not particularly informative.)
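The weighted average in this example is easy to reproduce. The sketch below is a scalar version of the precision-weighted formula, using the two rows of Table 16.1 as inputs; the function name is hypothetical.

```python
def combine(prior_mean, prior_var, b, sample_var):
    """Precision-weighted average of a prior mean and a sample estimate (scalar case)."""
    precision = 1.0 / prior_var + 1.0 / sample_var
    mean = (prior_mean / prior_var + b / sample_var) / precision
    return mean, 1.0 / precision

# Prior from the 1940-1950 row of Table 16.1, sample estimate from the 1950-2000 row.
post_mean, post_var = combine(0.6848014, 0.061878, 0.92481, 0.000065865)
```

With these inputs the posterior mean matches the 0.92455 reported in the example. The bracketed inverse evaluates to roughly 0.0000658, just below the sample variance of 0.000065865, which agrees with the statement that the posterior variance is slightly smaller; the printed value of 0.000071164 may reflect different inputs or rounding and should be checked against the original source.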
    16.2.2.e HYPOTHESIS TESTING

    The Bayesian methodology treats the classical approach to hypothesis testing with a large amount of skepticism. Two issues are especially problematic. First, a close examination of only the work we have done in Chapter 6 will show that because we are using consistent estimators, with a large enough sample, we will ultimately reject any (nested) hypothesis unless we adjust the significance level of the test downward as the sample size increases. Second, the all-or-nothing approach of either rejecting or not rejecting a hypothesis provides no method of simply sharpening our beliefs. Even the most committed of analysts might be reluctant to discard a strongly held prior based on a single sample of data, yet this is what the sampling methodology mandates. (Note, for example, the uncomfortable dilemma this creates in footnote 24 in Chapter 14.) The Bayesian approach to hypothesis testing is much more appealing in this regard. Indeed, the approach might be more appropriately called "comparing hypotheses," since it essentially involves only making an assessment of which of two hypotheses has a higher probability of being correct.


    The Bayesian approach to hypothesis testing bears large similarity to Bayesian estimation. [12] We have formulated two hypotheses, a "null," denoted $H_0$, and an alternative, denoted $H_1$. These need not be complementary, as in $H_0$: "statement A is true" versus $H_1$: "statement A is not true," since the intent of the procedure is not to reject one hypothesis in favor of the other. For simplicity, however, we will confine our attention to hypotheses about the parameters in the regression model, which often are complementary. Assume that before we begin our experimentation (data gathering, statistical analysis) we are able to assign prior probabilities $P(H_0)$ and $P(H_1)$ to the two hypotheses. The prior odds ratio is simply the ratio

    $$\text{Odds}_{\text{prior}} = \frac{P(H_0)}{P(H_1)}. \tag{16-16}$$

    For example, one's uncertainty about the sign of a parameter might be summarized in a prior odds over $H_0$: $\beta \ge 0$ versus $H_1$: $\beta < 0$ of 0.5/0.5 = 1. After the sample evidence is gathered, the prior will be modified, so the posterior is, in general,

    $$\text{Odds}_{\text{posterior}} = B_{01} \times \text{Odds}_{\text{prior}}.$$

    The value $B_{01}$ is called the Bayes factor for comparing the two hypotheses. It summarizes the effect of the sample data on the prior odds. The end result, $\text{Odds}_{\text{posterior}}$, is a new odds ratio that can be carried forward as the prior in a subsequent analysis.

    The Bayes factor is computed by assessing the likelihoods of the data observed under the two hypotheses. We return to our first departure point, the likelihood of the data, given the parameters:

    $$f(y \mid \beta, \sigma^2, X) = [2\pi\sigma^2]^{-n/2}\, e^{-[1/(2\sigma^2)](y - X\beta)'(y - X\beta)}. \tag{16-17}$$

    Based on our priors for the parameters, the expected, or average, likelihood, assuming that hypothesis $j$ is true ($j = 0, 1$), is

    $$f(y \mid X, H_j) = E_{\beta,\sigma^2}[f(y \mid \beta, \sigma^2, X, H_j)] = \int_{\sigma^2}\int_{\beta} f(y \mid \beta, \sigma^2, X, H_j)\,g(\beta, \sigma^2)\,d\beta\,d\sigma^2.$$

    (This conditional density is also the predictive density for $y$.) Therefore, based on the observed data, we use Bayes' theorem to reassess the probability of $H_j$; the posterior probability is

    $$P(H_j \mid y, X) = \frac{f(y \mid X, H_j)\,P(H_j)}{f(y)}.$$

    The posterior odds ratio is $P(H_0 \mid y, X)/P(H_1 \mid y, X)$, so the Bayes factor is

    $$B_{01} = \frac{f(y \mid X, H_0)}{f(y \mid X, H_1)}.$$

    Example 16.4 Posterior Odds for the Classical Regression Model

    Zellner (1971) analyzes the setting in which there are two possible explanations for the variation in a dependent variable $y$:

    $$\text{Model 0: } y = x_0'\beta_0 + \varepsilon_0$$

    and

    $$\text{Model 1: } y = x_1'\beta_1 + \varepsilon_1.$$

    [12] For extensive discussion, see Zellner and Siow (1980) and Zellner (1985, pp. 275-305).


    We will briefly sketch his results. We form informative priors for $[\beta, \sigma^2]_j$, $j = 0, 1$, as specified in (16-12) and (16-13), that is, multivariate normal and inverted gamma, respectively. Zellner then derives the Bayes factor for the posterior odds ratio. The derivation is lengthy and complicated, but for large $n$, with some simplifying assumptions, a useful formulation emerges. First, assume that the priors for $\sigma_0^2$ and $\sigma_1^2$ are the same. Second, assume that

    $$\frac{|A_0^{-1}|\,/\,|A_0^{-1} + X_0'X_0|}{|A_1^{-1}|\,/\,|A_1^{-1} + X_1'X_1|} = 1.$$

    The first of these would be the usual situation, in which the uncertainty concerns the covariation between $y_i$ and $x_i$, not the amount of residual variation (lack of fit). The second concerns the relative amounts of information in the prior ($A$) versus the likelihood ($X'X$). These matrices are the inverses of the covariance matrices, or the precision matrices. [Note how these two matrices form the matrix weights in the computation of the posterior mean in (16-10).] Zellner (p. 310) discusses this assumption at some length. With these two assumptions, he shows that as $n$ grows large, [13]

    $$B_{01} \approx \left(\frac{s_0^2}{s_1^2}\right)^{-(n+m)/2} = \left(\frac{1 - R_0^2}{1 - R_1^2}\right)^{-(n+m)/2}.$$

    Therefore, the result favors the model that provides the better fit using $R^2$ as the fit measure. If we stretch Zellner's analysis a bit by interpreting model 1 as "the model" and model 0 as "no model" (i.e., the relevant part of $\beta_0 = 0$, so $R_0^2 = 0$), then the ratio simplifies to

    $$B_{01} = (1 - R_1^2)^{(n+m)/2}.$$

    Thus, the better the fit of the regression, the lower the Bayes factor in favor of model 0 (no model), which makes intuitive sense.

    Zellner and Siow (1980) have continued this analysis with noninformative priors for $\beta$ and $\sigma_j^2$. Specifically, they use the flat prior for $\ln\sigma$ [see (16-7)] and a multivariate Cauchy prior (which has infinite variances) for $\beta$. Their main result (3.10) is

    $$B_{01} = \frac{\tfrac{1}{2}\sqrt{\pi}}{\Gamma[(k+1)/2]}\left(\frac{n-K}{2}\right)^{k/2}(1 - R^2)^{(n-K-1)/2}.$$

    This result is very much like the previous one, with some slight differences due to degrees of freedom corrections and the several approximations used to reach the first one.
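The large-sample Bayes factor for "no model" against "the model" and the odds update are simple enough to compute directly. The sketch below works in logs to avoid underflow when $n$ is large; the sample size, prior degrees of freedom, and $R^2$ values are hypothetical inputs, and the function names are not from the text.

```python
import math

def log_bayes_factor_no_model(r_squared, n, m):
    """Log of the large-sample B01 = (1 - R^2)^((n + m)/2), 'no model' versus 'the model'."""
    return 0.5 * (n + m) * math.log(1.0 - r_squared)

def posterior_odds(prior_odds, log_b01):
    """Posterior odds in favor of H0: Bayes factor times prior odds."""
    return math.exp(log_b01) * prior_odds

# Hypothetical inputs: n = 50 observations, m = 5, even prior odds.
odds_weak_fit = posterior_odds(1.0, log_bayes_factor_no_model(0.02, 50, 5))
odds_good_fit = posterior_odds(1.0, log_bayes_factor_no_model(0.40, 50, 5))
odds_no_fit = posterior_odds(1.0, log_bayes_factor_no_model(0.0, 50, 5))
```

With even prior odds, a regression with no explanatory power leaves the odds at 1, while a better fit drives the posterior odds in favor of "no model" toward zero. The resulting odds can then serve as the prior odds in a subsequent analysis.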
    16.2.3 USING BAYES THEOREM IN A CLASSICAL ESTIMATION PROBLEM: THE LATENT CLASS MODEL

    Latent class modeling can be viewed as a means of modeling heterogeneity across individuals in a random parameters framework. We first encountered random parameters models in Section 13.8 in connection with panel data. [14] As we shall see, the latent class model provides an interesting hybrid of classical and Bayesian analysis. To define the latent class model, we begin with a random parameters formulation of the density of an observed random variable. We will assume that the data are a panel. Thus, the density of $y_{it}$ when the parameter vector is $\beta_i$ is $f(y_{it} \mid x_{it}, \beta_i)$. The parameter vector $\beta_i$ is randomly distributed over individuals according to

    $$\beta_i = \beta + \Delta z_i + v_i, \tag{16-18}$$

    where $\beta + \Delta z_i$ is the mean of the distribution, which depends on time invariant individual characteristics as well as parameters yet to be estimated, and the random variation comes from the individual heterogeneity, $v_i$. This random vector is assumed to have mean zero and covariance matrix $\Sigma$. The conditional density of the parameters is

    $$g(\beta_i \mid z_i, \beta, \Delta, \Sigma) = g(v_i + \beta + \Delta z_i, \Sigma),$$

    where $g(\cdot)$ is the underlying marginal density of the heterogeneity. The unconditional density for $y_{it}$ is obtained by integrating over $v_i$,

    $$f(y_{it} \mid x_{it}, z_i, \beta, \Delta, \Sigma) = E_{\beta_i}[f(y_{it} \mid x_{it}, \beta_i)] = \int_{v_i} f(y_{it} \mid x_{it}, \beta_i)\,g(v_i + \beta + \Delta z_i, \Sigma)\,dv_i.$$

    [13] A ratio of exponentials that appears in Zellner's result (his equation 10.50) is omitted. To the order of approximation in the result, this ratio vanishes from the final result. (Personal correspondence from A. Zellner to the author.)

    [14] In principle, the latent class model does not require panel data, but practical experience suggests that it does work best when individuals are observed more than once and is difficult to implement in a cross section.
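Integrating the heterogeneity out of the conditional density can be sketched numerically. The example below makes several simplifying assumptions that are not in the text: a scalar $\beta_i = \bar\beta + v_i$ with normally distributed $v_i$, a normal conditional density for $y$ with unit variance, and a midpoint-rule quadrature. In this special case the integral has a closed form, which gives something to compare against.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def unconditional_density(y, x, mean_beta, sd_v, n_points=4001, width=8.0):
    """Integrate f(y | x, beta_i) g(v_i) over v_i with the midpoint rule.

    Assumes beta_i = mean_beta + v_i, v_i ~ N(0, sd_v^2), and
    y | x, beta_i ~ N(x * beta_i, 1); both are illustrative choices.
    """
    lower = -width * sd_v
    h = 2.0 * width * sd_v / n_points
    total = 0.0
    for k in range(n_points):
        v = lower + (k + 0.5) * h
        total += normal_pdf(y, x * (mean_beta + v), 1.0) * normal_pdf(v, 0.0, sd_v) * h
    return total

numeric = unconditional_density(y=1.5, x=2.0, mean_beta=0.5, sd_v=0.3)
# Closed form for this normal-normal case: y ~ N(x * mean_beta, 1 + x^2 * sd_v^2).
closed_form = normal_pdf(1.5, 2.0 * 0.5, math.sqrt(1.0 + (2.0 * 0.3) ** 2))
```

For models without a closed form, such as the probit model used later in this section, the same expectation is typically approximated by quadrature or simulation.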

    The unconditional density would provide the density that would enter the likelihood function for estimation of the model parameters. We will return to this model formulation in Chapter 17. The preceding has assumed $\beta_i$ has a continuous distribution. Suppose that $\beta_i$ is generated from a discrete distribution with $J$ values, or classes, so that the distribution of $\beta$ is over these $J$ vectors. [15] Thus, the model states that an individual belongs to one of the $J$ latent classes, but it is unknown from the sample data exactly which one. (We will use the sample data to estimate the probabilities of class membership.) The corresponding model formulation is now

    $$f(y_{it} \mid x_{it}, z_i, \Delta, \beta_1, \ldots, \beta_J) = \sum_{j=1}^{J} p_{ij}(\Delta, z_i)\,f(y_{it} \mid x_{it}, \beta_j),$$

    where it remains to parameterize the class probabilities, $p_{ij}$, and the structural model, $f(y_{it} \mid x_{it}, \beta_j)$. The matrix $\Delta$ contains the parameters of the discrete distribution. It has $J$ rows (one for each class) and $M$ columns for the $M$ variables in $z_i$. (The structural mean and variance parameters $\beta$ and $\Sigma$ are no longer necessary.) At a minimum, $M = 1$ and $z_i$ contains a constant, if the class probabilities are fixed parameters. Finally, in order to accommodate the panel data nature of the sampling situation, we suppose that conditioned on $\beta_j$, observations $y_{it}$, $t = 1, \ldots, T$ are independent. Therefore, for a group of $T$ observations, the joint density is

    $$f(y_{i1}, y_{i2}, \ldots, y_{iT} \mid \beta_j, x_{i1}, x_{i2}, \ldots, x_{iT}) = \prod_{t=1}^{T} f(y_{it} \mid x_{it}, \beta_j).$$

    (We will consider models that provide correlation across observations in Chapters 17 and 21.) Inserting this result in the earlier density produces the likelihood function for a panel of data,

    $$\ln L = \sum_{i=1}^{n} \ln\left[\sum_{j=1}^{J} p_{ij}(\Delta, z_i)\prod_{t=1}^{T} f(y_{it} \mid x_{it}, \beta_j)\right].$$

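    The class probabilities must be constrained to sum to 1. A simple approach is to reparameterize them as a set of logit probabilities,

    $$p_{ij} = \frac{e^{\theta_{ij}}}{\sum_{j=1}^{J} e^{\theta_{ij}}}, \quad j = 1, \ldots, J, \quad \theta_{iJ} = 0, \quad \theta_{ij} = \delta_j'z_i \ (\delta_J = 0). \tag{16-19}$$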
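The logit class probabilities and the panel log likelihood can be sketched together. The example assumes a probit density for a binary outcome as the structural model, two classes, a constant-only $z_i$ (so $M = 1$), and made-up data and parameter values; all function names are hypothetical.

```python
import math

def class_probabilities(deltas, z):
    """Logit class probabilities p_ij; the last row of `deltas` is zero for identification."""
    thetas = [sum(d_m * z_m for d_m, z_m in zip(delta, z)) for delta in deltas]
    top = max(thetas)                              # subtract the max for numerical stability
    weights = [math.exp(t - top) for t in thetas]
    total = sum(weights)
    return [w / total for w in weights]

def probit_density(y, x, beta):
    """Probit density Phi[(2y - 1) x'beta] for a binary outcome y."""
    index = (2 * y - 1) * sum(b * xk for b, xk in zip(beta, x))
    return 0.5 * (1.0 + math.erf(index / math.sqrt(2.0)))

def log_likelihood(panel, betas, deltas):
    """Sum over individuals of ln sum_j p_ij prod_t f(y_it | x_it, beta_j)."""
    total = 0.0
    for ys, xs, z in panel:
        probs = class_probabilities(deltas, z)
        mixture = 0.0
        for p, beta in zip(probs, betas):
            joint = 1.0
            for y, x in zip(ys, xs):
                joint *= probit_density(y, x, beta)
            mixture += p * joint
        total += math.log(mixture)
    return total

# Hypothetical two-class setup with two regressors (constant and one variable).
betas = [[-0.5, 1.0], [0.5, -1.0]]
deltas = [[0.4], [0.0]]                            # delta_J = 0
panel = [
    ([1, 1, 0], [[1.0, 0.8], [1.0, 1.1], [1.0, 0.2]], [1.0]),
    ([0, 0, 1], [[1.0, 0.9], [1.0, 1.3], [1.0, -0.4]], [1.0]),
]
probs = class_probabilities(deltas, [1.0])
shifted = class_probabilities([[2.4], [2.0]], [1.0])   # same constant added to every delta_j
ll = log_likelihood(panel, betas, deltas)
```

Adding the same vector to every $\delta_j$ leaves the probabilities unchanged, which is why one of them is normalized to zero. In practice the log likelihood would be handed to a numerical optimizer.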

    (See Section 21.8 for development of this model for a set of probabilities.) Note the restriction on $\theta_{iJ}$. This is an identification restriction. Without it, the same set of probabilities will arise if an arbitrary vector is added to every $\delta_j$. The resulting log likelihood is a continuous function of the parameters $\beta_1, \ldots, \beta_J$ and $\delta_1, \ldots, \delta_J$. For all its apparent complexity, estimation of this model by direct maximization of the log likelihood is not especially difficult. [See Section E.5 and Greene (2001).] The number of classes that can be identified is likely to be relatively small (on the order of five or less), however, which is viewed as a drawback of this approach, and, in general (as might be expected), the less rich is the panel data set in terms of cross group variation, the more difficult it is to estimate this model.

    Estimation produces values for the structural parameters, $(\beta_j, \delta_j)$, $j = 1, \ldots, J$. With these in hand, we can compute the prior class probabilities, $p_{ij}$, using (16-19). For prediction purposes, one might be more interested in the posterior (on the data) class probabilities, which we can compute using Bayes' theorem as

    $$\text{Prob}(\text{class } j \mid \text{observation } i) = \frac{f(\text{observation } i \mid \text{class } j)\,\text{Prob}(\text{class } j)}{\sum_{j=1}^{J} f(\text{observation } i \mid \text{class } j)\,\text{Prob}(\text{class } j)} = \frac{f(y_{i1}, \ldots, y_{iT} \mid x_{i1}, \ldots, x_{iT}, \beta_j)\,p_{ij}(\Delta, z_i)}{\sum_{j=1}^{J} f(y_{i1}, \ldots, y_{iT} \mid x_{i1}, \ldots, x_{iT}, \beta_j)\,p_{ij}(\Delta, z_i)} = w_{ij}.$$

    This set of probabilities, $w_i = (w_{i1}, w_{i2}, \ldots, w_{iJ})$, gives the posterior density over the distribution of values of $\beta$, that is, $[\beta_1, \beta_2, \ldots, \beta_J]$. The Bayesian estimator of the (individual specific) parameter vector would be the posterior mean

    $$\hat\beta_i^{p} = \hat{E}_j[\beta_j \mid \text{observation } i] = \sum_{j=1}^{J} w_{ij}\hat\beta_j.$$

    [15] One can view this as a discrete approximation to the continuous distribution. This is also an extension of Heckman and Singer's (1984b) model of latent heterogeneity, but the interpretation is a bit different here.

    Example 16.5 Applications of the Latent Class Model

    The latent class formulation has provided an attractive platform for modeling latent heterogeneity. (See Greene (2001) for a survey.) For two examples, Nagin and Land (1993) employed the model to study age transitions through stages of criminal careers, and Wang et al. (1998) and Wedel et al. (1993) used the Poisson regression model to study counts of patents. To illustrate the estimator, we will apply the latent class model to the panel data binary choice application of firm product innovations studied by Bertschek and Lechner (1998). [16] They analyzed the dependent variable

    $y_{it}$ = 1 if firm $i$ realized a product innovation in year $t$ and 0 if not.

    Thus, this is a binary choice model. (See Section 21.2 for analysis of binary choice models.) The sample consists of 1270 German manufacturing firms observed for five years, 1984-1988. Independent variables in the model that we formulated were

    $x_{it1}$ = constant,
    $x_{it2}$ = log of sales,
    $x_{it3}$ = relative size = ratio of employment in business unit to employment in the industry,
    $x_{it4}$ = ratio of industry imports to (industry sales + imports),
    $x_{it5}$ = ratio of industry foreign direct investment to (industry sales + imports),
    $x_{it6}$ = productivity = ratio of industry value added to industry employment,
    $x_{it7}$ = dummy variable indicating firm is in the raw materials sector,
    $x_{it8}$ = dummy variable indicating firm is in the investment goods sector.

    Discussion of the data set may be found in the article (pp. 331-332 and 370). Our central model for the binary outcome is a probit model,

    $$f(y_{it} \mid x_{it}, \beta_j) = \text{Prob}[y_{it} \mid x_{it}'\beta_j] = \Phi[(2y_{it} - 1)x_{it}'\beta_j], \quad y_{it} = 0, 1.$$

    [16] We are grateful to the authors of this study who have generously loaned us their data for this analysis. The data are proprietary and cannot be made publicly available as are the other data sets used in our examples.

    TABLE 16.2 Estimated Latent Class Model
    (Second figure in each cell, in parentheses, is the standard error, or the standard deviation for posterior entries. Minus signs are not preserved in this transcript, so coefficient entries show magnitudes only.)

    Variable                  Probit          Class 1          Class 2          Class 3          Posterior
    Constant                  1.96 (0.23)     2.32 (0.59)      2.71 (0.69)      8.97 (2.20)      3.38 (2.14)
    lnSales                   0.18 (0.022)    0.32 (0.061)     0.23 (0.072)     0.57 (0.18)      0.34 (0.09)
    Rel. Size                 1.07 (0.14)     4.38 (0.89)      0.72 (0.37)      1.42 (0.76)      2.58 (1.30)
    Import                    1.13 (0.15)     0.94 (0.37)      2.26 (0.53)      3.12 (1.38)      1.81 (0.74)
    FDI                       2.85 (0.40)     2.20 (1.16)      2.81 (1.11)      8.37 (1.93)      3.63 (1.98)
    Prod.                     2.34 (0.72)     5.86 (2.70)      7.70 (4.69)      0.91 (6.76)      5.48 (1.78)
    RawMtls                   0.28 (0.081)    0.11 (0.24)      0.60 (0.42)      0.86 (0.70)      0.08 (0.37)
    Invest.                   0.19 (0.039)    0.13 (0.11)      0.41 (0.12)      0.47 (0.26)      0.29 (0.13)
    ln L                      4114.05                          3503.55
    Class Prob. (Prior)                       0.469 (0.0352)   0.331 (0.0333)   0.200 (0.0246)
    Class Prob. (Posterior)                   0.469 (0.394)    0.331 (0.289)    0.200 (0.325)
    Pred. Count                               649              366              255

    This is the speci cation used by the authors We have retained it so we can compare the results of the various models We also t a model with year speci c dummy variables instead of a single constant and with the industry sector dummy variables moved to the latent class probability equation See Greene 2002 for analysis of the different speci cations Estimates of the model parameters are presented in Table 16 2 The probit coef cients in the rst column are those presented by Bertschek and Lechner 17 The class speci c parameter estimates cannot be compared directly as the models are quite different The estimated posterior mean shown which is comparable to the one class results is the sample average and standard deviation of the 1 270 rm speci c posterior mean parameter vectors They differ considerably from the probit model but in each case a con dence interval around the posterior mean contains the probit estimator Finally the identical prior and average of the sample posterior class probabilities are shown at the bottom of the table The much larger empirical standard deviations re ect that the posterior estimates are based on aggregating the sample data and involve as well complicated functions of all the model parameters The estimated numbers of class members are computed by assigning to each rm the predicted
    [17] The authors used the robust "sandwich" estimator for the standard errors (see Section 17.9) rather than the conventional negative inverse of the Hessian.


    class associated with the highest posterior class probability. Finally, to explore the difference between the probit model and the latent class model, we have computed the probability of a product innovation at the five-year mean of the independent variables for each firm using the probit estimates and the firm-specific posterior mean estimated coefficient vector. The two kernel density estimates shown in Figures 16.1 and 16.2 (see Section 16.4.1) show the effect of allowing the greater between-firm variation in the coefficient vectors.
    FIGURE 16.1 Probit Probabilities. [Kernel density estimate for PPR; horizontal axis: PPR, vertical axis: Density.]

    FIGURE 16.2 Latent Class Probabilities. [Kernel density estimate for PLC; horizontal axis: PLC, vertical axis: Density.]


    16.2.4 HIERARCHICAL BAYES ESTIMATION OF A RANDOM PARAMETERS MODEL BY MARKOV CHAIN MONTE CARLO SIMULATION

    We now consider a Bayesian approach to estimation of the random parameters model in (16-19). For an individual $i$, the conditional density for the dependent variable in period $t$ is $f(y_{it} \mid x_{it}, \beta_i)$, where $\beta_i$ is the individual-specific $K \times 1$ parameter vector and $x_{it}$ is individual-specific data that enter the probability density.[18] For the sequence of $T$ observations, assuming conditional (on $\beta_i$) independence, person $i$'s contribution to the likelihood for the sample is

    $f(y_i \mid X_i, \beta_i) = \prod_{t=1}^{T} f(y_{it} \mid x_{it}, \beta_i), \qquad (16\text{-}20)$

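    where $y_i = (y_{i1}, \ldots, y_{iT})$ and $X_i = [x_{i1}, \ldots, x_{iT}]$. We will suppose that $\beta_i$ is distributed normally with mean $\beta$ and covariance matrix $\Sigma$. (This is the "hierarchical" aspect of the model.) The unconditional density would be the expected value over the possible values of $\beta_i$:

    $f(y_i \mid X_i, \beta, \Sigma) = \int_{\beta_i} \prod_{t=1}^{T} f(y_{it} \mid x_{it}, \beta_i)\, \phi_K[\beta_i \mid \beta, \Sigma]\, d\beta_i, \qquad (16\text{-}21)$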

    where $\phi_K[\beta_i \mid \beta, \Sigma]$ denotes the $K$-variate normal prior density for $\beta_i$ given $\beta$ and $\Sigma$. Maximum likelihood estimation of this model, which entails estimation of the "deep" parameters, $\beta$ and $\Sigma$, then estimation of the individual-specific parameters, $\beta_i$, using the same method we used for the latent class model, is considered in Section 17.8. For now, we consider the Bayesian approach to estimation of the parameters of this model. To approach this from a Bayesian viewpoint, we will assign noninformative prior densities to $\beta$ and $\Sigma$. As is conventional, we assign a flat (noninformative) prior to $\beta$. The variance parameters are more involved. If it is assumed that the elements of $\beta_i$ are conditionally independent, then each element of the (now) diagonal matrix $\Sigma$ may be assigned the inverted gamma prior that we used in (16-14). A full matrix $\Sigma$ is handled by assigning to $\Sigma$ an inverted Wishart prior density with parameters scalar $K$ and matrix $K \times I$. [The Wishart density is a multivariate counterpart to the chi-squared distribution. Discussion may be found in Zellner (1971, pp. 389-394).] This produces the joint posterior density,
    $\Lambda(\beta_1, \ldots, \beta_n, \beta, \Sigma \mid \text{all data}) = \left\{ \prod_{i=1}^{n} \prod_{t=1}^{T} f(y_{it} \mid x_{it}, \beta_i)\, \phi_K[\beta_i \mid \beta, \Sigma] \right\} \times p(\beta, \Sigma). \qquad (16\text{-}22)$

    This gives the joint density of all the unknown parameters conditioned on the observed data. Our Bayesian estimators of the parameters will be the posterior means for these $(n + 1)K + K(K + 1)/2$ parameters. In principle, this requires integration of (16-22) with respect to the components. As one might guess at this point, that integration is hopelessly complex and not remotely feasible. It is at this point that the recently
    [18] In order to avoid a layer of complication, we will embed the time-invariant effect $\Delta z_i$ in $x_{it}'\beta$. A full treatment in the same fashion as the latent class model would be substantially more complicated in this setting (though it is quite straightforward in the maximum simulated likelihood approach discussed in Section 17.8).


    developed techniques of Markov Chain Monte Carlo (MCMC) simulation estimation and the Metropolis-Hastings algorithm enter and enable us to do the estimation in a remarkably simple fashion. The MCMC procedure makes use of a result that we have employed at many points in the preceding chapters. The joint density in (16-22) is exceedingly complex, and brute force integration is not feasible. Suppose, however, that we could draw random samples of $[\beta_1, \ldots, \beta_n, \beta, \Sigma]$ from this population. Then, sample statistics such as means computed from these random draws would converge to the moments of the underlying population. The laws of large numbers discussed in Appendix D would apply. That partially solves the problem. The distribution remains as complex as before, however, so how to draw the sample remains to be solved. The Gibbs sampler and the Metropolis-Hastings algorithm can be used for sampling from the (hopelessly complex) joint density, $\Lambda(\beta_1, \ldots, \beta_n, \beta, \Sigma \mid \text{all data})$. The basic principle of the Gibbs sampler is described in Section E2.6. The core result is as follows: For a two-variable case, $f(x, y)$, in which $f(x \mid y)$ and $f(y \mid x)$ are known, a "Gibbs sequence" of draws, $y_0, x_0, y_1, x_1, y_2, \ldots, y_M, x_M$, is generated as follows. First, $y_0$ is specified "manually." Then $x_0$ is obtained as a random draw from the population $f(x \mid y_0)$. Then $y_1$ is drawn from $f(y \mid x_0)$, and so on. The iteration is, generically, as follows.

    1. Draw $x_j$ from $f(x \mid y_j)$.
    2. Draw $y_{j+1}$ from $f(y \mid x_j)$.
    3. Exit or return to step 1.
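The three-step iteration can be sketched for a case in which both conditionals are known in closed form. The example below is an illustration only, not taken from the text: it assumes a standard bivariate normal with correlation rho, for which x | y ~ N(rho y, 1 - rho^2) and symmetrically for y | x. The function name and the burn-in length of 1,000 draws are hypothetical choices.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_draws, y0=0.0, seed=0):
    """Gibbs sequence for a standard bivariate normal with correlation rho.

    Assumed conditionals: x | y ~ N(rho*y, 1 - rho^2), y | x ~ N(rho*x, 1 - rho^2).
    """
    rng = np.random.default_rng(seed)
    cond_sd = np.sqrt(1.0 - rho ** 2)
    draws = np.empty((n_draws, 2))
    y = y0                                  # y_0 is specified "manually"
    for j in range(n_draws):
        x = rng.normal(rho * y, cond_sd)    # step 1: draw x_j from f(x | y_j)
        draws[j] = (x, y)                   # store the pair (x_j, y_j)
        y = rng.normal(rho * x, cond_sd)    # step 2: draw y_{j+1} from f(y | x_j)
    return draws                            # step 3: exit after n_draws passes

draws = gibbs_bivariate_normal(rho=0.6, n_draws=20000)
kept = draws[1000:]                         # discard early draws tied to y_0
sample_corr = np.corrcoef(kept[:, 0], kept[:, 1])[0, 1]
```

After the early draws are discarded, the sample correlation of the retained pairs should be close to the rho used in the conditionals, which is one informal check that the sequence behaves like draws from the joint distribution.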

    If this process is repeated enough times, then at the last step, $(x_j, y_j)$ together are a draw from the joint distribution. Train (2001 and 2002, Chapter 12) describes how to use these results for this random parameters model.[19] The usefulness of this result for our current problem is that it is, indeed, possible to partition the joint distribution, and we can easily sample from the conditional distributions. We begin by partitioning the parameters into $\gamma = (\beta, \Sigma)$ and $\delta = (\beta_1, \ldots, \beta_n)$. Train proposes the following strategy: To obtain a draw from $\gamma \mid \delta$, we will use the Gibbs sampler to obtain a draw from the distribution of $(\beta \mid \Sigma, \delta)$, then one from the distribution of $(\Sigma \mid \beta, \delta)$. We will lay this out first, then turn to sampling from $\delta \mid \beta, \Sigma$. Conditioned on $\delta$ and $\Sigma$, $\beta$ has a $K$-variate normal distribution with mean $\bar\beta = (1/n)\sum_{i=1}^{n} \beta_i$ and covariance matrix $(1/n)\Sigma$. To sample from this distribution, we will first obtain the Cholesky factorization of $(1/n)\Sigma = LL'$, where $L$ is a lower triangular matrix. [See Section A.7.11.] Let $v$ be a vector of $K$ draws from the standard normal distribution. Then, $\bar\beta + Lv$ has mean vector $\bar\beta + L \times 0 = \bar\beta$ and covariance matrix $LIL' = (1/n)\Sigma$, which is exactly what we need. So, this shows how to sample a draw from the conditional distribution of $\beta$. To obtain a random draw from the distribution of $\Sigma \mid \beta, \delta$, we will require a random draw from the inverted Wishart distribution. The marginal posterior distribution of $\Sigma \mid \beta, \delta$ is inverted Wishart with parameters scalar $K + n$ and matrix $W = (KI + n\bar V)$,
    [19] Train describes use of this method for "mixed logit" models. By writing the densities in generic form, we have extended his result to any general setting that involves a parameter vector in the fashion described above. In Section 17.8, we will apply this model to the probit model considered in the latent class model in Example 16.5.


    where $\bar V = (1/n)\sum_{i=1}^{n} (\beta_i - \bar\beta)(\beta_i - \bar\beta)'$. Train (2001) suggests the following strategy for sampling a matrix from this distribution: Let $M$ be the lower triangular Cholesky factor of $W^{-1}$, so $MM' = W^{-1}$. Obtain $K + n$ draws of $v_k$ = $K$ standard normal variates. Then, obtain $S = M\left(\sum_{k=1}^{K+n} v_k v_k'\right)M'$. Then, $\Sigma_j = S^{-1}$ is a draw from the inverted Wishart distribution. [This is fairly straightforward, as it involves only random sampling from the standard normal distribution. For a diagonal $\Sigma$ matrix, that is, uncorrelated parameters in $\beta_i$, it simplifies a bit further. A draw for the nonzero $k$th diagonal element can be obtained using $(1 + n\bar V_{kk}) / \sum_{r=1}^{K+n} v_{rk}^2$.] The difficult step is sampling $\beta_i$. For this step, we use the Metropolis-Hastings (M-H) algorithm suggested by Chib and Greenberg (1996) and Gelman et al. (1995). The procedure involves the following steps:

    1. Given $\beta$ and $\Sigma$ and "tuning constant" $\tau$ (to be described below), compute $d = \tau L v$, where $L$ is the Cholesky factorization of $\Sigma$ and $v$ is a vector of $K$ independent standard normal draws.
    2. Create a trial value $\beta_{i1} = \beta_{i0} + d$, where $\beta_{i0}$ is the previous value.
    3. The posterior distribution for $\beta_i$ is the likelihood that appears in (16-21) times the joint normal prior density, $\phi_K[\beta_i \mid \beta, \Sigma]$. Evaluate this posterior density at the trial value $\beta_{i1}$ and the previous value $\beta_{i0}$. Let

       $R_{10} = \dfrac{f(y_i \mid X_i, \beta_{i1})\, \phi_K(\beta_{i1} \mid \beta, \Sigma)}{f(y_i \mid X_i, \beta_{i0})\, \phi_K(\beta_{i0} \mid \beta, \Sigma)}.$

    4. Draw one observation, $u$, from the standard uniform distribution, $U[0, 1]$.
    5. If $u < R_{10}$, then accept the trial (new) draw. Otherwise, reuse the old one.
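One pass of the combined sampler (the two Gibbs blocks for the mean vector and covariance matrix, then the five M-H steps for every individual) can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure as implemented by Train or by any package: the density f is assumed to be a linear model with unit-variance normal disturbances, the data are simulated, the function names are hypothetical, and every pass after burn-in is retained rather than thinning the chain. The draw of the mean vector uses the Cholesky factor of Sigma/n so that its covariance matches the stated conditional covariance (1/n)Sigma.

```python
import numpy as np

def log_lik(B, y, X):
    # Assumed density: y_it | x_it, beta_i ~ N(x_it' beta_i, 1); one value per individual.
    resid = y - np.einsum("itk,ik->it", X, B)
    return -0.5 * np.sum(resid ** 2, axis=1)

def log_prior(B, beta, Sigma_inv):
    # Log of phi_K[beta_i | beta, Sigma] up to a constant that cancels in R10.
    diff = B - beta
    return -0.5 * np.einsum("ik,kl,il->i", diff, Sigma_inv, diff)

def sweep(B, beta, Sigma, y, X, tau, rng):
    n, K = B.shape
    # Gibbs block 1: beta | Sigma, beta_1..beta_n ~ N(beta_bar, Sigma / n).
    beta_bar = B.mean(axis=0)
    beta = beta_bar + np.linalg.cholesky(Sigma / n) @ rng.standard_normal(K)
    # Gibbs block 2: Sigma | beta_1..beta_n ~ inverted Wishart(K + n, W).
    dev = B - beta_bar
    W = K * np.eye(K) + dev.T @ dev              # W = K*I + n*V_bar
    M = np.linalg.cholesky(np.linalg.inv(W))     # M M' = W^{-1}
    v = rng.standard_normal((K + n, K))          # K + n draws of K standard normals
    S = M @ (v.T @ v) @ M.T
    Sigma = np.linalg.inv(S)
    # M-H block: one trial value per individual.
    L = np.linalg.cholesky(Sigma)
    Sigma_inv = np.linalg.inv(Sigma)
    trial = B + tau * rng.standard_normal((n, K)) @ L.T          # steps 1 and 2
    log_R = (log_lik(trial, y, X) + log_prior(trial, beta, Sigma_inv)
             - log_lik(B, y, X) - log_prior(B, beta, Sigma_inv)) # step 3, in logs
    accept = np.log(rng.uniform(size=n)) < log_R                 # steps 4 and 5
    B = np.where(accept[:, None], trial, B)
    return B, beta, Sigma, accept.mean()

# Simulated panel (hypothetical data): n individuals, T periods, K coefficients.
rng = np.random.default_rng(1)
n, T, K = 50, 10, 2
true_beta = np.array([1.0, -1.0])
B_true = true_beta + 0.5 * rng.standard_normal((n, K))
X = rng.standard_normal((n, T, K))
y = np.einsum("itk,ik->it", X, B_true) + rng.standard_normal((n, T))

B, beta, Sigma = np.zeros((n, K)), np.zeros(K), np.eye(K)
kept = []
for it in range(1500):
    B, beta, Sigma, acc_rate = sweep(B, beta, Sigma, y, X, tau=0.4, rng=rng)
    if it >= 500:                                # burn-in passes are discarded
        kept.append(beta)
posterior_mean_beta = np.mean(kept, axis=0)
```

Working with the log of the ratio rather than the ratio itself avoids underflow when T is large; the comparison log u < log R10 is equivalent to u < R10.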

    This M-H iteration converges to a sequence of draws from the desired density. Overall, then, the algorithm uses the Gibbs sampler and the Metropolis-Hastings algorithm to produce the sequence of draws for all the parameters in the model. The sequence is repeated a large number of times to produce each draw from the joint posterior distribution. The entire sequence must then be repeated $N$ times to produce the sample of $N$ draws, which can then be analyzed, for example, by computing the posterior mean. Some practical details remain. The tuning constant, $\tau$, is used to control the iteration. A smaller $\tau$ increases the acceptance rate. But at the same time, a smaller $\tau$ makes new draws look more like old draws, so this slows down the process. Gelman et al. (1995) suggest $\tau = 0.4$ for $K = 1$ and smaller values down to about 0.23 for higher dimensions, as will be typical. Each multivariate draw takes many runs of the MCMC sampler. The process must be started somewhere, though it does not matter much where. Nonetheless, a "burn-in" period is required to eliminate the influence of the starting value. Typical applications use several draws for this burn-in period for each run of the sampler. How many sample observations are needed for accurate estimation is not certain, though several hundred would be a minimum. This means that there is a huge amount of computation done by this estimator. However, the computations are fairly simple. The only complicated step is computation of the acceptance criterion at Step 3 of the M-H iteration. Depending on the model, this may, like the rest of the calculations, be quite simple. Uses of this methodology can be found in many places in the literature. It has been particularly productive in marketing research, for example, in analyzing discrete


    choice such as brand choice. The cost is in the amount of computation, which is large. Some important qualifications: As we have hinted before, in Bayesian estimation, as the amount of sample information increases, it eventually dominates the prior density, even if it is informative, so long as it is proper and has finite moments. The Bernstein-von Mises Theorem [Train (p. 5)] gives formal statements of this result, but we can summarize it with Bickel and Doksum's (2000) version, which observes that the asymptotic sampling distribution of the posterior mean is the same as the asymptotic distribution of the maximum likelihood estimator. The practical implication of this for us is that if the sample size is large, the Bayesian estimator of the parameters described here and the maximum likelihood estimator described in Section 17.9 will give the same answer.[20]

    16.3 SEMIPARAMETRIC ESTIMATION

    Semiparametric estimation is based on fewer assumptions than parametric estimation. In general, the distributional assumption is removed, and an estimator is devised from certain more general characteristics of the population. Intuition suggests two (correct) conclusions. First, the semiparametric estimator will be more robust than the parametric estimator: it will retain its properties, notably consistency, across a greater range of specifications. Consider our most familiar example. The least squares slope estimator is consistent whenever the data are well behaved and the disturbances and the regressors are uncorrelated. This is even true for the frontier function in Example 16.2, which has an asymmetric, nonnormal disturbance. But, second, this robustness comes at a cost. The distributional assumption usually makes the preferred estimator more efficient than a robust one. The best robust estimator in its class will usually be inferior to the parametric estimator when the assumption of the distribution is correct. Once again, in the frontier function setting, least squares may be robust for the slopes, and it is the most efficient estimator that uses only the orthogonality of the disturbances and the regressors, but it will be inferior to the maximum likelihood estimator when the two-part normal distribution is the correct assumption.
    16.3.1 GMM ESTIMATION IN ECONOMETRICS

    Recent applications in economics include many that base estimation on the method of moments. The generalized method of moments departs from a set of model-based moment equations, $E[m(y_i, x_i, \beta)] = 0$, where the set of equations specifies a relationship known to hold in the population. We used one of these in the preceding paragraph. The least squares estimator can be motivated by noting that the essential assumption is that $E[x_i(y_i - x_i'\beta)] = 0$. The estimator is obtained by seeking a parameter estimator, $b$, which mimics the population result: $(1/n)\sum_i [x_i(y_i - x_i'b)] = 0$. This is, of course, the
    [20] Practitioners might note, recent developments in commercial software have produced a wide choice of "mixed" estimators, which are various implementations of the maximum likelihood procedures and hierarchical Bayes procedures [such as the Sawtooth program (1999)]. Unless one is dealing with a small sample, the choice between these can be based on convenience. There is little methodological difference. This returns us to the practical point noted earlier. The choice between the Bayesian approach and the sampling theory method in this application would not be based on a fundamental methodological criterion, but on purely practical considerations: the end result is the same.


    normal equations for least squares. Note that the estimator is specified without benefit of any distributional assumption. Method of moments estimation is the subject of Chapter 18, so we will defer further analysis until then.
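The link between the sample moment equations and the least squares normal equations can be checked numerically. The sketch below is an illustration only: the data are simulated, the disturbances are deliberately nonnormal (a centered exponential, chosen arbitrarily), and the function name sample_moments is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
eps = rng.exponential(1.0, n) - 1.0        # asymmetric, nonnormal, mean zero
y = X @ np.array([1.0, 2.0]) + eps

def sample_moments(b):
    # Sample analog of E[x_i (y_i - x_i' b)]: (1/n) sum_i x_i (y_i - x_i' b).
    return X.T @ (y - X @ b) / n

# The moment equations are linear in b, so they can be solved directly:
# (1/n) X'y - (1/n) X'X b = 0.
b = np.linalg.solve(X.T @ X / n, X.T @ y / n)
b_ls = np.linalg.lstsq(X, y, rcond=None)[0]   # least squares, for comparison
```

The solution of the moment equations and the least squares coefficients coincide, and no distributional assumption was used to obtain either.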
    16.3.2 LEAST ABSOLUTE DEVIATIONS ESTIMATION

    Least squares can be severely distorted by outlying observations. Recent applications in microeconomics and financial economics involving thick-tailed disturbance distributions, for example, are particularly likely to be affected by precisely these sorts of observations. (Of course, in those applications in finance involving hundreds of thousands of observations, which are becoming commonplace, all this discussion is moot.) These applications have led to the proposal of "robust" estimators that are unaffected by outlying observations.[21] In this section, we will examine one of these, the least absolute deviations, or LAD estimator. That least squares gives such large weight to large deviations from the regression causes the results to be particularly sensitive to small numbers of atypical data points when the sample size is small or moderate. The least absolute deviations (LAD) estimator has been suggested as an alternative that remedies (at least to some degree) the problem. The LAD estimator is the solution to the optimization problem,

    $\min_{b_0} \sum_{i=1}^{n} |y_i - x_i'b_0|.$
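A small numerical illustration of the optimization problem follows. The text does not specify an algorithm; the sketch below swaps in iteratively reweighted least squares (IRLS), which approximates the LAD solution by repeatedly solving a weighted least squares problem with weights 1/|e_i|. The function name lad_irls, the iteration count, and the floor eps on the absolute residuals are assumptions made for this example. Linear programming is the more usual exact approach.

```python
import numpy as np

def lad_irls(y, X, n_iter=200, eps=1e-8):
    """Approximate LAD coefficients by iteratively reweighted least squares."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]            # start from least squares
    for _ in range(n_iter):
        w = 1.0 / np.maximum(np.abs(y - X @ b), eps)    # weights 1/|e_i|, floored
        Xw = X * w[:, None]
        b = np.linalg.solve(X.T @ Xw, Xw.T @ y)         # weighted least squares step
    return b

# Ten points on the line y = 1 + 2x, with one observation displaced by 10.
x = np.arange(10.0)
X = np.column_stack([np.ones(10), x])
y = 1.0 + 2.0 * x
y[5] += 10.0

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_lad = lad_irls(y, X)
sum_abs_ols = np.sum(np.abs(y - X @ b_ols))
sum_abs_lad = np.sum(np.abs(y - X @ b_lad))
```

In this constructed example the least squares intercept is pulled toward the displaced point, while the LAD fit stays on the line through the other nine observations.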

    The LAD estimator's history predates least squares (which itself was proposed over 200 years ago). It has seen little use in econometrics, primarily for the same reason that Gauss's method (LS) supplanted LAD at its origination: LS is vastly easier to compute. Moreover, in a more modern vein, its statistical properties are more firmly established than LAD's, and samples are usually large enough that the small sample advantage of LAD is not needed. The LAD estimator is a special case of the quantile regression, $\text{Prob}[y_i \le x_i'\beta] = q$. The LAD estimator estimates the median regression. That is, it is the solution to the quantile regression when $q = 0.5$. Koenker and Bassett (1978, 1982), Huber (1967), and Rogers (1993) have analyzed this regression.[22] Their results suggest an estimator for the asymptotic covariance matrix of the quantile regression estimator,

    $\text{Est.Asy.Var}[b_q] = (X'X)^{-1} X'DX (X'X)^{-1},$

    where $D$ is a diagonal matrix containing weights $d_i = \left[\dfrac{q}{f(0)}\right]^2$ if $y_i - x_i'\beta$ is positive and $\left[\dfrac{1 - q}{f(0)}\right]^2$ otherwise,

    [21] For some applications, see Taylor (1974), Amemiya (1985, pp. 70-80), Andrews (1974), Koenker and Bassett (1978), and a survey written at a very accessible level by Birkes and Dodge (1993). A somewhat more rigorous treatment is given by Hardle (1990).

    [22] Powell (1984) has extended the LAD estimator to produce a robust estimator for the case in which data on the dependent variable are censored, that is, when negative values of $y_i$ are recorded as zero. See Section 22.3.4c for discussion and Melenberg and van Soest (1996) for an application. For some related results on other semiparametric approaches to regression, see Butler, McDonald, Nelson, and White (1990) and McDonald and White (1993).


    and $f(0)$ is the true density of the disturbances evaluated at 0.[23] [It remains to obtain an estimate of $f(0)$.] There is one useful symmetry in this result. Suppose that the true density were normal with variance $\sigma^2$. Then the preceding would reduce to $\sigma^2(\pi/2)(X'X)^{-1}$, which is the result we used in Example E.1 to compare estimates of the median and the mean in a simple situation of random sampling. For more general cases, some other empirical estimate of $f(0)$ is going to be required. Nonparametric methods of density estimation are available [see Section 16.4 and, e.g., Johnston and DiNardo (1997, pp. 370-375)]. But for the small sample situations in which techniques such as this are most desirable (our application below involves 25 observations), nonparametric kernel density estimation of a single ordinate is optimistic; these are, after all, asymptotic results. But asymptotically, as suggested by Example E.1, the results begin overwhelmingly to favor least squares. For better or worse, a convenient estimator would be a kernel density estimator as described in Section 16.4.1. Looking ahead, the computation would be

    $\hat f(0) = \dfrac{1}{n} \sum_{i=1}^{n} \dfrac{1}{h} K\!\left[\dfrac{e_i}{h}\right],$

    where $h$ is the bandwidth (to be discussed below), $K[\cdot]$ is a weighting, or kernel, function, and $e_i$, $i = 1, \ldots, n$, is the set of residuals. There are no hard and fast rules for choosing $h$; one popular choice is that used by Stata, $h = 0.9s/n^{1/5}$. The kernel function is likewise discretionary, though it rarely matters much which one chooses; the logit kernel (see Table 16.4) is a common choice. The bootstrap method of inferring statistical properties is well suited for this application. Since the efficacy of the bootstrap has been established for this purpose, the search for a formula for standard errors of the LAD estimator is not really necessary. The bootstrap estimator for the asymptotic covariance matrix can be computed as follows:

    $\text{Est.Var}[b_{LAD}] = \dfrac{1}{R} \sum_{r=1}^{R} \left(b_{LAD}(r) - b_{LAD}\right)\left(b_{LAD}(r) - b_{LAD}\right)',$

    where $b_{LAD}$ is the LAD estimator and $b_{LAD}(r)$ is the $r$th LAD estimate of $\beta$ based on a sample of $n$ observations, drawn with replacement, from the original data set.
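The bootstrap covariance formula is straightforward to code. The sketch below is a generic illustration under stated assumptions: the function name bootstrap_cov is hypothetical, the estimator is passed in as a callable, and least squares on simulated data is used as a stand-in estimator purely to keep the example short. Any function returning a coefficient vector, including an LAD routine, could be supplied instead. Note that the deviations are taken around the full-sample estimate, as in the formula, not around the mean of the bootstrap replications.

```python
import numpy as np

def bootstrap_cov(estimator, y, X, R=500, seed=0):
    """(1/R) * sum_r (b(r) - b)(b(r) - b)', where b is the full-sample estimate."""
    rng = np.random.default_rng(seed)
    n = len(y)
    b_full = estimator(y, X)
    dev = np.empty((R, len(b_full)))
    for r in range(R):
        idx = rng.integers(0, n, size=n)      # n observations drawn with replacement
        dev[r] = estimator(y[idx], X[idx]) - b_full
    return dev.T @ dev / R

def ols(y, X):
    # Stand-in estimator for the demonstration only.
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(42)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n)

V_boot = bootstrap_cov(ols, y, X, R=500)
se_boot = np.sqrt(np.diag(V_boot))
```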
    Example 16.6 LAD Estimation of a Cobb-Douglas Production Function

    Zellner and Revankar (1970) proposed a generalization of the Cobb-Douglas production function which allows economies of scale to vary with output. Their statewide data on Y = value added (output), K = capital, L = labor, and N = the number of establishments in the transportation industry are given in Appendix Table F9.2. The generalized model is estimated in Example 17.9. For this application, estimates of the Cobb-Douglas production function, $\ln(Y_i/N_i) = \beta_1 + \beta_2 \ln(K_i/N_i) + \beta_3 \ln(L_i/N_i) + \varepsilon_i$, are obtained by least squares and LAD. The standardized least squares residuals (see Section 4.9.3) suggest that two observations (Florida and Kentucky) are outliers by the usual
    [23] See Stata (2001). Koenker suggests that for independent and identically distributed observations, one should replace $d_i$ with the constant $a = q(1 - q)/[f(F^{-1}(q))]^2 = 0.25/[f(0)]^2$ for the median (LAD) estimator. This reduces the expression to the true asymptotic covariance matrix, $a(X'X)^{-1}$. The one given is a sample estimator which will behave the same in large samples. (Personal communication to the author.)


    TABLE 16.3 LS and LAD Estimates of a Production Function

                          Least Squares                                     LAD
                                                              Bootstrap             Kernel Density
    Coefficient    Estimate   Std. Error   t Ratio   Estimate   Std. Error   t Ratio   Std. Error   t Ratio
    Constant        1.844       0.234       7.896     1.806       0.344       5.244      0.320       5.639
    beta_k          0.245       0.107       2.297     0.205       0.128       1.597      0.147       1.398
    beta_l          0.805       0.126       6.373     0.849       0.163       5.201      0.173       4.903
    Sum e^2         1.2222                            1.2407
    Sum |e|         4.0008                            3.9927

    construction. The least squares coefficient vectors with and without these two observations are (1.844, 0.245, 0.805) and (1.764, 0.209, 0.852), respectively, which bears out the suggestion that these two points do exert considerable influence. Table 16.3 presents the LAD estimates of the same parameters, with standard errors based on 500 bootstrap replications. The LAD estimates with and without these two observations are identical, so only the former are presented. Using the simple approximation of multiplying the corresponding OLS standard error by $(\pi/2)^{1/2} = 1.2533$ produces a surprisingly close estimate of the bootstrap estimated standard errors for the two slope parameters (0.134, 0.158) compared with the bootstrap estimates of (0.128, 0.163). The second set of estimated standard errors are based on Koenker's suggested estimator, $0.25/\hat f^2(0) = 0.25/1.5467^2 = 0.104502$. The bandwidth and kernel function are those suggested earlier. The results are surprisingly consistent given the small sample size.
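The kernel-based calculation used for the second set of standard errors can be sketched as follows. This is an illustration under stated assumptions rather than the computation actually used for Table 16.3: the function names are hypothetical, s is taken to be the sample standard deviation of the residuals, the logit kernel is taken to be the logistic density Lambda(z)[1 - Lambda(z)], and the residuals in the usage lines are simulated, not the production function residuals.

```python
import numpy as np

def logistic_kernel(z):
    # Lambda(z) * (1 - Lambda(z)), written as 1 / (4 cosh^2(z/2)) for stability.
    return 0.25 / np.cosh(z / 2.0) ** 2

def density_at_zero(e):
    """Kernel estimate of f(0): (1/n) sum_i (1/h) K[e_i / h], with h = 0.9 s / n^(1/5)."""
    n = len(e)
    h = 0.9 * np.std(e, ddof=1) / n ** 0.2
    return np.mean(logistic_kernel(e / h) / h)

def lad_cov_koenker(e, X):
    """a (X'X)^{-1} with a = 0.25 / f(0)^2, the constant suggested for the median case."""
    a = 0.25 / density_at_zero(e) ** 2
    return a * np.linalg.inv(X.T @ X)

# Arithmetic from the example: 0.25 / 1.5467^2.
a_example = 0.25 / 1.5467 ** 2

# Simulated standard normal "residuals": the true ordinate at zero is about 0.399.
rng = np.random.default_rng(0)
e = rng.standard_normal(5000)
f0 = density_at_zero(e)
X = np.column_stack([np.ones(5000), rng.standard_normal(5000)])
V = lad_cov_koenker(e, X)
```

The kernel estimate is smoothed, so it sits slightly below the true ordinate in finite samples; with only 25 observations, as in the example, the estimate would be considerably noisier.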
    16.3.3 PARTIALLY LINEAR REGRESSION

    The proper functional form in the linear regression is an important specification issue. We examined this in detail in Chapter 7. Some approaches, including the use of dummy variables, logs, quadratics, and so on, were considered as means of capturing nonlinearity. The translog model in particular (Example 2.4) is a well-known approach to approximating an unknown nonlinear function. Even with these approaches, the researcher might still be interested in relaxing the assumption of functional form in the model. The partially linear model [analyzed in detail by Yatchew (1998, 2000)] is another approach. Consider a regression model in which one variable, $x$, is of particular interest, and the functional form with respect to $x$ is problematic. Write the model as $y_i = f(x_i) + z_i'\beta + \varepsilon_i$, where the data are assumed to be well behaved and, save for the functional form, the assumptions of the classical model are met. The function $f(x_i)$ remains unspecified. As stated, estimation by least squares is not feasible until $f(x_i)$ is specified. Suppose the data were such that they consisted of pairs of observations $(y_{j1}, y_{j2})$, $j = 1, \ldots, n/2$, in which $x_{j1} = x_{j2}$ within every pair. If so, then estimation of $\beta$ could be based on the simple transformed model

    $y_{j2} - y_{j1} = (z_{j2} - z_{j1})'\beta + (\varepsilon_{j2} - \varepsilon_{j1}), \quad j = 1, \ldots, n/2.$

    As long as observations are independent, the constructed disturbances, $v_i$, still have zero mean, variance now $2\sigma^2$, and remain uncorrelated across pairs, so a classical model applies and least squares is actually optimal. Indeed, with the estimate of $\beta$, say, $\hat\beta_d$, in


    hand, a noisy estimate of $f(x_i)$ could be estimated with $y_i - z_i'\hat\beta_d$ (the estimate contains the estimation error as well as $v_i$).[24] The problem, of course, is that the enabling assumption is heroic. Data would not behave in that fashion unless they were generated experimentally. The logic of the partially linear regression estimator is based on this observation nonetheless. Suppose that the observations are sorted so that $x_1 < x_2 < \cdots < x_n$. Suppose, as well, that this variable is well behaved in the sense that as the sample size increases, this sorted data vector more tightly and uniformly fills the space within which $x_i$ is assumed to vary. Then, intuitively, the difference is "almost" right, and becomes better as the sample size grows. [Yatchew (1997, 1998) goes more deeply into the underlying theory.] A theory is also developed for a better differencing of groups of two or more observations. The transformed observation is $y_{d,i} = \sum_{m=0}^{M} d_m y_{i-m}$, where $\sum_{m=0}^{M} d_m = 0$ and $\sum_{m=0}^{M} d_m^2 = 1$. (The data are not separated into nonoverlapping groups for this transformation; we merely used that device to motivate the technique.) The pair of weights for $M = 1$ is obviously $\pm\sqrt{0.5}$; this is just a scaling of the simple difference, $1, -1$. Yatchew [1998, p. 697] tabulates "optimal" differencing weights for $M = 1, \ldots, 10$. The values for $M = 2$ are $(0.8090, -0.500, -0.3090)$ and for $M = 3$ are $(0.8582, -0.3832, -0.2809, -0.1942)$. This estimator is shown to be consistent, asymptotically normally distributed, and have asymptotic covariance matrix

    $\text{Asy.Var}[\hat\beta_d] = \left(1 + \dfrac{1}{2M}\right) \dfrac{\sigma_v^2}{n} \left\{E_x[\text{Var}[z \mid x]]\right\}^{-1}.$[25]

    The matrix can be estimated using the sums of squares and cross products of the differenced data. The residual variance is likewise computed with

    $\hat\sigma_v^2 = \dfrac{\sum_{i=M+1}^{n} (y_{d,i} - z_{d,i}'\hat\beta_d)^2}{n - M}.$

    Yatchew suggests that the partial residuals, $y_{d,i} - z_{d,i}'\hat\beta_d$, be smoothed with a kernel density estimator to provide an improved estimator of $f(x_i)$.
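The differencing estimator can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name difference_estimator is hypothetical, the data are simulated with f(x) = sin(2 pi x) chosen arbitrarily, and the covariance line applies the factor (1 + 1/(2M)) with the moment matrix of the differenced z standing in for the expected conditional variance of z. The weights are the M = 1 and M = 2 values quoted in the text.

```python
import numpy as np

def difference_estimator(y, x, Z, weights):
    """Sort on x, apply the differencing weights, then least squares on the differenced data."""
    d_w = np.asarray(weights, dtype=float)
    M = len(d_w) - 1
    order = np.argsort(x)
    y_s, Z_s = y[order], Z[order]
    n = len(y)
    # y_{d,i} = sum_{m=0}^{M} d_m y_{i-m}, for i = M+1, ..., n (rows M..n-1 when 0-based).
    yd = sum(d_w[m] * y_s[M - m: n - m] for m in range(M + 1))
    Zd = sum(d_w[m] * Z_s[M - m: n - m] for m in range(M + 1))
    b = np.linalg.solve(Zd.T @ Zd, Zd.T @ yd)
    s2 = np.sum((yd - Zd @ b) ** 2) / (n - M)           # residual variance
    cov = (1.0 + 1.0 / (2 * M)) * s2 * np.linalg.inv(Zd.T @ Zd)
    return b, s2, cov

# Simulated data: the true coefficient on z is 1.5 and the disturbance variance is 0.01.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0.0, 1.0, n)
Z = (0.5 * x + rng.standard_normal(n))[:, None]          # z is correlated with x
y = np.sin(2 * np.pi * x) + 1.5 * Z[:, 0] + 0.1 * rng.standard_normal(n)

b1, s2_1, cov1 = difference_estimator(y, x, Z, [np.sqrt(0.5), -np.sqrt(0.5)])
b2, s2_2, cov2 = difference_estimator(y, x, Z, [0.8090, -0.5000, -0.3090])
```

Because the sorted x values are close together in a sample of this size, differencing removes nearly all of f(x), and both sets of weights recover the coefficient on z without any specification of f.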
    Example 16.7 Partially Linear Translog Cost Function

    Yatchew (1998, 2000) applied this technique to an analysis of scale effects in the costs of electricity supply. The cost function, following Nerlove (1963) and Christensen and Greene (1976), was specified to be a translog model (see Example 2.4 and Section 14.3.2) involving labor and capital input prices, other characteristics of the utility, and the variable of interest, the number of customers in the system, C. We will carry out a similar analysis using Christensen and Greene's 1970 electricity supply data. The data are given in Appendix Table F5.2. (See Section 14.3.1 for description of the data.) There are 158 observations in the data set, but the last 35 are holding companies which are comprised of combinations of the others. In addition, there are several extremely small New England utilities whose costs are clearly unrepresentative of the best practice in the industry. We have done the analysis using firms 6-123 in the data set. Variables in the data set include Q = output, C = total cost, and PK, PL, and PF = unit cost measures for capital, labor, and fuel, respectively. The parametric model specified is a restricted version of the Christensen and Greene model, $c = \beta_1 k + \beta_2 l + \beta_3 q + \beta_4 (q^2/2) + \beta_5 + \varepsilon$,
    [24] See Estes and Honore (1995), who suggest this approach (with simple differencing of the data).

    [25] Yatchew (2000, p. 191) denotes this covariance matrix $E[\text{Cov}[z \mid x]]$.


    FIGURE 16.3 Smoothed Estimator for Costs. [Nonparametric regression for fitted cost; horizontal axis: OUTPUT, vertical axis: E[Cost | Q]; series plotted: E[Cost | Q] and Fitted Cost.]

    where $c = \ln[C/(Q \times PF)]$, $k = \ln(PK/PF)$, $l = \ln(PL/PF)$, and $q = \ln Q$. The partially linear model substitutes $f(Q)$ for the last three terms. The division by PF ensures that average cost is homogeneous of degree one in the prices, a theoretical necessity. The estimated equations, with estimated standard errors, are shown below.

    (parametric)      $c = -6.83 + 0.168k + 0.146l - 0.590q + 0.061q^2/2 + \varepsilon$,   $s = 0.13383$
                      standard errors: (0.353) (0.042) (0.048) (0.075) (0.010)

    (partial linear)  $c_d = 0.170k_d + 0.127l_d + f(Q) + v$,   $s = 0.14044$
                      standard errors: (0.049) (0.057)

    Yatchew's suggested smoothed kernel density estimator for the relationship between average cost and output is shown in Figure 16.3 with the unsmoothed partial residuals. We find (as did Christensen and Greene in the earlier study) that in the relatively low ranges of output, there is a fairly strong relationship between scale and average cost.
    16.3.4 KERNEL DENSITY METHODS

    The kernel density estimator is an inherently nonparametric tool, so it fits more appropriately into the next section. But some models which use kernel methods are not completely nonparametric. The partially linear model in the preceding example is a case in point. Many models retain an index function formulation, that is, build the specification around a linear function, $x'\beta$, which makes them at least semiparametric, but nonetheless still avoid distributional assumptions by using kernel methods. Lewbel's (2000) estimator for the binary choice model is another example.

    Example 16.8 Semiparametric Estimator for Binary Choice Models

    The core binary choice model analyzed in Example 16.5, the probit model, is a fully parametric specification. Under the assumptions of the model, maximum likelihood is the efficient (and appropriate) estimator. However, as documented in a voluminous literature, the estimator


    of $\beta$ is fragile with respect to failures of the distributional assumption. We will examine a few semiparametric and nonparametric estimators in Section 21.5. To illustrate the nature of the modeling process, we consider an estimator recently suggested by Lewbel (2000). The probit model is based on the normal distribution, with $\text{Prob}[y_i = 1] = \text{Prob}[x_i'\beta + \varepsilon_i > 0]$, where $\varepsilon_i \sim N[0, 1]$. The estimator of $\beta$ under this specification will be inconsistent if the distribution is not normal or if $\varepsilon_i$ is heteroscedastic. Lewbel suggests the following: If (a) it can be assumed that $x_i$ contains a "special" variable, $v_i$, whose coefficient has a known sign (a method is developed for determining the sign) and (b) the density of $\varepsilon_i$ is independent of this variable, then a consistent estimator of $\beta$ can be obtained by linear regression of $[y_i - s(v_i)]/f(v_i \mid x_i)$ on $x_i$, where $s(v_i) = 1$ if $v_i > 0$ and 0 otherwise, and $f(v_i \mid x_i)$ is a kernel density estimator of the density of $v_i \mid x_i$. Lewbel's estimator is robust to heteroscedasticity and distribution. A method is also suggested for estimating the distribution of $\varepsilon_i$. Note that Lewbel's estimator is semiparametric. His underlying model is a function of the parameters $\beta$, but the distribution is unspecified.

    16 4

    NONPARAMETRIC ESTIMATION

    Researchers have long held reservations about the strong assumptions made in parametric models t by maximum likelihood The linear regression model with normal disturbances is a leading example Splines translog models and polynomials all represent attempts to generalize the functional form Nonetheless questions remain about how much generality can be obtained with such approximations The techniques of nonparametric estimation discard essentially all xed assumptions about functional form and distribution Given their very limited structure it follows that nonparametric speci cations rarely provide very precise inferences The bene t is that what information is provided is extremely robust The centerpiece of this set of techniques is the kernel density estimator that we have used in the preceding examples We will examine some examples then examine an application to a bivariate regression 26
    16 4 1 KERNEL DENSITY ESTIMATION

Sample statistics such as a mean, variance, and range give summary information about the values that a random variable may take. But they do not suffice to show the distribution of values that the random variable takes, and these may be of interest as well. The density of the variable is used for this purpose. A fully parametric approach to density estimation begins with an assumption about the form of a distribution. Estimation of the density is accomplished by estimation of the parameters of the distribution. To take the canonical example, if we decide that a variable is generated by a normal distribution with mean $\mu$ and variance $\sigma^2$, then the density is fully characterized by these parameters. It follows that

$$\hat f(x) = f(x \mid \hat\mu, \hat\sigma^2) = \frac{1}{\hat\sigma}\,\frac{1}{\sqrt{2\pi}}\,\exp\left[-\frac{1}{2}\left(\frac{x - \hat\mu}{\hat\sigma}\right)^2\right].$$



[26] There is a large and rapidly growing literature in this area of econometrics. Two major references which provide an applied and theoretical foundation are Hardle (1990) and Pagan and Ullah (1999).

One may be unwilling to make a narrow distributional assumption about the density. The usual approach in this case is to begin with a histogram as a descriptive device. Consider


[FIGURE 16.4 Histogram for Estimated Coefficients. Histogram for variable BSALES; vertical axis: Frequency; horizontal axis: BSALES.]

an example. In Example 16.5, we estimated a model that produced a posterior estimator of a slope vector for each of the 1,270 firms in our sample. We might be interested in the distribution of these estimators across firms. In particular, the posterior estimates of the estimated slope on lnsales for the 1,270 firms have a sample mean of 0.3428, a standard deviation of 0.08919, a minimum of 0.2361, and a maximum of 0.5664. This tells us little about the distribution of values, though the fact that the mean is well below the midrange of 0.4013 might suggest some skewness. The histogram in Figure 16.4 is much more revealing. Based on what we see thus far, an assumption of normality might not be appropriate. The distribution seems to be bimodal, but certainly no particular functional form seems natural.

The histogram is a crude density estimator. The rectangles in the figure are called bins. By construction, they are of equal width. The parameters of the histogram are the number of bins, the bin width, and the leftmost starting point. Each is important in the shape of the end result. Since the frequency count in the bins sums to the sample size, by dividing each by n, we have a density estimator that satisfies an obvious requirement for a density: it sums (integrates) to one. We can formalize this by laying out the method by which the frequencies are obtained. Let $x_k$ be the midpoint of the kth bin and let h be the width of the bin (we will shortly rename h to be the bandwidth for the density estimator). The distances to the left and right boundaries of the bins are h/2. The frequency count in each bin is the number of observations in the sample which fall in the range $x_k \pm h/2$. Collecting terms, we have our estimator

$$\hat f(x) = \frac{1}{n}\,\frac{\text{frequency in bin}_x}{\text{width of bin}_x} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}\,\mathbf{1}\left(x - \frac{h}{2} < x_i < x + \frac{h}{2}\right)$$


where $\mathbf{1}(\text{statement})$ denotes an indicator function which equals 1 if the statement is true and 0 if it is false, and $\text{bin}_x$ denotes the bin which has x as its midpoint. We see, then, that the histogram is an estimator, at least in some respects, like other estimators we have encountered. The event in the indicator can be rearranged to produce an equivalent form

$$\hat f(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}\,\mathbf{1}\left(-\frac{1}{2} < \frac{x_i - x}{h} < \frac{1}{2}\right).$$
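The histogram estimator just described can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `histogram_density` and the simulated bimodal sample are hypothetical (the BSALES estimates from Example 16.5 are not reproduced here), and the bins are spread evenly over the sample range.

```python
import numpy as np

def histogram_density(x, n_bins):
    """Return bin midpoints, the common bin width h, and the density in each bin."""
    x = np.asarray(x, dtype=float)
    counts, edges = np.histogram(x, bins=n_bins)   # equal-width bins over the sample range
    h = edges[1] - edges[0]                        # bin width
    midpoints = 0.5 * (edges[:-1] + edges[1:])
    density = counts / (x.size * h)                # (1/n) * frequency / width
    return midpoints, h, density

rng = np.random.default_rng(0)
# Hypothetical bimodal sample standing in for the 1,270 slope estimates.
sample = np.concatenate([rng.normal(0.30, 0.03, 800), rng.normal(0.48, 0.04, 470)])
mid, h, dens = histogram_density(sample, n_bins=8)
area = float(np.sum(dens * h))   # total area of the rectangles
```

Dividing the counts by n times the bin width is what makes the rectangles' total area equal to one, which is the "obvious requirement for a density" noted in the text.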

This form of the estimator simply counts the number of points that are within 1/2 bin width of $x_k$. Albeit rather crude, this "naive" (its formal name in the literature) estimator is in the form of kernel density estimators that we have met at various points:

$$\hat f(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}\,K\left[\frac{x_i - x}{h}\right], \quad \text{where } K[z] = \mathbf{1}[-1/2 < z < 1/2].$$

The naive estimator has several shortcomings. It is neither smooth nor continuous. Its shape is partly determined by where the leftmost and rightmost terminals of the histogram are set. (In constructing a histogram, one often chooses the bin width to be a specified fraction of the sample range. If so, then the terminals of the lowest and highest bins will equal the minimum and maximum values in the sample, and this will partly determine the shape of the histogram. If, instead, the bin width is set irrespective of the sample values, then this problem is resolved.) More importantly, the shape of the histogram will be crucially dependent on the bandwidth itself. (Unfortunately, this problem remains even with more sophisticated specifications.)

The crudeness of the weighting function in the estimator is easy to remedy. Rosenblatt's (1956) suggestion was to substitute for the naive estimator some other weighting function which is continuous and which also integrates to one. A number of candidates have been suggested, including the (long) list in Table 16.4. Each of these is smooth, continuous, symmetric, and equally attractive. The Parzen, logit, and normal kernels are defined so that the weight only asymptotically falls to zero whereas the others fall to zero at specific points. It has been observed that in constructing a density estimator, the choice of kernel function is rarely crucial, and is usually minor in importance compared to the more difficult problem of choosing the bandwidth. (The logit and normal kernels appear to be the default choice in many applications.)

TABLE 16.4 Kernels for Density Estimation

Kernel          Formula K[z]
Epanechnikov    $0.75(1 - 0.2z^2)/2.236$ if $|z| \le \sqrt{5}$, 0 else
Normal          $\phi(z)$ (normal density)
Logit           $\Lambda(z)[1 - \Lambda(z)]$ (logistic density)
Uniform         $0.5$ if $|z| \le 1$, 0 else
Beta            $(1 - z)(1 + z)/24$ if $|z| \le 1$, 0 else
Cosine          $1 + \cos(2\pi z)$ if $|z| \le 0.5$, 0 else
Triangle        $1 - |z|$ if $|z| \le 1$, 0 else
Parzen          $4/3 - 8z^2 + 8|z|^3$ if $|z| \le 0.5$, $8(1 - |z|)^3/3$ else
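To make the table concrete, the following sketch codes five of the listed kernels and a generic kernel density estimator of the form $\hat f(x) = (1/n)\sum_i (1/h)K[(x_i - x)/h]$. The names `KERNELS` and `kernel_density` are hypothetical, and the Epanechnikov support is taken to be $|z| \le \sqrt{5}$, which is the range on which that formula integrates to one. The final lines check the integration requirement numerically with a simple Riemann sum.

```python
import numpy as np

def logistic_cdf(z):
    return 1.0 / (1.0 + np.exp(-z))

# A subset of the kernels in Table 16.4, each written as a function of z.
KERNELS = {
    "epanechnikov": lambda z: np.where(np.abs(z) <= np.sqrt(5.0),
                                       0.75 * (1.0 - 0.2 * z**2) / np.sqrt(5.0), 0.0),
    "normal": lambda z: np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi),
    "logit": lambda z: logistic_cdf(z) * (1.0 - logistic_cdf(z)),
    "uniform": lambda z: np.where(np.abs(z) <= 1.0, 0.5, 0.0),
    "triangle": lambda z: np.where(np.abs(z) <= 1.0, 1.0 - np.abs(z), 0.0),
}

def kernel_density(x0, data, h, kernel):
    """f_hat(x0) = (1/n) sum_i (1/h) K[(x_i - x0)/h], for scalar or array x0."""
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    z = (np.asarray(data, dtype=float)[None, :] - x0[:, None]) / h
    return kernel(z).mean(axis=1) / h

# Numerical check that each kernel integrates to approximately one.
grid = np.linspace(-40.0, 40.0, 400001)
dz = grid[1] - grid[0]
areas = {name: float(np.sum(k(grid)) * dz) for name, k in KERNELS.items()}
```

Passing the kernel as an argument keeps the estimator itself unchanged across rows of the table, which mirrors the remark that the choice of kernel usually matters less than the choice of bandwidth.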


The kernel density function is an estimator. For any specific x, $\hat f(x)$ is a sample statistic,

$$\hat f(z) = \frac{1}{n}\sum_{i=1}^{n} g(x_i \mid z, h).$$

Since $g(x_i \mid z, h)$ is nonlinear, we should expect a bias in a finite sample. It is tempting to apply our usual results for sample moments, but the analysis is more complicated because the bandwidth is a function of n. Pagan and Ullah (1999) have examined the properties of kernel estimators in detail, and found that under certain assumptions the estimator is consistent and asymptotically normally distributed but biased in finite samples. The bias is a function of the bandwidth, but for an appropriate choice of h, it does vanish asymptotically. As intuition might suggest, the larger is the bandwidth, the greater is the bias, but at the same time, the smaller is the variance. This might suggest a search for an optimal bandwidth. After a lengthy analysis of the subject, however, the authors' conclusion provides little guidance for finding one. One consideration does seem useful. In order for the proportion of observations captured in the bin to converge to the corresponding area under the density, the width itself must shrink more slowly than $1/n$. Common applications typically use a bandwidth equal to some multiple of $n^{-1/5}$ for this reason. Thus, the one we used earlier is $h = 0.9\,s/n^{1/5}$. To conclude the illustration begun earlier, Figure 16.5 is a logit based kernel density estimator for the distribution of slope estimates for the model estimated earlier. The resemblance to the histogram is to be expected.
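The bandwidth rule $h = 0.9\,s/n^{1/5}$ and a logit based kernel density estimator can be sketched as follows. The function names and the simulated sample are hypothetical and only stand in for the slope estimates behind Figure 16.5; $s$ is taken to be the sample standard deviation.

```python
import numpy as np

def rule_of_thumb_bandwidth(data):
    """h = 0.9 * s / n**(1/5), with s the sample standard deviation."""
    data = np.asarray(data, dtype=float)
    return 0.9 * data.std(ddof=1) / data.size ** 0.2

def logit_kde(grid, data, h):
    """Kernel density estimate on `grid` using the logistic-density kernel."""
    z = (np.asarray(data, dtype=float)[None, :] - np.asarray(grid, dtype=float)[:, None]) / h
    lam = 1.0 / (1.0 + np.exp(-z))
    return (lam * (1.0 - lam)).mean(axis=1) / h

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, 500)            # hypothetical sample
h = rule_of_thumb_bandwidth(data)
grid = np.linspace(-15.0, 15.0, 3001)
f_hat = logit_kde(grid, data, h)
area = float(np.sum(f_hat) * (grid[1] - grid[0]))   # should be close to one

# A bandwidth proportional to n**(-1/5) goes to zero, but more slowly than 1/n:
# the ratio of the two grows with n.
ratio_small_n = 100 ** -0.2 / (1 / 100)
ratio_large_n = 10_000 ** -0.2 / (1 / 10_000)
```

Re-running the estimator with a bandwidth several times larger or smaller than `h` is a quick way to see the bias and variance trade-off described above.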

[FIGURE 16.5 Kernel Density for Coefficients. Kernel density estimate for BSALES; vertical axis: Density; horizontal axis: BSALES.]

16.4.2 NONPARAMETRIC REGRESSION

The regression function of a variable y on a single variable x is specified as $y = \mu(x) + \varepsilon$. No assumptions about distribution, homoscedasticity, serial correlation or, most importantly, functional form are made at the outset; $\mu(x)$ may be quite nonlinear. Since this is the conditional mean, the only substantive restriction would be that deviations from the conditional mean function are not a function of (correlated with) x. We have already considered several possible strategies for allowing the conditional mean to be nonlinear, including spline functions, polynomials, logs, dummy variables, and so on. But each of these is a "global" specification. The functional form is still the same for all values of x. Here, we are interested in methods that do not assume any particular functional form.

The simplest case to analyze would be one in which several (different) observations on $y_i$ were made with each specific value of $x_i$. Then, the conditional mean function could be estimated naturally using the simple group means. The approach has two shortcomings, however. First, simply connecting the points of means, $(x_i, \bar y \mid x_i)$, does not produce a smooth function. The method would still be assuming something specific about the function between the points, which we seek to avoid. Second, this sort of data arrangement is unlikely to arise except in an experimental situation. Given that data are not likely to be grouped, another possibility is a piecewise regression in which we define "neighborhoods" of points around each x of interest and fit a separate linear or quadratic regression in each neighborhood. This returns us to the problem of continuity that we noted earlier, but the method of splines is actually designed specifically for this purpose. Still, unless the number of neighborhoods is quite large, such a function is likely to be crude.

Smoothing techniques are designed to allow construction of an estimator of the conditional mean function without making strong assumptions about the behavior of the function between the points. They retain the usefulness of the "nearest neighbor" concept, but use more elaborate schemes to produce smooth, well behaved functions. The general class may be defined by a conditional mean estimating function

$$\hat\mu(x) = \sum_{i=1}^{n} w_i(x \mid x_1, x_2, \ldots, x_n)\,y_i = \sum_{i=1}^{n} w_i(x \mid \mathbf{x})\,y_i,$$

where the weights sum to 1. The linear least squares regression line is such an estimator. The predictor is $\hat\mu(x) = a + bx$, where a and b are the least squares constant and slope. For this function, you can show that

$$w_i(x \mid \mathbf{x}) = \frac{1}{n} + \frac{(x - \bar x)(x_i - \bar x)}{\sum_{j=1}^{n}(x_j - \bar x)^2}.$$

The problem with this particular weighting function, which we seek to avoid here, is that it allows every $x_i$ to be in the neighborhood of x, but it does not reduce the weight of any $x_i$ when it is far from x. A number of smoothing functions have been suggested


which are designed to produce a better behaved regression function. [See Cleveland (1979) and Schimek (2000).] We will consider two.

The locally weighted smoothed regression estimator ("loess" or "lowess" depending on your source) is based on explicitly defining a neighborhood of points that is close to x. This requires the choice of a bandwidth, h. The neighborhood is the set of points for which $|x - x_i|$ is small. For example, the set of points that are within the range $x \pm h/2$ (as in our original histogram) might constitute the neighborhood. A suitable weight is then required. Cleveland (1979) recommends the tricube weight,

$$T_i(x \mid \mathbf{x}, h) = \left[1 - \left(\frac{|x_i - x|}{h}\right)^3\right]^3.$$
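The contrast between the least squares weights and a local weight such as the tricube can be checked numerically. The sketch below is illustrative only: the function names and the simulated data are hypothetical, the least squares weight is written as $1/n + (x - \bar x)(x_i - \bar x)/\sum_j (x_j - \bar x)^2$, and the neighborhood for the tricube weight is assumed to be $|x_i - x| \le h$ with the weights rescaled to sum to one. Cleveland's loess fits a local regression in each neighborhood; the weighted local average here is a simplification.

```python
import numpy as np

def ols_weights(x0, x):
    """Weights w_i(x0 | x) such that a + b*x0 equals sum_i w_i * y_i."""
    x = np.asarray(x, dtype=float)
    dev = x - x.mean()
    return 1.0 / x.size + (x0 - x.mean()) * dev / np.sum(dev**2)

def tricube_weights(x0, x, h):
    """Normalized tricube weights; zero outside the assumed neighborhood |x_i - x0| <= h."""
    u = np.abs(np.asarray(x, dtype=float) - x0) / h
    t = np.where(u <= 1.0, (1.0 - np.minimum(u, 1.0) ** 3) ** 3, 0.0)
    total = t.sum()
    if total == 0.0:
        raise ValueError("no observations with positive weight near x0")
    return t / total

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 10.0, 50))          # hypothetical regressor
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, 50)     # hypothetical outcome
x0 = 7.3

w_ols = ols_weights(x0, x)
b, a = np.polyfit(x, y, 1)                       # least squares slope and constant
fit_from_weights = float(w_ols @ y)              # equals a + b*x0
fit_from_line = float(a + b * x0)

w_tri = tricube_weights(x0, x, h=1.5)
local_fit = float(w_tri @ y)                     # tricube-weighted local average
far = np.abs(x - x0) > 1.5                       # points outside the neighborhood
```

In this setup every observation carries a nonzero least squares weight, however far it is from `x0`, while the tricube weights are exactly zero outside the neighborhood.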



Combining terms, then, the weight for the loess smoother is

$$w_i(x \mid \mathbf{x}, h) = \mathbf{1}(x_i \text{ in the neighborhood}) \times T_i(x \mid \mathbf{x}, h).$$

As always, the bandwidth is crucial. A wider neighborhood will produce a smoother function. But the wider neighborhood will track the data less closely than a narrower one. A second possibility, similar to the first, is to allow the neighborhood to be all points, but make the weighting function decline smoothly with the distance between x and any $x_i$. Any of the kernel functions suggested earlier will serve this purpose. This produces the kernel weighted regression estimator,

$$\hat\mu(x \mid \mathbf{x}, h) = \frac{\sum_{i=1}^{n}\frac{1}{h}K\left[\frac{x_i - x}{h}\right]y_i}{\sum_{i=1}^{n}\frac{1}{h}K\left[\frac{x_i - x}{h}\right]},$$

which has become a standard tool in nonparametric analysis.
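A minimal sketch of the kernel weighted regression estimator follows, using the logistic density as the kernel. The function names, the simulated data, and the two bandwidths are hypothetical choices made only to show the effect of the bandwidth; the common factor $1/h$ cancels between numerator and denominator and is omitted.

```python
import numpy as np

def logit_kernel(z):
    lam = 1.0 / (1.0 + np.exp(-z))
    return lam * (1.0 - lam)

def kernel_regression(x0, x, y, h, kernel=logit_kernel):
    """mu_hat(x0) = sum_i K[(x_i - x0)/h] y_i / sum_i K[(x_i - x0)/h]."""
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    k = kernel((np.asarray(x, dtype=float)[None, :] - x0[:, None]) / h)
    return (k @ np.asarray(y, dtype=float)) / k.sum(axis=1)

def roughness(f):
    """Sum of squared second differences: larger means a more wiggly curve."""
    return float(np.sum(np.diff(f, 2) ** 2))

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0.0, 10.0, 200))         # hypothetical data
y = np.sin(x) + rng.normal(0.0, 0.3, 200)
grid = np.linspace(0.5, 9.5, 50)

smooth = kernel_regression(grid, x, y, h=2.0)    # wide bandwidth: smoother curve
rough = kernel_regression(grid, x, y, h=0.1)     # narrow bandwidth: tracks the data
```

Comparing `roughness(smooth)` with `roughness(rough)` gives a crude numerical counterpart to the visual difference between a wide and a narrow bandwidth.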
Example 16.9 A Nonparametric Average Cost Function

In Example 16.7, we fit a partially linear regression for the relationship between average cost and output for electricity supply. Figures 16.6 and 16.7 show the less ambitious nonparametric regressions of average cost on output. The overall picture is the same as in the earlier example. The kernel function is the logit density in both cases. The function in Figure 16.6 uses a bandwidth of 2,000. Since this is a fairly large proportion of the range of variation of output, the function is quite smooth. The regression in Figure 16.7 uses a bandwidth of only 200. The function tracks the data better, but at an obvious cost. The example demonstrates what we and others have noted often: the choice of bandwidth in this exercise is crucial.

Data smoothing is essentially data driven. As with most nonparametric techniques, inference is not part of the analysis; this body of results is largely descriptive. As can be seen in the example, nonparametric regression can reveal interesting characteristics of the data set. For the econometrician, however, there are a few drawbacks. Most relationships are more complicated than a simple conditional mean of one variable. In the example just given, some of the variation in average cost relates to differences in factor prices (particularly fuel) and in load factors. Extension of the fully nonparametric regression to more than one variable is feasible, but very cumbersome. [See Hardle (1990).] A promising approach is the partially linear model considered earlier.


[FIGURE 16.6 Nonparametric Cost Function. Nonparametric regression for AVGCOST: E[AvgCost|Q] and AvgCost plotted against OUTPUT.]

[FIGURE 16.7 Nonparametric Cost Function. Nonparametric regression for AVGCOST: E[AvgCost|Q] and AvgCost plotted against OUTPUT.]

16.5 PROPERTIES OF ESTIMATORS

    The preceding has been concerned with methods of estimation We have surveyed a variety of techniques that have appeared in the applied literature We have not yet examined the statistical properties of these estimators Although as noted earlier we will leave extensive analysis of the asymptotic theory for more advanced treatments it is appropriate to spend at least some time on the fundamental theoretical platform which underlies these techniques

16.5.1 STATISTICAL PROPERTIES OF ESTIMATORS

Properties that we have considered are as follows:

• Unbiasedness: This is a finite sample property that can be established in only a very small number of cases. Strict unbiasedness is rarely of central importance outside the linear regression model. However, "asymptotic unbiasedness" (whereby the expectation of an estimator converges to the true parameter as the sample size grows) might be of interest. [See, e.g., Pagan and Ullah (1999, Section 2.5.1) on the subject of the kernel density estimator.] In most cases, however, discussions of asymptotic unbiasedness are actually directed toward consistency, which is a more desirable property.

• Consistency: This is a much more important property. Econometricians are rarely willing to place much credence in an estimator for which consistency cannot be established.

• Asymptotic normality: This property forms the platform for most of the statistical inference that is done with common estimators. When asymptotic normality cannot be established, for example, for the maximum score estimator discussed in Section 21.5.3, it sometimes becomes difficult to find a method of progressing beyond simple presentation of the numerical values of estimates (with caveats). However, most of the contemporary literature in macroeconomics and time series analysis is strongly focused on estimators which are decidedly not asymptotically normally distributed. The implication is that this property takes its importance only in context, not as an absolute virtue.

• Asymptotic efficiency: Efficiency can rarely be established in absolute terms. Efficiency within a class often can, however. Thus, for example, a great deal can be said about the relative efficiency of maximum likelihood and GMM estimators in the class of CAN estimators. There are two important practical considerations in this setting. First, the researcher will want to know that they have not made demonstrably suboptimal use of their data. (The literature contains discussions of GMM estimation of fully specified parametric probit models. GMM estimation in this context is unambiguously inferior to maximum likelihood.) Thus, when possible, one would want to avoid obviously inefficient estimators. On the other hand, it will usually be the case that the researcher is not choosing from a list of available estimators; they have one at hand, and questions of relative efficiency are moot.

16.5.2 EXTREMUM ESTIMATORS

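An extremum estimator is one which is obtained as the optimizer of a criterion function $q(\theta \mid \text{data})$. Three that have occupied much of our effort thus far are

• Least squares: $\hat\theta_{LS} = \text{Argmax}\left[-(1/n)\sum_{i=1}^{n}(y_i - h(x_i, \theta_{LS}))^2\right]$,
• Maximum likelihood: $\hat\theta_{ML} = \text{Argmax}\left[(1/n)\sum_{i=1}^{n}\ln f(y_i \mid x_i, \theta_{ML})\right]$,
• GMM: $\hat\theta_{GMM} = \text{Argmax}\left[-\bar m(\text{data}, \theta_{GMM})'\,W\,\bar m(\text{data}, \theta_{GMM})\right]$.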
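As a small illustration of the "Argmax" formulation, the sketch below treats least squares for a one-parameter model $h(x, \theta) = \theta x$ as the maximizer of the sign-changed criterion and compares a crude grid search with the closed-form slope. The model, the simulated data, and the grid are hypothetical choices; in practice a numerical optimizer would replace the grid.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(1.0, 5.0, 400)                   # hypothetical regressor
y = 2.0 * x + rng.normal(0.0, 1.0, 400)          # hypothetical outcome, true slope 2

def q_ls(theta):
    """Least squares criterion with the sign changed: -(1/n) sum (y_i - theta*x_i)^2."""
    return -np.mean((y - theta * x) ** 2)

# Crude extremum search: evaluate q on a grid and take the argmax.
grid = np.linspace(0.0, 4.0, 4001)               # step 0.001
q_values = np.array([q_ls(t) for t in grid])
theta_grid = float(grid[np.argmax(q_values)])

# Closed-form least squares slope for the no-intercept model.
theta_closed = float(np.sum(x * y) / np.sum(x**2))
```

Because the criterion is a sample mean of terms, one per observation, this is also an instance of the M estimators discussed in the text.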

(We have changed the signs of the first and third only for convenience so that all three may be cast as the same type of optimization problem.) The least squares and maximum likelihood estimators are examples of M estimators, which are defined by optimizing over a sum of terms. Most of the familiar theoretical results developed here and in other treatises concern the behavior of extremum estimators. Several of the estimators considered in this chapter are extremum estimators, but a few, including the Bayesian estimators, some of the semiparametric estimators, and all of the nonparametric estimators, are not. Nonetheless, we are interested in establishing the properties of estimators in all these cases, whenever possible. The end result for the practitioner will be the set of statistical properties that will allow them to draw with confidence conclusions about the data generating process(es) that have motivated the analysis in the first place.

Derivations of the behavior of extremum estimators are pursued at various levels in the literature. (See, e.g., any of the sources mentioned in Footnote 1 of this chapter.) Amemiya (1985) and Davidson and MacKinnon (1993) are very accessible treatments. Newey and McFadden (1994) is a recent, rigorous analysis that provides a current, standard source. Our discussion at this point will only suggest the elements of the analysis. The reader is referred to one of these sources for detailed proofs and derivations.

16.5.3 ASSUMPTIONS FOR ASYMPTOTIC PROPERTIES OF EXTREMUM ESTIMATORS

Some broad results are needed in order to establish the asymptotic properties of the classical (not Bayesian) conventional extremum estimators noted above.

a. The parameter space (see Section 16.2) must be convex and the parameter vector that is the object of estimation must be a point in its interior. The first requirement rules out ill defined estimation problems such as estimating a parameter which can only take one of a finite discrete set of values. Thus, searching for the date of a structural break in a time series model as if it were a conventional parameter leads to a nonconvexity. Some proofs in this context are simplified by assuming that the parameter space is compact. (A compact set is closed and bounded.) However, assuming compactness is usually restrictive, so we will opt for the weaker requirement.

b. The criterion function must be concave in the parameters. (See Section A.8.2.) This assumption implies that with a given data set, the objective function has an interior optimum and that we can locate it. Criterion functions need not be globally concave; they may have multiple optima. But, if they are not at least locally concave, then we cannot speak meaningfully about optimization. One would normally only encounter this problem in a badly structured model, but it is

possible to formulate a model in which the estimation criterion is monotonically increasing or decreasing in a parameter. Such a model would produce a nonconcave criterion function.[27] The distinction between compactness and concavity in the preceding condition is relevant at this point. If the criterion function is strictly continuous in a compact parameter space, then it has a maximum in that set and assuming concavity is not necessary. The problem for estimation, however, is that this does not rule out having that maximum occur on the (assumed) boundary of the parameter space. This case interferes with proofs of consistency and asymptotic normality. The overall problem is solved by assuming that the criterion function is concave in the neighborhood of the true parameter vector.

c. Identifiability of the parameters. Any statement that begins with "the true parameters of the model, $\theta_0$, are identified if . . ." is problematic because if the parameters are "not identified" then arguably, they are not the parameters of the (any) model. (For example, there is no "true" parameter vector in the unidentified model of Example 2.5.) A useful way to approach this question that avoids the ambiguity of trying to define the true parameter vector first and then asking if it is identified (estimable) is as follows, where we borrow from Davidson and MacKinnon (1993, p. 591): Consider the parameterized model, M, and the set of allowable data generating processes for the model, $\mu$. Under a particular parameterization $\mu$, let there be an assumed "true" parameter vector, $\theta(\mu)$. Consider any parameter vector $\theta$ in the parameter space, $\Theta$. Define

$$q_\mu(\mu, \theta) = \text{plim}_\mu\, q_n(\theta \mid \text{data}).$$

This function is the probability limit of the objective function under the assumed parameterization $\mu$. If this probability limit exists (is a finite constant) and moreover, if

$$q_\mu[\mu, \theta(\mu)] > q_\mu(\mu, \theta) \text{ if } \theta \ne \theta(\mu),$$

then if the parameter space is compact, the parameter vector is identified by the criterion function. We have not assumed compactness. For a convex parameter space, we would require the additional condition that there exist no sequences without limit points $\theta^m$ such that $q(\mu, \theta^m)$ converges to $q[\mu, \theta(\mu)]$.

The approach taken here is to assume first that the model has some set of parameters. The identifiability criterion states that assuming this is the case, the probability limit of the criterion is maximized at these parameters. This result rests on convergence of the criterion function to a finite value at any point in the interior of the parameter space. Since the criterion function is a function of the data, this convergence requires a statement of the properties of the data, e.g., well behaved in some sense. Leaving that aside for the moment, interestingly, the results to this

[27] In their Exercise 23.6, Griffiths, Hill, and Judge (1993) (based, alas, on the first edition of this text) suggest a probit model for statewide voting outcomes that includes dummy variables for region: Northeast, Southeast, West, and Mountain. One would normally include three of the four dummy variables in the model, but Griffiths et al. carefully dropped two of them because, in addition to the dummy variable trap, the Southeast variable is always zero when the dependent variable is zero. Inclusion of this variable produces a nonconcave likelihood function; the parameter on this variable diverges. Analysis of a closely related case appears as a caveat on page 272 of Amemiya (1985).


point already establish the consistency of the M estimator. In what might seem to be an extremely terse fashion, Amemiya (1985) defined identifiability simply as "existence of a consistent estimator." We see that identification and the conditions for consistency of the M estimator are substantively the same. This form of identification is necessary, in theory, to establish the consistency arguments. In any but the simplest cases, however, it will be extremely difficult to verify in practice. Fortunately, there are simpler ways to secure identification that will appeal more to the intuition:

• For the least squares estimator, a sufficient condition for identification is that any two different parameter vectors, $\theta$ and $\theta_0$, must be able to produce different values of the conditional mean function. This means that for any two different parameter vectors, there must be an $x_i$ which produces different values of the conditional mean function. You should verify that for the linear model, this is the full rank assumption A2. For the model in Example 2.5, we have a regression in which $x_2 = x_3 + x_4$. In this case, any parameter vector of the form $(\beta_1, \beta_2 - a, \beta_3 + a, \beta_4 + a)$ produces the same conditional mean as $(\beta_1, \beta_2, \beta_3, \beta_4)$ regardless of $x_i$, so this model is not identified. The full rank assumption is needed to preclude this problem. For nonlinear regressions, the problem is much more complicated, and there is no simple generality. Example 9.2 shows a nonlinear regression model that is not identified and how the lack of identification is remedied.

• For the maximum likelihood estimator, a condition similar to that for the regression model is needed. For any two parameter vectors, $\theta \ne \theta_0$, it must be possible to produce different values of the density $f(y_i \mid x_i, \theta)$ for some data vector $(y_i, x_i)$. Many econometric models that are fit by maximum likelihood are "index function" models that involve densities of the form $f(y_i \mid x_i, \theta) = f(y_i \mid x_i'\theta)$. When this is the case, the same full rank assumption that applies to the regression model may be sufficient. (If there are no other parameters in the model, then it will be sufficient.)

• For the GMM estimator, not much simplicity can be gained. A sufficient condition for identification is that $E[\bar m(\text{data}, \theta)] \ne 0$ if $\theta \ne \theta_0$.

d. Behavior of the data has been discussed at various points in the preceding text. The estimators are based on means of functions of observations. (You can see this in all three of the definitions above. Derivatives of these criterion functions will likewise be means of functions of observations.) Analysis of their large sample behaviors will turn on determining conditions under which certain sample means of functions of observations will be subject to laws of large numbers such as the Khinchine (D.5) or Chebychev (D.6) theorems, and what must be assumed in order to assert that "root-n" times sample means of functions will obey central limit theorems such as the Lindeberg-Feller (D.19) or Lyapounov (D.20) theorems for cross sections or the Martingale Difference Central Limit Theorem for dependent observations. Ultimately, this is the issue in establishing the statistical properties. The convergence property claimed above must occur in the context of the data. These conditions have been discussed in Section 5.2 and in Section 10.2.2 under the heading of "well behaved data." At this point, we will assume that the data are well behaved.
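The failure of identification described for the least squares case, where $x_2 = x_3 + x_4$, can be verified numerically. The data and parameter values below are hypothetical; only the linear dependence among the regressors is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x3 = rng.normal(size=n)
x4 = rng.normal(size=n)
x2 = x3 + x4                                    # exact linear dependence
X = np.column_stack([np.ones(n), x2, x3, x4])

beta = np.array([1.0, 2.0, 3.0, 4.0])           # hypothetical parameter values
a = 0.7                                         # any constant works
beta_alt = beta + np.array([0.0, -a, a, a])     # (b1, b2 - a, b3 + a, b4 + a)

# Both parameter vectors imply the same conditional mean at every observation.
same_mean = bool(np.allclose(X @ beta, X @ beta_alt))
rank = int(np.linalg.matrix_rank(X))            # 3, fewer than the 4 columns
```

The rank check is the practical counterpart of the full rank assumption: with fewer independent columns than parameters, no amount of data can distinguish `beta` from `beta_alt`.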

16.5.4 ASYMPTOTIC PROPERTIES OF ESTIMATORS

With all this apparatus in place, the following are the standard results on asymptotic properties of M estimators:

THEOREM 16.1 Consistency of M Estimators
If (a) the parameter space is convex and the true parameter vector is a point in its interior; (b) the criterion function is concave; (c) the parameters are identified by the criterion function; (d) the data are well behaved, then the M estimator converges in probability to the true parameter vector.

Proofs of consistency of M estimators rely on a fundamental convergence result that, itself, rests on assumptions (a) through (d) above. We have assumed identification. The fundamental device is the following: Because of its dependence on the data, $q(\theta \mid \text{data})$ is a random variable. We assumed in (c) that $\text{plim}\, q(\theta \mid \text{data}) = q_0(\theta)$ for any point in the parameter space. Assumption (c) states that the maximum of $q_0(\theta)$ occurs at $q_0(\theta_0)$, so $\theta_0$ is the maximizer of the probability limit. By its definition, the estimator $\hat\theta$ is the maximizer of $q(\theta \mid \text{data})$. Therefore, consistency requires the limit of the maximizer, $\hat\theta$, be equal to the maximizer of the limit, $\theta_0$. Our identification condition establishes this. We will use this approach in somewhat greater detail in Section 17.4.5a where we establish consistency of the maximum likelihood estimator.
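A small simulation can illustrate what consistency means for an M estimator, without proving anything. The sketch assumes a hypothetical no-intercept regression with true slope 2 and non-normal (Student t) disturbances; the estimator is the maximizer of the sign-changed least squares criterion, computed in closed form.

```python
import numpy as np

def ls_slope(x, y):
    """Maximizer of -(1/n) sum (y_i - theta*x_i)^2 for the no-intercept model."""
    return float(np.sum(x * y) / np.sum(x**2))

theta0 = 2.0                                    # hypothetical true parameter
rng = np.random.default_rng(6)
errors = {}
for n in (100, 10_000, 1_000_000):
    x = rng.uniform(1.0, 5.0, n)
    y = theta0 * x + rng.standard_t(df=5, size=n)   # non-normal disturbances
    errors[n] = abs(ls_slope(x, y) - theta0)        # estimation error at this n
```

The entries of `errors` show the distance between the estimate and `theta0` at each sample size; convergence in probability says that large errors become increasingly unlikely as n grows, not that every single draw improves on the last.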

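THEOREM 16.2 Asymptotic Normality of M Estimators
If
(i) $\hat\theta$ is a consistent estimator of $\theta_0$ where $\theta_0$ is a point in the interior of the parameter space;
(ii) $q(\theta \mid \text{data})$ is concave and twice continuously differentiable in $\theta$ in a neighborhood of $\theta_0$;
(iii) $\sqrt{n}\,[\partial q(\theta_0 \mid \text{data})/\partial\theta_0] \xrightarrow{d} N[0, \Phi]$;
(iv) for any $\theta$ in $\Theta$, $\lim_{n\to\infty} \Pr\left[\,\left|\partial^2 q(\theta \mid \text{data})/\partial\theta_k\,\partial\theta_m - h_{km}(\theta)\right| > \varepsilon\right] = 0 \;\forall\, \varepsilon > 0$, where $h_{km}(\theta)$ is a continuous finite valued function of $\theta$;
(v) the matrix of elements $H(\theta)$ is nonsingular at $\theta_0$,
then $\sqrt{n}\,(\hat\theta - \theta_0) \xrightarrow{d} N\{0, [H^{-1}(\theta_0)\,\Phi\,H^{-1}(\theta_0)]\}$.

The proof of asymptotic normality is based on the mean value theorem from calculus and a Taylor series expansion of the derivatives of the maximized criterion function around the true parameter vector:

$$\sqrt{n}\,\frac{\partial q(\hat\theta \mid \text{data})}{\partial\hat\theta} = 0 = \sqrt{n}\,\frac{\partial q(\theta_0 \mid \text{data})}{\partial\theta_0} + \frac{\partial^2 q(\bar\theta \mid \text{data})}{\partial\bar\theta\,\partial\bar\theta'}\,\sqrt{n}\,(\hat\theta - \theta_0).$$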


The second derivative is evaluated at a point $\bar\theta$ that is between $\hat\theta$ and $\theta_0$, that is, $\bar\theta = w\hat\theta + (1 - w)\theta_0$ for some $0 < w < 1$. Since we have assumed $\text{plim}\,\hat\theta = \theta_0$, we see that the matrix in the second term on the right must be converging to $H(\theta_0)$. The assumptions in the theorem can be combined to produce the claimed normal distribution. Formal proof of this set of results appears in Newey and McFadden (1994). A somewhat more detailed analysis based on this theorem appears in Section 17.4.5b where we establish the asymptotic normality of the maximum likelihood estimator.

The preceding was restricted to M estimators, so it remains to establish counterparts for the important GMM estimator. Consistency follows along the same lines used earlier, but asymptotic normality is a bit more difficult to establish. We will return to this issue in Chapter 18, where, once again, we will sketch the formal results and refer the reader to a source such as Newey and McFadden (1994) for rigorous derivation.

The preceding results are not straightforward in all estimation problems. For example, the least absolute deviations (LAD) estimator is not among the estimators noted earlier, but it is an M estimator and it shares the results given here. The analysis is complicated because the criterion function is not continuously differentiable. Nonetheless, consistency and asymptotic normality have been established. [See Koenker and Bassett (1982) and Amemiya (1985, pp. 152-154).] Some of the semiparametric and all of the nonparametric estimators noted require somewhat more intricate treatments. For example, Pagan and Ullah (1999, Sections 2.5 and 2.6) are able to establish the familiar desirable properties for the kernel density estimator $\hat f(x)$, but it requires a somewhat more involved analysis of the function and the data than is necessary, say, for the linear regression or binomial logit model. The interested reader can find many lengthy and detailed analyses of asymptotic properties of estimators in, for example, Amemiya (1985), Newey and McFadden (1994), Davidson and MacKinnon (1993), and Hayashi (2000). In practical terms, it is rarely possible to verify the conditions for an estimation problem at hand, and they are usually simply assumed. However, finding violations of the conditions is sometimes more straightforward, and this is worth pursuing. For example, lack of parametric identification can often be detected by analyzing the model itself.
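For readers who want the step between the Taylor expansion used in the proof of Theorem 16.2 and the stated limiting distribution, the following is a heuristic sketch, not a formal proof. It uses the theorem's own symbols and assumes, as an added observation, that $H(\theta_0)$ is symmetric because it is a matrix of second derivatives.

```latex
% Solve the first order expansion for the scaled estimation error:
\sqrt{n}\,(\hat\theta - \theta_0)
  = -\left[\frac{\partial^2 q(\bar\theta \mid \text{data})}
                {\partial\bar\theta\,\partial\bar\theta'}\right]^{-1}
    \sqrt{n}\,\frac{\partial q(\theta_0 \mid \text{data})}{\partial\theta_0}.

% By (i), (iv), and (v) the bracketed matrix converges to the nonsingular H(\theta_0);
% by (iii) the scaled gradient converges in distribution to N[0, \Phi]. Hence
\sqrt{n}\,(\hat\theta - \theta_0)
  \xrightarrow{d} -H^{-1}(\theta_0)\,N[0, \Phi]
  = N\left\{0,\; H^{-1}(\theta_0)\,\Phi\,H^{-1}(\theta_0)\right\}.
```

The sign in front of $H^{-1}(\theta_0)$ drops out of the variance because it enters twice, which is why the result takes the familiar "sandwich" form.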
16.5.5 TESTING HYPOTHESES

The preceding describes a set of results that (more or less) unifies the theoretical underpinnings of three of the major classes of estimators in econometrics: least squares, maximum likelihood, and GMM. A similar body of theory has been produced for the familiar test statistics, Wald, likelihood ratio (LR), and Lagrange multiplier (LM). [See Newey and McFadden (1994).] All of these have been laid out in practical terms elsewhere in this text, so in the interest of brevity, we will refer the interested reader to the background sources listed for the technical details. Table 16.5 lists the locations in this text for various presentations of the testing procedures.

TABLE 16.5 Text References for Testing Procedures

Modeling Framework    Wald          LR        LM
Least Squares         6.3.1, 6.4    17.6.1    Exercise 6.7
Nonlinear LS          9.4.1         9.4.1     9.4.2
Maximum Likelihood    17.5.2        17.5.1    17.5.3
GMM                   18.4.2        18.4.2    18.4.2

    16.6 SUMMARY AND CONCLUSIONS

    This chapter has presented a short overview of estimation in econometrics. There are various ways to approach such a survey. The current literature can be broadly grouped by three major types of estimators: parametric, semiparametric, and nonparametric. It has been suggested that the overall drift in the literature is from the first toward the third of these, but on a closer look, we see that this is probably not the case. Maximum likelihood is still the estimator of choice in many settings. New applications have been found for the GMM estimator, but at the same time, new Bayesian and simulation estimators, all fully parametric, are emerging at a rapid pace. Certainly, the range of tools that can be applied in any setting is growing steadily.

    Key Terms and Concepts

    Bandwidth; Bayesian estimation; Bayes factor; Bayes Theorem; Conditional density; Conjugate prior; Criterion function; Data generating mechanism; Density; Estimation criterion; Extremum estimator; Generalized method of moments; Gibbs sampler; Hierarchical Bayes; Highest posterior density interval; Histogram; Informative prior; Inverted gamma distribution; Joint posterior distribution; Kernel density estimator; Latent class model; Least absolute deviations; Likelihood function; Linear model; Loss function; M estimator; Markov Chain Monte Carlo method; Maximum likelihood estimator; Method of moments; Metropolis Hastings algorithm; Multivariate t distribution; Nearest neighbor; Noninformative prior; Nonparametric estimators; Normal gamma; Parameter space; Parametric estimation; Partially linear model; Posterior density; Precision matrices; Prior belief; Prior distribution; Prior odds ratio; Prior probabilities; Quantile regression; Semiparametric estimation; Simulation based estimation; Smoothing function

    Exercises and Questions

    1. Compare the fully parametric and semiparametric approaches to estimation of a discrete choice model such as the multinomial logit model discussed in Chapter 21. What are the benefits and costs of the semiparametric approach?

    2. Asymptotics take on a different meaning in the Bayesian estimation context, since parameters do not converge to a population quantity. Nonetheless, in a Bayesian estimation setting, as the sample size increases, the likelihood function will dominate the posterior density. What does this imply about the Bayesian estimator when this occurs?

    3. Referring to the situation in Question 2, one might think that an informative prior would outweigh the effect of the increasing sample size. With respect to the Bayesian analysis of the linear regression, analyze the way in which the likelihood and an informative prior will compete for dominance in the posterior mean.


    The following exercises require specific software. The relevant techniques are available in several packages that might be in use, such as SAS, Stata, or LIMDEP. The exercises are suggested as departure points for explorations using a few of the many estimation techniques listed in this chapter.

    4. Using the gasoline market data in Appendix Table F2.2, use the partially linear regression method in Section 16.3.3 to fit an equation of the form

    $$\ln(G/Pop) = \beta_1 \ln(Income) + \beta_2 \ln P_{new\ cars} + \beta_3 \ln P_{used\ cars} + g(\ln P_{gasoline}) + \varepsilon.$$

    5. To continue the analysis in Question 4, consider a nonparametric regression of G/Pop on the price. Using the nonparametric estimation method in Section 16.4.2, fit the nonparametric estimator using a range of bandwidth values to explore the effect of bandwidth.

    6. (You might find it useful to read the early sections of Chapter 21 for this exercise.) The extramarital affairs data analyzed in Section 22.3.7 can be reinterpreted in the context of a binary choice model. The dependent variable in the analysis is a count of events. Using these data, first recode the dependent variable 0 for none and 1 for more than zero. Now, first using the binary probit estimator, fit a binary choice model using the same independent variables as in the example discussed in Section 22.3.7. Then, using a semiparametric or nonparametric estimator, estimate the same binary choice model. A model for binary choice can be fit for at least two purposes: for estimation of interesting coefficients or for prediction of the dependent variable. Use your estimated models for these two purposes and compare the two models.


    17

    MAXIMUM LIKELIHOOD ESTIMATION

    17.1 INTRODUCTION

    The generalized method of moments discussed in Chapter 18 and the semiparametric, nonparametric, and Bayesian estimators discussed in Chapter 16 are becoming widely used by model builders. Nonetheless, the maximum likelihood estimator discussed in this chapter remains the preferred estimator in many more settings than the others listed. As such, we focus our discussion of generally applied estimation methods on this technique. Sections 17.2 through 17.5 present statistical results for estimation and hypothesis testing based on the maximum likelihood principle. After establishing some general results for this method of estimation, we will then extend them to the more familiar setting of econometric models. Some applications are presented in Section 17.6. Finally, three variations on the technique, maximum simulated likelihood, two-step estimation, and pseudomaximum likelihood estimation, are described in Sections 17.7 through 17.9.

    17.2 THE LIKELIHOOD FUNCTION AND IDENTIFICATION OF THE PARAMETERS

    The probability density function, or pdf, for a random variable y, conditioned on a set of parameters, $\theta$, is denoted $f(y \mid \theta)$ (see footnote 1). This function identifies the data generating process that underlies an observed sample of data and, at the same time, provides a mathematical description of the data that the process will produce. The joint density of n independent and identically distributed (iid) observations from this process is the product of the individual densities:

    $$f(y_1, \ldots, y_n \mid \theta) = \prod_{i=1}^{n} f(y_i \mid \theta) = L(\theta \mid y). \tag{17-1}$$

    Footnote 1: Later we will extend this to the case of a random vector, y, with a multivariate density, but at this point, that would complicate the notation without adding anything of substance to the discussion.

    This joint density is the likelihood function, defined as a function of the unknown parameter vector, $\theta$, where y is used to indicate the collection of sample data. Note that we write the joint density as a function of the data conditioned on the parameters, whereas when we form the likelihood function, we write this function in reverse, as a function of the parameters, conditioned on the data. Though the two functions are the same, it is to be emphasized that the likelihood function is written in this fashion to highlight our interest in the parameters and the information about them that is contained in the observed data. However, it is understood that the likelihood function is not meant to represent a probability density for the parameters as it is in Section 16.2.2. In this classical estimation framework, the parameters are assumed to be fixed constants which we hope to learn about from the data. It is usually simpler to work with the log of the likelihood function:

    $$\ln L(\theta \mid y) = \sum_{i=1}^{n} \ln f(y_i \mid \theta). \tag{17-2}$$

    Again, to emphasize our interest in the parameters, given the observed data, we denote this function $L(\theta \mid \text{data}) = L(\theta \mid y)$. The likelihood function and its logarithm, evaluated at $\theta$, are sometimes denoted simply $L(\theta)$ and $\ln L(\theta)$, respectively, or, where no ambiguity can arise, just L or ln L.

    It will usually be necessary to generalize the concept of the likelihood function to allow the density to depend on other conditioning variables. To jump immediately to one of our central applications, suppose the disturbance in the classical linear regression model is normally distributed. Then, conditioned on its specific $x_i$, $y_i$ is normally distributed with mean $\mu_i = x_i'\beta$ and variance $\sigma^2$. That means that the observed random variables are not iid; they have different means. Nonetheless, the observations are independent, and as we will examine in closer detail,

    $$\ln L(\theta \mid y, X) = \sum_{i=1}^{n} \ln f(y_i \mid x_i, \theta) = -\frac{1}{2} \sum_{i=1}^{n} \left[ \ln \sigma^2 + \ln(2\pi) + (y_i - x_i'\beta)^2 / \sigma^2 \right], \tag{17-3}$$

    where X is the $n \times K$ matrix of data with ith row equal to $x_i'$.

    The rest of this chapter will be concerned with obtaining estimates of the parameters, $\theta$, and with testing hypotheses about them and about the data generating process. Before we begin that study, we consider the question of whether estimation of the parameters is possible at all: the question of identification. Identification is an issue related to the formulation of the model. The issue of identification must be resolved before estimation can even be considered. The question posed is essentially this: Suppose we had an infinitely large sample, that is, for current purposes, all the information there is to be had about the parameters. Could we uniquely determine the values of $\theta$ from such a sample? As will be clear shortly, the answer is sometimes no.

    DEFINITION 17.1 Identification. The parameter vector $\theta_0$ is identified (estimable) if for any other parameter vector, $\theta \neq \theta_0$, for some data y, $L(\theta_0 \mid y) \neq L(\theta \mid y)$.

    This result will be crucial at several points in what follows. We consider two examples, the first of which will be very familiar to you by now.

    Example 17.1 Identification of Parameters

    For the regression model specified in (17-3), suppose that there is a nonzero vector a such that $x_i'a = 0$ for every $x_i$. Then there is another parameter vector, $\gamma = \beta + a \neq \beta$, such that $x_i'\beta = x_i'\gamma$ for every $x_i$. You can see in (17-3) that if this is the case, then the log-likelihood is the same whether it is evaluated at $\beta$ or at $\gamma$. As such, it is not possible to consider estimation of $\beta$ in this model since $\beta$ cannot be distinguished from $\gamma$. This is the case of perfect collinearity in the regression model which we ruled out when we first proposed the linear regression model with Assumption 2, Identifiability of the Model Parameters.

    The preceding dealt with a necessary characteristic of the sample data. We now consider a model in which identification is secured by the specification of the parameters in the model. (We will study this model in detail in Chapter 21.) Consider a simple form of the regression model considered above, $y_i = \beta_1 + \beta_2 x_i + \varepsilon_i$, where $\varepsilon_i \mid x_i$ has a normal distribution with zero mean and variance $\sigma^2$. To put the model in a context, consider a consumer's purchases of a large commodity such as a car, where $x_i$ is the consumer's income and $y_i$ is the difference between what the consumer is willing to pay for the car, $p_i^*$, and the price tag on the car, $p_i$. Suppose rather than observing $p_i^*$ or $p_i$, we observe only whether the consumer actually purchases the car, which, we assume, occurs when $y_i = p_i^* - p_i$ is positive. Collecting this information, our model states that they will purchase the car if $y_i > 0$ and not purchase it if $y_i \le 0$. Let us form the likelihood function for the observed data, which are (purchase or not) and income. The random variable in this model is "purchase" or "not purchase"; there are only two outcomes. The probability of a purchase is

    $$\begin{aligned}
    \text{Prob}(\text{purchase} \mid \beta_1, \beta_2, \sigma, x_i) &= \text{Prob}(y_i > 0 \mid \beta_1, \beta_2, \sigma, x_i) \\
    &= \text{Prob}(\beta_1 + \beta_2 x_i + \varepsilon_i > 0 \mid \beta_1, \beta_2, \sigma, x_i) \\
    &= \text{Prob}[\varepsilon_i > -(\beta_1 + \beta_2 x_i) \mid \beta_1, \beta_2, \sigma, x_i] \\
    &= \text{Prob}[\varepsilon_i/\sigma > -(\beta_1 + \beta_2 x_i)/\sigma \mid \beta_1, \beta_2, \sigma, x_i] \\
    &= \text{Prob}[z_i > -(\beta_1 + \beta_2 x_i)/\sigma \mid \beta_1, \beta_2, \sigma, x_i],
    \end{aligned}$$

    where $z_i$ has a standard normal distribution. The probability of not purchase is just one minus this probability. The likelihood function is

    $$\prod_{i = \text{purchased}} [\text{Prob}(\text{purchase} \mid \beta_1, \beta_2, \sigma, x_i)] \prod_{i = \text{not purchased}} [1 - \text{Prob}(\text{purchase} \mid \beta_1, \beta_2, \sigma, x_i)].$$

    We need go no further to see that the parameters of this model are not identified. If $\beta_1$, $\beta_2$, and $\sigma$ are all multiplied by the same nonzero constant, regardless of what it is, then Prob(purchase) is unchanged, 1 - Prob(purchase) is also, and the likelihood function does not change. This model requires a normalization. The one usually used is $\sigma = 1$, but some authors [e.g., Horowitz (1993)] have used $\beta_1 = 1$ instead.
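    The lack of identification in the purchase model can be seen numerically. The following is a minimal sketch, not part of the original example: the income and purchase values are invented for illustration, and the function names are hypothetical. It evaluates the log of the likelihood above at one parameter vector and again after multiplying all three parameters by the same positive constant.

```python
import math

def std_normal_cdf(z):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def purchase_log_likelihood(beta1, beta2, sigma, incomes, purchased):
    """Log-likelihood of the purchase / no-purchase data in the example."""
    total = 0.0
    for x, d in zip(incomes, purchased):
        # Prob(purchase) = Prob[z > -(beta1 + beta2*x)/sigma]
        #                = Phi[(beta1 + beta2*x)/sigma] by symmetry of the normal
        p = std_normal_cdf((beta1 + beta2 * x) / sigma)
        total += math.log(p) if d == 1 else math.log(1.0 - p)
    return total

# Hypothetical data, invented only for this illustration.
incomes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
purchased = [0, 0, 1, 0, 1, 1]

base = purchase_log_likelihood(-1.5, 0.5, 1.0, incomes, purchased)
# Multiply beta1, beta2 and sigma by the same constant (3 here).
scaled = purchase_log_likelihood(-4.5, 1.5, 3.0, incomes, purchased)
print(base, scaled)  # the two values coincide
```

    Because only the ratios $\beta_1/\sigma$ and $\beta_2/\sigma$ enter the probabilities, no sample can separate the two parameter vectors, which is why a normalization is needed.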

    17.3 EFFICIENT ESTIMATION: THE PRINCIPLE OF MAXIMUM LIKELIHOOD

    The principle of maximum likelihood provides a means of choosing an asymptotically efficient estimator for a parameter or a set of parameters. The logic of the technique is easily illustrated in the setting of a discrete distribution. Consider a random sample of the following 10 observations from a Poisson distribution: 5, 0, 1, 1, 0, 3, 2, 3, 4, and 1. The density for each observation is

    $$f(y_i \mid \theta) = \frac{e^{-\theta} \theta^{y_i}}{y_i!}.$$


    FIGURE 17.1 Likelihood and Log-likelihood Functions for a Poisson Distribution. [Plot of $L(\theta \mid x) \times 10^7$ and $\ln L(\theta \mid x)$ against $\theta$.]

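    Since the observations are independent, their joint density, which is the likelihood for this sample, is

    $$f(y_1, y_2, \ldots, y_{10} \mid \theta) = \prod_{i=1}^{10} f(y_i \mid \theta) = \frac{e^{-10\theta} \theta^{\sum_{i=1}^{10} y_i}}{\prod_{i=1}^{10} y_i!} = \frac{e^{-10\theta} \theta^{20}}{207{,}360}.$$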

    The last result gives the probability of observing this particular sample, assuming that a Poisson distribution with as yet unknown parameter $\theta$ generated the data. What value of $\theta$ would make this sample most probable? Figure 17.1 plots this function for various values of $\theta$. It has a single mode at $\theta = 2$, which would be the maximum likelihood estimate, or MLE, of $\theta$.

    Consider maximizing $L(\theta \mid y)$ with respect to $\theta$. Since the log function is monotonically increasing and easier to work with, we usually maximize $\ln L(\theta \mid y)$ instead; in sampling from a Poisson population,

    $$\ln L(\theta \mid y) = -n\theta + \ln \theta \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \ln(y_i!),$$

    $$\frac{\partial \ln L(\theta \mid y)}{\partial \theta} = -n + \frac{1}{\theta} \sum_{i=1}^{n} y_i = 0 \;\Rightarrow\; \hat{\theta}_{ML} = \bar{y}_n.$$

    For the assumed sample of observations,

    $$\ln L(\theta \mid y) = -10\theta + 20 \ln \theta - 12.242,$$

    $$\frac{d \ln L(\theta \mid y)}{d\theta} = -10 + \frac{20}{\theta} = 0 \;\Rightarrow\; \hat{\theta} = 2,$$

    and

    $$\frac{d^2 \ln L(\theta \mid y)}{d\theta^2} = \frac{-20}{\theta^2} < 0 \;\Rightarrow\; \text{this is a maximum.}$$

    The solution is the same as before. Figure 17.1 also plots the log of $L(\theta \mid y)$ to illustrate the result.

    The reference to the probability of observing the given sample is not exact in a continuous distribution, since a particular sample has probability zero. Nonetheless, the principle is the same. The values of the parameters that maximize $L(\theta \mid \text{data})$ or its log are the maximum likelihood estimates, denoted $\hat{\theta}$. Since the logarithm is a monotonic function, the values that maximize $L(\theta \mid \text{data})$ are the same as those that maximize $\ln L(\theta \mid \text{data})$. The necessary condition for maximizing $\ln L(\theta \mid \text{data})$ is

    $$\frac{\partial \ln L(\theta \mid \text{data})}{\partial \theta} = 0. \tag{17-4}$$

    This is called the likelihood equation. The general result then is that the MLE is a root of the likelihood equation. The application to the parameters of the dgp for a discrete random variable is suggestive that maximum likelihood is a good use of the data. It remains to establish this as a general principle. We turn to that issue in the next section.
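    The Poisson calculation above is easy to reproduce. The sketch below uses the ten observations given in the text; the function names and the grid used for the brute-force check are illustrative choices, not part of the original.

```python
import math

sample = [5, 0, 1, 1, 0, 3, 2, 3, 4, 1]  # the ten observations from the text

def poisson_log_likelihood(theta, y):
    """ln L(theta | y) = -n*theta + ln(theta)*sum(y) - sum(ln(y_i!))."""
    n = len(y)
    log_factorials = sum(math.log(math.factorial(v)) for v in y)
    return -n * theta + math.log(theta) * sum(y) - log_factorials

def poisson_score(theta, y):
    """First derivative of the log-likelihood with respect to theta."""
    return -len(y) + sum(y) / theta

# Root of the likelihood equation: the sample mean.
theta_hat = sum(sample) / len(sample)

# Independent check by a crude grid search (grid bounds and step are arbitrary).
grid = [0.5 + 0.01 * k for k in range(301)]
theta_grid = max(grid, key=lambda t: poisson_log_likelihood(t, sample))

print(theta_hat, theta_grid, poisson_score(theta_hat, sample))
```

    The grid search and the analytic root agree because the Poisson log-likelihood is strictly concave in $\theta$ whenever the sample sum is positive, so the root of the likelihood equation is the unique maximizer.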
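    Example 17.2 Log Likelihood Function and Likelihood Equations for the Normal Distribution

    In sampling from a normal distribution with mean $\mu$ and variance $\sigma^2$, the log-likelihood function and the likelihood equations for $\mu$ and $\sigma^2$ are

    $$\ln L(\mu, \sigma) = -\frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln \sigma^2 - \frac{1}{2} \sum_{i=1}^{n} \frac{(y_i - \mu)^2}{\sigma^2}, \tag{17-5}$$

    $$\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \mu) = 0, \tag{17-6}$$

    $$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (y_i - \mu)^2 = 0. \tag{17-7}$$

    To solve the likelihood equations, multiply (17-6) by $\sigma^2$ and solve for $\hat{\mu}$, then insert this solution in (17-7) and solve for $\sigma^2$. The solutions are

    $$\hat{\mu}_{ML} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}_n \quad \text{and} \quad \hat{\sigma}^2_{ML} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y}_n)^2. \tag{17-8}$$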
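    As a check on Example 17.2, the closed-form solutions can be substituted back into the two likelihood equations. This is a small sketch under stated assumptions: the five data values are hypothetical and the function names are invented for the illustration.

```python
def normal_mle(y):
    """Closed-form solutions (17-8) for the normal mean and variance."""
    n = len(y)
    mu_hat = sum(y) / n
    sigma2_hat = sum((v - mu_hat) ** 2 for v in y) / n  # divisor is n, not n - 1
    return mu_hat, sigma2_hat

def normal_scores(mu, sigma2, y):
    """Left-hand sides of the likelihood equations (17-6) and (17-7)."""
    n = len(y)
    d_mu = sum(v - mu for v in y) / sigma2
    d_sigma2 = -n / (2.0 * sigma2) + sum((v - mu) ** 2 for v in y) / (2.0 * sigma2 ** 2)
    return d_mu, d_sigma2

y = [2.1, 3.4, 1.9, 4.0, 2.6]  # hypothetical data for illustration only
mu_hat, sigma2_hat = normal_mle(y)
d_mu, d_sigma2 = normal_scores(mu_hat, sigma2_hat, y)
print(mu_hat, sigma2_hat, d_mu, d_sigma2)  # both derivatives are zero up to rounding
```

    Note the divisor n in the variance estimate; this is the source of the downward finite-sample bias of the MLE of $\sigma^2$.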

    17.4 PROPERTIES OF MAXIMUM LIKELIHOOD ESTIMATORS

    Maximum likelihood estimators (MLEs) are most attractive because of their large-sample or asymptotic properties.


    DEFINITION 17.2 Asymptotic Efficiency. An estimator is asymptotically efficient if it is consistent, asymptotically normally distributed (CAN), and has an asymptotic covariance matrix that is not larger than the asymptotic covariance matrix of any other consistent, asymptotically normally distributed estimator (see footnote 2).

    If certain regularity conditions are met, the MLE will have these properties. The finite sample properties are sometimes less than optimal. For example, the MLE may be biased; the MLE of $\sigma^2$ in Example 17.2 is biased downward. The occasional statement that the properties of the MLE are only optimal in large samples is not true, however. It can be shown that when sampling is from an exponential family of distributions (see Definition 18.1), there will exist sufficient statistics. If so, MLEs will be functions of them, which means that when minimum variance unbiased estimators exist, they will be MLEs. [See Stuart and Ord (1989).] Most applications in econometrics do not involve exponential families, so the appeal of the MLE remains primarily its asymptotic properties.

    We use the following notation: $\hat{\theta}$ is the maximum likelihood estimator; $\theta_0$ denotes the true value of the parameter vector; $\theta$ denotes another possible value of the parameter vector, not the MLE and not necessarily the true values. Expectation based on the true values of the parameters is denoted $E_0[\cdot]$. If we assume that the regularity conditions discussed below are met by $f(x, \theta_0)$, then we have the following theorem.

    THEOREM 17.1 Properties of an MLE. Under regularity, the maximum likelihood estimator (MLE) has the following asymptotic properties:

    M1. Consistency: plim $\hat{\theta} = \theta_0$.
    M2. Asymptotic normality: $\hat{\theta} \overset{a}{\sim} N[\theta_0, \{I(\theta_0)\}^{-1}]$, where $I(\theta_0) = -E_0[\partial^2 \ln L / \partial\theta_0 \, \partial\theta_0']$.
    M3. Asymptotic efficiency: $\hat{\theta}$ is asymptotically efficient and achieves the Cramér-Rao lower bound for consistent estimators, given in M2 and Theorem C.2.
    M4. Invariance: The maximum likelihood estimator of $\gamma_0 = c(\theta_0)$ is $c(\hat{\theta})$ if $c(\theta_0)$ is a continuous and continuously differentiable function.

    Footnote 2: Not larger is defined in the sense of (A-118): The covariance matrix of the less efficient estimator equals that of the efficient estimator plus a nonnegative definite matrix.

    17.4.1 REGULARITY CONDITIONS

    To sketch proofs of these results, we first obtain some useful properties of probability density functions. We assume that $(y_1, \ldots, y_n)$ is a random sample from the population with density function $f(y_i \mid \theta_0)$ and that the following regularity conditions hold. [Our statement of these is informal. A more rigorous treatment may be found in Stuart and Ord (1989) or Davidson and MacKinnon (1993).]

    DEFINITION 17.3 Regularity Conditions

    R1. The first three derivatives of $\ln f(y_i \mid \theta)$ with respect to $\theta$ are continuous and finite for almost all $y_i$ and for all $\theta$. This condition ensures the existence of a certain Taylor series approximation and the finite variance of the derivatives of ln L.
    R2. The conditions necessary to obtain the expectations of the first and second derivatives of $\ln f(y_i \mid \theta)$ are met.
    R3. For all values of $\theta$, $|\partial^3 \ln f(y_i \mid \theta) / \partial\theta_j \, \partial\theta_k \, \partial\theta_l|$ is less than a function that has a finite expectation. This condition will allow us to truncate the Taylor series.

    With these regularity conditions, we will obtain the following fundamental characteristics of $f(y_i \mid \theta)$: D1 is simply a consequence of the definition of the likelihood function. D2 leads to the moment condition which defines the maximum likelihood estimator. On the one hand, the MLE is found as the maximizer of a function, which mandates finding the vector which equates the gradient to zero. On the other, D2 is a more fundamental relationship which places the MLE in the class of generalized method of moments estimators. D3 produces what is known as the information matrix equality. This relationship shows how to obtain the asymptotic covariance matrix of the MLE.

    17.4.2 PROPERTIES OF REGULAR DENSITIES

    Densities that are regular by Definition 17.3 have three properties which are used in establishing the properties of maximum likelihood estimators:

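    THEOREM 17.2 Moments of the Derivatives of the Log-Likelihood

    D1. $\ln f(y_i \mid \theta)$, $g_i = \partial \ln f(y_i \mid \theta) / \partial\theta$, and $H_i = \partial^2 \ln f(y_i \mid \theta) / \partial\theta \, \partial\theta'$, $i = 1, \ldots, n$, are all random samples of random variables. This statement follows from our assumption of random sampling. The notation $g_i(\theta_0)$ and $H_i(\theta_0)$ indicates the derivative evaluated at $\theta_0$.
    D2. $E_0[g_i(\theta_0)] = 0$.
    D3. $\text{Var}[g_i(\theta_0)] = -E[H_i(\theta_0)]$.

    Condition D1 is simply a consequence of the definition of the density.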

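    For the moment, we allow the range of $y_i$ to depend on the parameters: $A(\theta_0) \le y_i \le B(\theta_0)$. (Consider, for example, finding the maximum likelihood estimator of $\theta_0$ for a continuous uniform distribution with range $[0, \theta_0]$.) (In the following, the single integral $\int \ldots dy_i$ would be used to indicate the multiple integration over all the elements of a multivariate $y_i$ if that were necessary.) By definition,

    $$\int_{A(\theta_0)}^{B(\theta_0)} f(y_i \mid \theta_0) \, dy_i = 1.$$

    Now, differentiate this expression with respect to $\theta_0$. Leibnitz's theorem gives

    $$\frac{\partial \int_{A(\theta_0)}^{B(\theta_0)} f(y_i \mid \theta_0) \, dy_i}{\partial\theta_0} = \int_{A(\theta_0)}^{B(\theta_0)} \frac{\partial f(y_i \mid \theta_0)}{\partial\theta_0} \, dy_i + f(B(\theta_0) \mid \theta_0) \frac{\partial B(\theta_0)}{\partial\theta_0} - f(A(\theta_0) \mid \theta_0) \frac{\partial A(\theta_0)}{\partial\theta_0} = 0.$$

    If the second and third terms go to zero, then we may interchange the operations of differentiation and integration. The necessary condition is that $\lim_{y_i \to A(\theta_0)} f(y_i \mid \theta_0) = \lim_{y_i \to B(\theta_0)} f(y_i \mid \theta_0) = 0$. (Note that the uniform distribution suggested above violates this condition.) Sufficient conditions are that the range of the observed random variable, $y_i$, does not depend on the parameters, which means that $\partial A(\theta_0)/\partial\theta_0 = \partial B(\theta_0)/\partial\theta_0 = 0$, or that the density is zero at the terminal points. This condition, then, is regularity condition R2. The latter is usually assumed, and we will assume it in what follows. So,

    $$\frac{\partial \int f(y_i \mid \theta_0) \, dy_i}{\partial\theta_0} = \int \frac{\partial f(y_i \mid \theta_0)}{\partial\theta_0} \, dy_i = \int \frac{\partial \ln f(y_i \mid \theta_0)}{\partial\theta_0} f(y_i \mid \theta_0) \, dy_i = E_0\left[\frac{\partial \ln f(y_i \mid \theta_0)}{\partial\theta_0}\right] = 0.$$

    This proves D2. Since we may interchange the operations of integration and differentiation, we differentiate under the integral once again to obtain

    $$\int \left[ \frac{\partial^2 \ln f(y_i \mid \theta_0)}{\partial\theta_0 \, \partial\theta_0'} f(y_i \mid \theta_0) + \frac{\partial \ln f(y_i \mid \theta_0)}{\partial\theta_0} \frac{\partial f(y_i \mid \theta_0)}{\partial\theta_0'} \right] dy_i = 0.$$

    But

    $$\frac{\partial f(y_i \mid \theta_0)}{\partial\theta_0'} = f(y_i \mid \theta_0) \frac{\partial \ln f(y_i \mid \theta_0)}{\partial\theta_0'},$$

    and the integral of a sum is the sum of integrals. Therefore,

    $$-\int \left[ \frac{\partial^2 \ln f(y_i \mid \theta_0)}{\partial\theta_0 \, \partial\theta_0'} \right] f(y_i \mid \theta_0) \, dy_i = \int \left[ \frac{\partial \ln f(y_i \mid \theta_0)}{\partial\theta_0} \frac{\partial \ln f(y_i \mid \theta_0)}{\partial\theta_0'} \right] f(y_i \mid \theta_0) \, dy_i.$$

    The left-hand side of the equation is the negative of the expected second derivatives matrix. The right-hand side is the expected square (outer product) of the first derivative vector. But, since this vector has expected value 0 (we just showed this), the right-hand side is the variance of the first derivative vector, which proves D3:

    $$\text{Var}_0\left[\frac{\partial \ln f(y_i \mid \theta_0)}{\partial\theta_0}\right] = E_0\left[\left(\frac{\partial \ln f(y_i \mid \theta_0)}{\partial\theta_0}\right)\left(\frac{\partial \ln f(y_i \mid \theta_0)}{\partial\theta_0'}\right)\right] = -E\left[\frac{\partial^2 \ln f(y_i \mid \theta_0)}{\partial\theta_0 \, \partial\theta_0'}\right].$$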


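    17.4.3 THE LIKELIHOOD EQUATION

    The log-likelihood function is

    $$\ln L(\theta \mid y) = \sum_{i=1}^{n} \ln f(y_i \mid \theta).$$

    The first derivative vector, or score vector, is

    $$g = \frac{\partial \ln L(\theta \mid y)}{\partial\theta} = \sum_{i=1}^{n} \frac{\partial \ln f(y_i \mid \theta)}{\partial\theta} = \sum_{i=1}^{n} g_i. \tag{17-9}$$

    Since we are just adding terms, it follows from D1 and D2 that at $\theta_0$,

    $$E_0\left[\frac{\partial \ln L(\theta_0 \mid y)}{\partial\theta_0}\right] = E_0[g_0] = 0, \tag{17-10}$$

    which is the likelihood equation mentioned earlier.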
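    17.4.4 THE INFORMATION MATRIX EQUALITY

    The Hessian of the log-likelihood is

    $$H = \frac{\partial^2 \ln L(\theta \mid y)}{\partial\theta \, \partial\theta'} = \sum_{i=1}^{n} \frac{\partial^2 \ln f(y_i \mid \theta)}{\partial\theta \, \partial\theta'} = \sum_{i=1}^{n} H_i.$$

    Evaluating once again at $\theta_0$, by taking

    $$E_0[g_0 g_0'] = E_0\left[\sum_{i=1}^{n} \sum_{j=1}^{n} g_{0i} g_{0j}'\right]$$

    and, because of D1, dropping terms with unequal subscripts we obtain

    $$E_0[g_0 g_0'] = E_0\left[\sum_{i=1}^{n} g_{0i} g_{0i}'\right] = E_0\left[\sum_{i=1}^{n} (-H_{0i})\right] = -E_0[H_0],$$

    so that

    $$\text{Var}_0\left[\frac{\partial \ln L(\theta_0 \mid y)}{\partial\theta_0}\right] = E_0\left[\left(\frac{\partial \ln L(\theta_0 \mid y)}{\partial\theta_0}\right)\left(\frac{\partial \ln L(\theta_0 \mid y)}{\partial\theta_0'}\right)\right] = -E_0\left[\frac{\partial^2 \ln L(\theta_0 \mid y)}{\partial\theta_0 \, \partial\theta_0'}\right]. \tag{17-11}$$

    This very useful result is known as the information matrix equality.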
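    Results D2 and D3 can be verified exactly for a simple density. The sketch below assumes a single Poisson observation with true parameter $\theta_0 = 2$ (a choice made only for this illustration) and computes the expectations by summing over the probability function; the truncation point of the sum is an assumption that leaves a negligible remainder.

```python
import math

def poisson_pmf(y, theta):
    """Poisson probability, computed in logs to avoid overflow in y!."""
    return math.exp(-theta + y * math.log(theta) - math.lgamma(y + 1))

def score(y, theta):
    """g_i: first derivative of ln f(y | theta) with respect to theta."""
    return y / theta - 1.0

def hessian(y, theta):
    """H_i: second derivative of ln f(y | theta) with respect to theta."""
    return -y / theta ** 2

theta0 = 2.0
support = range(0, 200)  # truncation point; the omitted mass is negligible here

mean_score = sum(score(y, theta0) * poisson_pmf(y, theta0) for y in support)
var_score = sum(score(y, theta0) ** 2 * poisson_pmf(y, theta0) for y in support)
neg_exp_hessian = -sum(hessian(y, theta0) * poisson_pmf(y, theta0) for y in support)

print(mean_score)                  # D2: expected score is zero
print(var_score, neg_exp_hessian)  # D3: both equal 1/theta0
```

    For the Poisson case both sides of D3 equal $1/\theta_0$, so with n observations the information is $n/\theta_0$.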
    17.4.5 ASYMPTOTIC PROPERTIES OF THE MAXIMUM LIKELIHOOD ESTIMATOR

    We can now sketch a derivation of the asymptotic properties of the MLE. Formal proofs of these results require some fairly intricate mathematics. Two widely cited derivations are those of Cramér (1948) and Amemiya (1985). To suggest the flavor of the exercise, we will sketch an analysis provided by Stuart and Ord (1989) for a simple case, and indicate where it will be necessary to extend the derivation if it were to be fully general.

    17.4.5.a CONSISTENCY

    We assume that $f(y_i \mid \theta_0)$ is a possibly multivariate density which at this point does not depend on covariates, $x_i$. Thus, this is the iid, random sampling case. Since $\hat{\theta}$ is the MLE, in any finite sample, for any $\theta \neq \hat{\theta}$ (including the true $\theta_0$) it must be true that

    $$\ln L(\hat{\theta}) \ge \ln L(\theta). \tag{17-12}$$

    Consider, then, the random variable $L(\theta)/L(\theta_0)$. Since the log function is strictly concave, from Jensen's Inequality (Theorem D.8), we have

    $$E_0\left[\log \frac{L(\theta)}{L(\theta_0)}\right] < \log E_0\left[\frac{L(\theta)}{L(\theta_0)}\right]. \tag{17-13}$$

    The expectation on the right-hand side is exactly equal to one, as

    $$E_0\left[\frac{L(\theta)}{L(\theta_0)}\right] = \int \left(\frac{L(\theta)}{L(\theta_0)}\right) L(\theta_0) \, dy = 1 \tag{17-14}$$

    is simply the integral of a joint density. Now, take logs on both sides of (17-14), insert the result in (17-13), then divide by n to produce

    $$E_0[(1/n) \ln L(\theta)] - E_0[(1/n) \ln L(\theta_0)] < 0. \tag{17-15}$$

    This produces a central result:

    THEOREM 17.3 Likelihood Inequality. $E_0[(1/n) \ln L(\theta_0)] > E_0[(1/n) \ln L(\theta)]$ for any $\theta \neq \theta_0$ (including $\hat{\theta}$). This result is (17-15).

    In words, the expected value of the log-likelihood is maximized at the true value of the parameters. For any $\theta$, including $\hat{\theta}$,

    $$(1/n) \ln L(\theta) = (1/n) \sum_{i=1}^{n} \ln f(y_i \mid \theta)$$

    is the sample mean of n iid random variables, with expectation $E_0[(1/n) \ln L(\theta)]$. Since the sampling is iid by the regularity conditions, we can invoke the Khinchine Theorem, D.5; the sample mean converges in probability to the population mean. Using $\theta = \hat{\theta}$, it follows from Theorem 17.3 that as $n \to \infty$, $\lim \text{Prob}\{[(1/n) \ln L(\hat{\theta})] < [(1/n) \ln L(\theta_0)]\} = 1$ if $\hat{\theta} \neq \theta_0$. But, $\hat{\theta}$ is the MLE, so for every n, $(1/n) \ln L(\hat{\theta}) \ge (1/n) \ln L(\theta_0)$. The only way these can both be true is if (1/n) times the sample log-likelihood evaluated at the MLE converges to the population expectation of (1/n) times the log-likelihood evaluated at the true parameters. There remains one final step.


    Does $(1/n) \ln L(\hat{\theta}) \to (1/n) \ln L(\theta_0)$ imply that $\hat{\theta} \to \theta_0$? If there is a single parameter and the likelihood function is one to one, then clearly so. For more general cases, this requires a further characterization of the likelihood function. If the likelihood is strictly continuous and twice differentiable, which we assumed in the regularity conditions, and if the parameters of the model are identified, which we assumed at the beginning of this discussion, then yes, it does, so we have the result.

    This is a heuristic proof. As noted, formal presentations appear in more advanced treatises than this one. We should also note, we have assumed at several points that sample means converged to the population expectations. This is likely to be true for the sorts of applications usually encountered in econometrics, but a fully general set of results would look more closely at this condition. Second, we have assumed iid sampling in the preceding, that is, the density for $y_i$ does not depend on any other variables, $x_i$. This will almost never be true in practice. Assumptions about the behavior of these variables will enter the proofs as well. For example, in assessing the large sample behavior of the least squares estimator, we have invoked an assumption that the data are well behaved. The same sort of consideration will apply here as well. We will return to this issue shortly. With all this in place, we have property M1, plim $\hat{\theta} = \theta_0$.
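    The likelihood inequality of Theorem 17.3 can be illustrated with an exact calculation. The sketch below assumes a Poisson density with true parameter $\theta_0 = 2$ (chosen only for this illustration) and evaluates $E_0[\ln f(y \mid \theta)]$ on an arbitrary grid of candidate values; the function names and the truncation of the sum are likewise assumptions of the sketch.

```python
import math

def poisson_log_density(y, theta):
    """ln f(y | theta) for the Poisson distribution."""
    return -theta + y * math.log(theta) - math.lgamma(y + 1)

def expected_log_density(theta, theta0, max_y=200):
    """E0[ln f(y | theta)], with the expectation taken under theta0.

    The infinite sum is truncated at max_y, which leaves a negligible
    remainder for the parameter values used here.
    """
    return sum(
        math.exp(poisson_log_density(y, theta0)) * poisson_log_density(y, theta)
        for y in range(max_y)
    )

theta0 = 2.0
grid = [0.5 + 0.1 * k for k in range(36)]  # candidate values 0.5, 0.6, ..., 4.0
best = max(grid, key=lambda t: expected_log_density(t, theta0))
print(best)  # the expected log-likelihood peaks at the true value
```

    Since (1/n) ln L is a sample mean of terms with this expectation, its maximizer is pulled toward the maximizer of the expectation as n grows, which is the heart of the consistency argument.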
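    17.4.5.b ASYMPTOTIC NORMALITY

    At the maximum likelihood estimator, the gradient of the log-likelihood equals zero (by definition), so

    $$g(\hat{\theta}) = 0.$$

    (This is the sample statistic, not the expectation.) Expand this set of equations in a second-order Taylor series around the true parameters $\theta_0$. We will use the mean value theorem to truncate the Taylor series at the second term:

    $$g(\hat{\theta}) = g(\theta_0) + H(\bar{\theta})(\hat{\theta} - \theta_0) = 0.$$

    The Hessian is evaluated at a point $\bar{\theta}$ that is between $\hat{\theta}$ and $\theta_0$ ($\bar{\theta} = w\hat{\theta} + (1 - w)\theta_0$ for some $0 < w < 1$). We then rearrange this function and multiply the result by $\sqrt{n}$ to obtain

    $$\sqrt{n}(\hat{\theta} - \theta_0) = [-H(\bar{\theta})]^{-1} [\sqrt{n} \, g(\theta_0)].$$

    Because plim$(\hat{\theta} - \theta_0) = 0$, plim$(\hat{\theta} - \bar{\theta}) = 0$ as well. The second derivatives are continuous functions. Therefore, if the limiting distribution exists, then

    $$\sqrt{n}(\hat{\theta} - \theta_0) \overset{d}{\longrightarrow} [-H(\theta_0)]^{-1} [\sqrt{n} \, g(\theta_0)].$$

    By dividing $H(\theta_0)$ and $g(\theta_0)$ by n, we obtain

    $$\sqrt{n}(\hat{\theta} - \theta_0) \overset{d}{\longrightarrow} \left[-\tfrac{1}{n} H(\theta_0)\right]^{-1} [\sqrt{n} \, \bar{g}(\theta_0)].$$

    We may apply the Lindeberg-Levy central limit theorem (D.18) to $[\sqrt{n} \, \bar{g}(\theta_0)]$, since it is $\sqrt{n}$ times the mean of a random sample; we have invoked D1 again. The limiting variance of $[\sqrt{n} \, \bar{g}(\theta_0)]$ is $-E_0[(1/n) H(\theta_0)]$, so

    $$\sqrt{n} \, \bar{g}(\theta_0) \overset{d}{\longrightarrow} N\left\{0, -E_0\left[\tfrac{1}{n} H(\theta_0)\right]\right\}.$$

    By virtue of Theorem D.2, plim$[-(1/n) H(\theta_0)] = -E_0[(1/n) H(\theta_0)]$. Since this result is a constant matrix, we can combine results to obtain

    $$\left[-\tfrac{1}{n} H(\theta_0)\right]^{-1} \sqrt{n} \, \bar{g}(\theta_0) \overset{d}{\longrightarrow} N\left[0, \left\{-E_0\left[\tfrac{1}{n} H(\theta_0)\right]\right\}^{-1} \left\{-E_0\left[\tfrac{1}{n} H(\theta_0)\right]\right\} \left\{-E_0\left[\tfrac{1}{n} H(\theta_0)\right]\right\}^{-1}\right],$$

    or

    $$\sqrt{n}(\hat{\theta} - \theta_0) \overset{d}{\longrightarrow} N\left[0, \left\{-E_0\left[\tfrac{1}{n} H(\theta_0)\right]\right\}^{-1}\right],$$

    which gives the asymptotic distribution of the MLE:

    $$\hat{\theta} \overset{a}{\sim} N[\theta_0, \{I(\theta_0)\}^{-1}].$$

    This last step completes M2.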
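    Property M2 can be examined by simulation. The sketch below is an illustration under assumptions that are not in the text: it uses the exponential density $f(y \mid \theta) = \theta e^{-\theta y}$, for which the MLE is $\hat{\theta} = 1/\bar{y}$ and $-E_0[(1/n)H(\theta_0)] = 1/\theta_0^2$, so the limiting variance of $\sqrt{n}(\hat{\theta} - \theta_0)$ should be $\theta_0^2$. The sample size, number of replications, and seed are arbitrary.

```python
import math
import random

def simulate_scaled_errors(theta0, n, reps, seed=12345):
    """Draw reps samples of size n and return sqrt(n) * (theta_hat - theta0)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        ybar = sum(rng.expovariate(theta0) for _ in range(n)) / n
        theta_hat = 1.0 / ybar  # MLE of the exponential rate parameter
        draws.append(math.sqrt(n) * (theta_hat - theta0))
    return draws

theta0, n, reps = 2.0, 200, 2000
z = simulate_scaled_errors(theta0, n, reps)
mean_z = sum(z) / reps
var_z = sum((v - mean_z) ** 2 for v in z) / reps

# The simulated variance should be close to the asymptotic value theta0**2.
print(mean_z, var_z, theta0 ** 2)
```

    The simulated mean is not exactly zero in a finite sample because this MLE is biased, which is consistent with the earlier remark that the finite sample properties of MLEs are sometimes less than optimal.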
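    Example 17.3 Information Matrix for the Normal Distribution

    For the likelihood function in Example 17.2, the second derivatives are

    $$\frac{\partial^2 \ln L}{\partial\mu^2} = \frac{-n}{\sigma^2},$$

    $$\frac{\partial^2 \ln L}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6} \sum_{i=1}^{n} (x_i - \mu)^2,$$

    $$\frac{\partial^2 \ln L}{\partial\mu \, \partial\sigma^2} = \frac{-1}{\sigma^4} \sum_{i=1}^{n} (x_i - \mu).$$

    For the asymptotic variance of the maximum likelihood estimator, we need the expectations of these derivatives. The first is nonstochastic, and the third has expectation 0, as $E[x_i] = \mu$. That leaves the second, which you can verify has expectation $-n/(2\sigma^4)$ because each of the n terms $(x_i - \mu)^2$ has expected value $\sigma^2$. Collecting these in the information matrix, reversing the sign, and inverting the matrix gives the asymptotic covariance matrix for the maximum likelihood estimators:

    $$\left\{-E_0\left[\frac{\partial^2 \ln L}{\partial\theta_0 \, \partial\theta_0'}\right]\right\}^{-1} = \begin{bmatrix} \sigma^2/n & 0 \\ 0 & 2\sigma^4/n \end{bmatrix}.$$

    17.4.5.c ASYMPTOTIC EFFICIENCY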

    Theorem C.2 provides the lower bound for the variance of an unbiased estimator. Since the asymptotic variance of the MLE achieves this bound, it seems natural to extend the result directly. There is, however, a loose end in that the MLE is almost never unbiased. As such, we need an asymptotic version of the bound, which was provided by Cramér (1948) and Rao (1945) (hence the name):

    THEOREM 17.4 Cramér-Rao Lower Bound. Assuming that the density of $y_i$ satisfies the regularity conditions R1-R3, the asymptotic variance of a consistent and asymptotically normally distributed estimator of the parameter vector $\theta_0$ will always be at least as large as

    $$[I(\theta_0)]^{-1} = \left(-E_0\left[\frac{\partial^2 \ln L(\theta_0)}{\partial\theta_0 \, \partial\theta_0'}\right]\right)^{-1} = \left(E_0\left[\left(\frac{\partial \ln L(\theta_0)}{\partial\theta_0}\right)\left(\frac{\partial \ln L(\theta_0)}{\partial\theta_0}\right)'\right]\right)^{-1}.$$


    The asymptotic variance of the MLE is, in fact, equal to the Cramér-Rao Lower Bound for the variance of a consistent estimator, so this completes the argument (see footnote 3).
    17.4.5.d INVARIANCE

    Lastly, the invariance property, M4, is a mathematical result of the method of computing MLEs; it is not a statistical result as such. More formally, the MLE is invariant to one-to-one transformations of $\theta$. Any transformation that is not one to one either renders the model inestimable if it is one to many or imposes restrictions if it is many to one. Some theoretical aspects of this feature are discussed in Davidson and MacKinnon (1993, pp. 253-255). For the practitioner, the result can be extremely useful. For example, when a parameter appears in a likelihood function in the form $1/\theta_j$, it is usually worthwhile to reparameterize the model in terms of $\gamma_j = 1/\theta_j$. In an important application, Olsen (1978) used this result to great advantage. (See Section 22.2.3.) Suppose that the normal log-likelihood in Example 17.2 is parameterized in terms of the precision parameter, $\theta^2 = 1/\sigma^2$. The log-likelihood becomes

    $$\ln L(\mu, \theta^2) = -\frac{n}{2} \ln(2\pi) + \frac{n}{2} \ln \theta^2 - \frac{\theta^2}{2} \sum_{i=1}^{n} (y_i - \mu)^2.$$

    The MLE for $\mu$ is clearly still $\bar{y}$. But the likelihood equation for $\theta^2$ is now

    $$\frac{\partial \ln L(\mu, \theta^2)}{\partial\theta^2} = \frac{1}{2}\left[\frac{n}{\theta^2} - \sum_{i=1}^{n} (y_i - \mu)^2\right] = 0,$$

    which has solution $\hat{\theta}^2 = n / \sum_{i=1}^{n} (y_i - \hat{\mu})^2 = 1/\hat{\sigma}^2$, as expected. There is a second implication. If it is desired to analyze a function of an MLE, then the function of $\hat{\theta}$ will, itself, be the MLE.
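    The invariance result for the precision parameter can be checked directly. In the sketch below the data are hypothetical and the grid search is a deliberately crude stand-in for a numerical optimizer; the point is only that maximizing over the precision gives the reciprocal of the variance MLE, up to the grid spacing.

```python
import math

def normal_loglik_precision(mu, prec, y):
    """Normal log-likelihood written in terms of the precision prec = 1/sigma^2."""
    n = len(y)
    ss = sum((v - mu) ** 2 for v in y)
    return -0.5 * n * math.log(2.0 * math.pi) + 0.5 * n * math.log(prec) - 0.5 * prec * ss

y = [2.1, 3.4, 1.9, 4.0, 2.6]  # hypothetical data for illustration only
n = len(y)
mu_hat = sum(y) / n
ss_hat = sum((v - mu_hat) ** 2 for v in y)
sigma2_hat = ss_hat / n

# Invariance: the MLE of the precision is the reciprocal of the variance MLE.
prec_by_invariance = 1.0 / sigma2_hat

# Likelihood equation for the precision, evaluated at that value.
score_at_prec = 0.5 * (n / prec_by_invariance - ss_hat)

# Direct maximization over the precision on a grid with spacing 0.01.
grid = [0.01 * k for k in range(1, 1001)]
prec_by_search = max(grid, key=lambda p: normal_loglik_precision(mu_hat, p, y))

print(prec_by_invariance, prec_by_search, score_at_prec)
```

    No second estimation is needed after a one-to-one reparameterization: the transformed estimate is obtained by applying the transformation to the original MLE.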
    17.4.5.e CONCLUSION

    These four properties explain the prevalence of the maximum likelihood technique in econometrics. The second greatly facilitates hypothesis testing and the construction of interval estimates. The third is a particularly powerful result. The MLE has the minimum variance achievable by a consistent and asymptotically normally distributed estimator.
    17 4 6 ESTIMATING THE ASYMPTOTIC VARIANCE OF THE MAXIMUM LIKELIHOOD ESTIMATOR

    The asymptotic covariance matrix of the maximum likelihood estimator is a matrix of parameters that must be estimated (that is, it is a function of the $\theta_0$ that is being estimated). If the form of the expected values of the second derivatives of the log-likelihood is known, then

    $$[I(\theta_0)]^{-1} = \left\{-E_0\left[\frac{\partial^2 \ln L(\theta_0)}{\partial\theta_0\,\partial\theta_0'}\right]\right\}^{-1} \qquad (17\text{-}16)$$

    can be evaluated at $\hat\theta$ to estimate the covariance matrix for the MLE. This estimator will rarely be available. The second derivatives of the log-likelihood will almost always be complicated nonlinear functions of the data whose exact expected values will be unknown. There are, however, two alternatives. A second estimator is

    $$[\hat{I}(\hat\theta)]^{-1} = \left(-\frac{\partial^2 \ln L(\hat\theta)}{\partial\hat\theta\,\partial\hat\theta'}\right)^{-1}. \qquad (17\text{-}17)$$

    This estimator is computed simply by evaluating the actual (not expected) second derivatives matrix of the log-likelihood function at the maximum likelihood estimates. It is straightforward to show that this amounts to estimating the expected second derivatives of the density with the sample mean of this quantity. Theorem D.4 and Result (D-5) can be used to justify the computation. The only shortcoming of this estimator is that the second derivatives can be complicated to derive and program for a computer. A third estimator, based on result D3 in Theorem 17.2, that the expected second derivatives matrix is the covariance matrix of the first derivatives vector, is

    $$[\hat{\hat{I}}(\hat\theta)]^{-1} = \left[\sum_{i=1}^{n} \hat{g}_i \hat{g}_i'\right]^{-1} = [\hat{G}'\hat{G}]^{-1}, \qquad (17\text{-}18)$$

    where

    $$\hat{g}_i = \frac{\partial \ln f(x_i, \hat\theta)}{\partial\hat\theta} \quad\text{and}\quad \hat{G} = [\hat{g}_1, \hat{g}_2, \ldots, \hat{g}_n]'.$$

    $\hat{G}$ is an $n \times K$ matrix with $i$th row equal to the transpose of the $i$th vector of derivatives in the terms of the log-likelihood function. For a single parameter, this estimator is just the reciprocal of the sum of squares of the first derivatives. This estimator is extremely convenient, in most cases, because it does not require any computations beyond those required to solve the likelihood equation. It has the added virtue that it is always nonnegative definite. For some extremely complicated log-likelihood functions, sometimes because of rounding error, the observed Hessian can be indefinite, even at the maximum of the function. The estimator in (17-18) is known as the BHHH estimator⁴ and the outer product of gradients, or OPG, estimator.

    None of the three estimators given here is preferable to the others on statistical grounds; all are asymptotically equivalent. In most cases, the BHHH estimator will be the easiest to compute. One caution is in order. As the example below illustrates, these estimators can give different results in a finite sample. This is an unavoidable finite sample problem that can, in some cases, lead to different statistical conclusions. The example is a case in point. Using the usual procedures, we would reject the hypothesis that $\beta = 0$ if either of the first two variance estimators were used, but not if the third were used. The estimator in (17-16) is usually unavailable, as the exact expectation of the Hessian is rarely known. Available evidence suggests that in small or moderate sized samples, (17-17) (the Hessian) is preferable.

    ³ A result reported by LeCam (1953) and recounted in Amemiya (1985, p. 124) suggests that, in principle, there do exist CAN functions of the data with smaller variances than the MLE. But the finding is a narrow result with no practical implications. For practical purposes, the statement may be taken as given.

    ⁴ It appears to have been advocated first in the econometrics literature in Berndt et al. (1974).


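    Example 17.4 Variance Estimators for an MLE

    The sample data in Example C.1 are generated by a model of the form

    $$f(y_i, x_i, \beta) = \frac{1}{\beta + x_i}\, e^{-y_i/(\beta + x_i)},$$

    where $y$ = income and $x$ = education. To find the maximum likelihood estimate of $\beta$, we maximize

    $$\ln L(\beta) = -\sum_{i=1}^{n} \ln(\beta + x_i) - \sum_{i=1}^{n} \frac{y_i}{\beta + x_i}.$$

    The likelihood equation is

    $$\frac{\partial \ln L(\beta)}{\partial\beta} = -\sum_{i=1}^{n} \frac{1}{\beta + x_i} + \sum_{i=1}^{n} \frac{y_i}{(\beta + x_i)^2} = 0, \qquad (17\text{-}19)$$

    which has the solution $\hat\beta = 15.602727$. To compute the asymptotic variance of the MLE, we require

    $$\frac{\partial^2 \ln L(\beta)}{\partial\beta^2} = \sum_{i=1}^{n} \frac{1}{(\beta + x_i)^2} - 2\sum_{i=1}^{n} \frac{y_i}{(\beta + x_i)^3}. \qquad (17\text{-}20)$$

    Since the function $E(y_i) = \beta + x_i$ is known, the exact form of the expected value in (17-20) is known. Inserting $\hat\beta + x_i$ for $y_i$ in (17-20) and taking the negative of the reciprocal yields the first variance estimate, 44.2546. Simply inserting $\hat\beta = 15.602727$ in (17-20) and taking the negative of the reciprocal gives the second estimate, 46.16337. Finally, by computing the reciprocal of the sum of squares of first derivatives of the densities evaluated at $\hat\beta$,

    $$[\hat{\hat{I}}(\hat\beta)]^{-1} = \frac{1}{\sum_{i=1}^{n}\left[-1/(\hat\beta + x_i) + y_i/(\hat\beta + x_i)^2\right]^2},$$

    we obtain the BHHH estimate, 100.5116.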
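    The three computations in Example 17.4 can be sketched in a few lines. The data of Example C.1 are not reproduced in this section, so the sketch below assumes simulated data from the same density (the seed, sample size, range of $x$, and true $\beta$ are hypothetical); it will therefore not reproduce the figures 44.2546, 46.16337, and 100.5116. The likelihood equation (17-19) is solved by bisection, which is one simple choice among many.

```python
import random

# Hypothetical data (not the Example C.1 sample): x plays the role of
# education and y is exponential with mean beta + x.
random.seed(1)
BETA_TRUE = 15.0
x = [random.uniform(8.0, 20.0) for _ in range(50)]
y = [random.expovariate(1.0 / (BETA_TRUE + xi)) for xi in x]


def score(b):
    """First derivative of ln L, equation (17-19)."""
    return sum(-1.0 / (b + xi) + yi / (b + xi) ** 2 for xi, yi in zip(x, y))


def hessian(b):
    """Second derivative of ln L, equation (17-20)."""
    return sum(1.0 / (b + xi) ** 2 - 2.0 * yi / (b + xi) ** 3
               for xi, yi in zip(x, y))


def solve_likelihood_equation(lo, hi, iterations=200):
    """Bisection on the score; assumes score(lo) > 0 > score(hi)."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# The density requires beta + x_i > 0, so the search starts just above -min(x).
beta_hat = solve_likelihood_equation(-min(x) + 1e-6, 1e4)

# (17-16): expected Hessian, using E[y_i] = beta + x_i in (17-20).
var_expected = 1.0 / sum(1.0 / (beta_hat + xi) ** 2 for xi in x)
# (17-17): observed Hessian.
var_hessian = -1.0 / hessian(beta_hat)
# (17-18): BHHH, reciprocal of the sum of squared first derivatives.
var_bhhh = 1.0 / sum((-1.0 / (beta_hat + xi) + yi / (beta_hat + xi) ** 2) ** 2
                     for xi, yi in zip(x, y))

print(beta_hat, var_expected, var_hessian, var_bhhh)
```

    Rerunning the sketch with different seeds or sample sizes gives a feel for how far apart the three estimates can be in small samples, even though they are asymptotically equivalent.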

    17.4.7 CONDITIONAL LIKELIHOODS AND ECONOMETRIC MODELS

    All of the preceding results form the statistical underpinnings of the technique of maximum likelihood estimation. But, for our purposes, a crucial element is missing. We have done the analysis in terms of the density of an observed random variable and a vector of parameters, $f(y_i \mid \alpha)$. But econometric models will involve exogenous or predetermined variables, $x_i$, so the results must be extended. A workable approach is to treat this modeling framework the same as the one in Chapter 5, where we considered the large sample properties of the linear regression model. Thus, we will allow $x_i$ to denote a mix of random variables and constants that enter the conditional density of $y_i$. By partitioning the joint density of $y_i$ and $x_i$ into the product of the conditional and the marginal, the log-likelihood function may be written

    $$\ln L(\alpha \mid \text{data}) = \sum_{i=1}^{n} \ln f(y_i, x_i \mid \alpha) = \sum_{i=1}^{n} \ln f(y_i \mid x_i, \alpha) + \sum_{i=1}^{n} \ln g(x_i \mid \alpha),$$

    where any nonstochastic elements in $x_i$, such as a time trend or dummy variable, are being carried as constants. In order to proceed, we will assume, as we did before, that the process generating $x_i$ takes place outside the model of interest. For present purposes, that means that the parameters that appear in $g(x_i \mid \alpha)$ do not overlap with those that appear in $f(y_i \mid x_i, \alpha)$. Thus, we partition $\alpha$ into $[\theta, \delta]$ so that the log-likelihood function may be written

    $$\ln L(\theta, \delta \mid \text{data}) = \sum_{i=1}^{n} \ln f(y_i, x_i \mid \alpha) = \sum_{i=1}^{n} \ln f(y_i \mid x_i, \theta) + \sum_{i=1}^{n} \ln g(x_i \mid \delta).$$

    As long as $\theta$ and $\delta$ have no elements in common and no restrictions connect them (such as $\theta + \delta = 1$), then the two parts of the log likelihood may be analyzed separately. In most cases, the marginal distribution of $x_i$ will be of secondary (or no) interest.

    Asymptotic results for the maximum conditional likelihood estimator must now account for the presence of $x_i$ in the functions and derivatives of $\ln f(y_i \mid x_i, \theta)$. We will proceed under the assumption of well behaved data so that sample averages such as

    $$\frac{1}{n} \ln L(\theta \mid y, X) = \frac{1}{n} \sum_{i=1}^{n} \ln f(y_i \mid x_i, \theta)$$

    and its gradient with respect to $\theta$ will converge in probability to their population expectations. We will also need to invoke central limit theorems to establish the asymptotic normality of the gradient of the log likelihood, so as to be able to characterize the MLE itself. We will leave it to more advanced treatises such as Amemiya (1985) and Newey and McFadden (1994) to establish specific conditions and fine points that must be assumed to claim the "usual" properties for maximum likelihood estimators. For present purposes (and the vast bulk of empirical applications), the following minimal assumptions should suffice:





    • Parameter space. Parameter spaces that have gaps and nonconvexities in them will generally disable these procedures. An estimation problem that produces this failure is that of "estimating" a parameter that can take only one among a discrete set of values. For example, this set of procedures does not include "estimating" the timing of a structural change in a model. (See Section 7.4.) The likelihood function must be a continuous function of a convex parameter space. We allow unbounded parameter spaces, such as $\sigma > 0$ in the regression model, for example.
    • Identifiability. Estimation must be feasible. This is the subject of Definition 17.1 concerning identification and the surrounding discussion.
    • Well behaved data. Laws of large numbers apply to sample means involving the data, and some form of central limit theorem (generally Lyapounov) can be applied to the gradient. Ergodic stationarity is broad enough to encompass any situation that is likely to arise in practice, though it is probably more general than we need for most applications, since we will not encounter dependent observations specifically until later in the book. The definitions in Chapter 5 are assumed to hold generally.

    With these in place, analysis is essentially the same in character as that we used in the linear regression model in Chapter 5 and follows precisely along the lines of Section 16.5.


    17.5 THREE ASYMPTOTICALLY EQUIVALENT TEST PROCEDURES

    The next several sections will discuss the most commonly used test procedures: the likelihood ratio, Wald, and Lagrange multiplier tests. [Extensive discussion of these procedures is given in Godfrey (1988).] We consider maximum likelihood estimation of a parameter $\theta$ and a test of the hypothesis $H_0\colon c(\theta) = 0$. The logic of the tests can be seen in Figure 17.2.⁵ The figure plots the log-likelihood function $\ln L(\theta)$, its derivative with respect to $\theta$, $d \ln L(\theta)/d\theta$, and the constraint $c(\theta)$. There are three approaches to testing the hypothesis suggested in the figure:





    • Likelihood ratio test. If the restriction $c(\theta) = 0$ is valid, then imposing it should not lead to a large reduction in the log-likelihood function. Therefore, we base the test on the difference, $\ln L_U - \ln L_R$, where $L_U$ is the value of the likelihood function at the unconstrained value of $\theta$ and $L_R$ is the value of the likelihood function at the restricted estimate.
    • Wald test. If the restriction is valid, then $c(\hat\theta_{MLE})$ should be close to zero since the MLE is consistent. Therefore, the test is based on $c(\hat\theta_{MLE})$. We reject the hypothesis if this value is significantly different from zero.
    • Lagrange multiplier test. If the restriction is valid, then the restricted estimator should be near the point that maximizes the log-likelihood. Therefore, the slope of the log-likelihood function should be near zero at the restricted estimator. The test is based on the slope of the log-likelihood at the point where the function is maximized subject to the restriction.

    These three tests are asymptotically equivalent under the null hypothesis, but they can behave rather differently in a small sample. Unfortunately, their small-sample properties are unknown, except in a few special cases. As a consequence, the choice among them is typically made on the basis of ease of computation. The likelihood ratio test requires calculation of both restricted and unrestricted estimators. If both are simple to compute, then this way to proceed is convenient. The Wald test requires only the unrestricted estimator, and the Lagrange multiplier test requires only the restricted estimator. In some problems, one of these estimators may be much easier to compute than the other. For example, a linear model is simple to estimate but becomes nonlinear and cumbersome if a nonlinear constraint is imposed. In this case, the Wald statistic might be preferable. Alternatively, restrictions sometimes amount to the removal of nonlinearities, which would make the Lagrange multiplier test the simpler procedure.
    17.5.1 THE LIKELIHOOD RATIO TEST

    Let $\theta$ be a vector of parameters to be estimated, and let $H_0$ specify some sort of restriction on these parameters. Let $\hat\theta_U$ be the maximum likelihood estimator of $\theta$ obtained without regard to the constraints, and let $\hat\theta_R$ be the constrained maximum likelihood estimator. If $\hat{L}_U$ and $\hat{L}_R$ are the likelihood functions evaluated at these two estimates, then the likelihood ratio is

    $$\lambda = \frac{\hat{L}_R}{\hat{L}_U}. \qquad (17\text{-}21)$$

    This function must be between zero and one. Both likelihoods are positive, and $\hat{L}_R$ cannot be larger than $\hat{L}_U$. (A restricted optimum is never superior to an unrestricted one.) If $\lambda$ is too small, then doubt is cast on the restrictions.

    FIGURE 17.2 Three Bases for Hypothesis Tests. [Figure not reproduced. It plots $\ln L(\theta)$, $d \ln L(\theta)/d\theta$, and $c(\theta)$ against $\theta$, marks $\hat\theta_R$ and $\hat\theta_{MLE}$, and labels the likelihood ratio, Lagrange multiplier, and Wald distances.]

    An example from a discrete distribution helps to fix these ideas. In estimating from a sample of 10 from a Poisson distribution at the beginning of Section 17.3, we found the MLE of the parameter $\theta$ to be 2. At this value, the likelihood, which is the probability of observing the sample we did, is $0.104 \times 10^{-8}$. Are these data consistent with $H_0\colon \theta = 1.8$? $L_R = 0.936 \times 10^{-9}$, which is, as expected, smaller. This particular sample is somewhat less probable under the hypothesis.

    The formal test procedure is based on the following result.

    ⁵ See Buse (1982). Note that the scale of the vertical axis would be different for each curve. As such, the points of intersection have no significance.

    THEOREM 17.5 Limiting Distribution of the Likelihood Ratio Test Statistic
    Under regularity and under $H_0$, the large sample distribution of $-2 \ln \lambda$ is chi-squared, with degrees of freedom equal to the number of restrictions imposed.

    The null hypothesis is rejected if this value exceeds the appropriate critical value from the chi-squared tables. Thus, for the Poisson example,

    $$-2 \ln \lambda = -2 \ln\left(\frac{0.0936}{0.104}\right) = 0.21072.$$

    This chi-squared statistic with one degree of freedom is not significant at any conventional level, so we would not reject the hypothesis that $\theta = 1.8$ on the basis of this test.⁶

    It is tempting to use the likelihood ratio test to test a simple null hypothesis against a simple alternative. For example, we might be interested in the Poisson setting in testing $H_0\colon \theta = 1.8$ against $H_1\colon \theta = 2.2$. But the test cannot be used in this fashion. The degrees of freedom of the chi-squared statistic for the likelihood ratio test equals the reduction in the number of dimensions in the parameter space that results from imposing the restrictions. In testing a simple null hypothesis against a simple alternative, this value is zero.⁷ Second, one sometimes encounters an attempt to test one distributional assumption against another with a likelihood ratio test; for example, a certain model will be estimated assuming a normal distribution and then assuming a t distribution. The ratio of the two likelihoods is then compared to determine which distribution is preferred. This comparison is also inappropriate. The parameter spaces, and hence the likelihood functions of the two cases, are unrelated.
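    The Poisson likelihood ratio computation can be sketched directly from the sufficient statistics. The sample itself is not listed in this section, so the sketch assumes only what the text states: $n = 10$ and $\hat\theta = 2$, which implies $\sum_i y_i = 20$ because the Poisson MLE is the sample mean. The factorial terms cancel in the ratio and are omitted. The result differs slightly from 0.21072 because the text works from likelihood values rounded to three digits.

```python
import math

n = 10                  # sample size stated in the text
theta_hat = 2.0         # unrestricted MLE (the sample mean)
theta_0 = 1.8           # value of theta under H0
sum_y = n * theta_hat   # implied by theta_hat = sum(y) / n


def poisson_loglik(theta):
    """Poisson log likelihood up to the additive constant -sum(ln y_i!)."""
    return -n * theta + sum_y * math.log(theta)


# -2 ln(lambda) = -2 [ln L_R - ln L_U]
lr_stat = -2.0 * (poisson_loglik(theta_0) - poisson_loglik(theta_hat))

# For one degree of freedom, P(chi-squared > w) = erfc(sqrt(w / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2.0))

print(round(lr_stat, 4), round(p_value, 3))
```

    The p-value is far above any conventional significance level, in line with the conclusion that the hypothesis $\theta = 1.8$ is not rejected.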
    17.5.2 THE WALD TEST

    A practical shortcoming of the likelihood ratio test is that it usually requires estimation of both the restricted and unrestricted parameter vectors. In complex models, one or the other of these estimates may be very difficult to compute. Fortunately, there are two alternative testing procedures, the Wald test and the Lagrange multiplier test, that circumvent this problem. Both tests are based on an estimator that is asymptotically normally distributed.

    ⁶ Of course, our use of the large-sample result in a sample of 10 might be questionable.

    ⁷ Note that because both likelihoods are restricted in this instance, there is nothing to prevent $-2 \ln \lambda$ from being negative.


    These two tests are based on the distribution of the full rank quadratic form considered in Section B.11.6. Specifically,

    $$\text{If } x \sim N_J[\mu, \Sigma], \text{ then } (x - \mu)'\Sigma^{-1}(x - \mu) \sim \text{chi-squared}[J]. \qquad (17\text{-}22)$$

    In the setting of a hypothesis test, under the hypothesis that $E(x) = \mu$, the quadratic form has the chi-squared distribution. If the hypothesis that $E(x) = \mu$ is false, however, then the quadratic form just given will, on average, be larger than it would be if the hypothesis were true.⁸ This condition forms the basis for the test statistics discussed in this and the next section.

    Let $\hat\theta$ be the vector of parameter estimates obtained without restrictions. We hypothesize a set of restrictions

    $$H_0\colon c(\theta) = q.$$

    If the restrictions are valid, then at least approximately $\hat\theta$ should satisfy them. If the hypothesis is erroneous, however, then $c(\hat\theta) - q$ should be farther from 0 than would be explained by sampling variability alone. The device we use to formalize this idea is the Wald test.

    THEOREM 17.6 Limiting Distribution of the Wald Test Statistic
    The Wald statistic is

    $$W = [c(\hat\theta) - q]'\left(\text{Asy.Var}[c(\hat\theta) - q]\right)^{-1}[c(\hat\theta) - q].$$

    Under $H_0$, in large samples, $W$ has a chi-squared distribution with degrees of freedom equal to the number of restrictions [i.e., the number of equations in $c(\hat\theta) - q = 0$]. A derivation of the limiting distribution of the Wald statistic appears in Theorem 6.15.

    This test is analogous to the chi-squared statistic in (17-22) if $c(\hat\theta) - q$ is normally distributed with the hypothesized mean of 0. A large value of $W$ leads to rejection of the hypothesis. Note, finally, that $W$ only requires computation of the unrestricted model. One must still compute the covariance matrix appearing in the preceding quadratic form. This result is the variance of a possibly nonlinear function, which we treated earlier:

    $$\text{Est.Asy.Var}[c(\hat\theta) - q] = \hat{C}\,\text{Est.Asy.Var}[\hat\theta]\,\hat{C}', \qquad \hat{C} = \left[\frac{\partial c(\hat\theta)}{\partial\hat\theta'}\right]. \qquad (17\text{-}23)$$

    That is, $C$ is the $J \times K$ matrix whose $j$th row is the derivatives of the $j$th constraint with respect to the $K$ elements of $\theta$. A common application occurs in testing a set of linear restrictions.

    ⁸ If the mean is not $\mu$, then the statistic in (17-22) will have a noncentral chi-squared distribution. This distribution has the same basic shape as the central chi-squared distribution, with the same degrees of freedom, but lies to the right of it. Thus, a random draw from the noncentral distribution will tend, on average, to be larger than a random observation from the central distribution.


    For testing a set of linear restrictions $R\theta = q$, the Wald test would be based on

    $$H_0\colon c(\theta) - q = R\theta - q = 0,$$
    $$\hat{C} = \left[\frac{\partial c(\hat\theta)}{\partial\hat\theta'}\right] = R, \qquad (17\text{-}24)$$
    $$\text{Est.Asy.Var}[c(\hat\theta) - q] = R\,\text{Est.Asy.Var}[\hat\theta]\,R',$$

    and

    $$W = [R\hat\theta - q]'\left[R\,\text{Est.Asy.Var}(\hat\theta)\,R'\right]^{-1}[R\hat\theta - q].$$

    The degrees of freedom is the number of rows in $R$.

    If $c(\theta) - q$ is a single restriction, then the Wald test will be the same as the test based on the confidence interval developed previously. If the test is

    $$H_0\colon \theta = \theta_0 \quad\text{versus}\quad H_1\colon \theta \neq \theta_0,$$

    then the earlier test is based on

    $$z = \frac{|\hat\theta - \theta_0|}{s(\hat\theta)}, \qquad (17\text{-}25)$$

    where $s(\hat\theta)$ is the estimated asymptotic standard error. The test statistic is compared to the appropriate value from the standard normal table. The Wald test will be based on

    $$W = [(\hat\theta - \theta_0) - 0]\left(\text{Asy.Var}[(\hat\theta - \theta_0) - 0]\right)^{-1}[(\hat\theta - \theta_0) - 0] = \frac{(\hat\theta - \theta_0)^2}{\text{Asy.Var}[\hat\theta]} = z^2. \qquad (17\text{-}26)$$

    Here $W$ has a chi-squared distribution with one degree of freedom, which is the distribution of the square of the standard normal test statistic in (17-25).

    To summarize, the Wald test is based on measuring the extent to which the unrestricted estimates fail to satisfy the hypothesized restrictions. There are two shortcomings of the Wald test. First, it is a pure significance test against the null hypothesis, not necessarily for a specific alternative hypothesis. As such, its power may be limited in some settings. In fact, the test statistic tends to be rather large in applications. The second shortcoming is not shared by either of the other test statistics discussed here. The Wald statistic is not invariant to the formulation of the restrictions. For example, for a test of the hypothesis that a function $\theta = \beta/(1 - \gamma)$ equals a specific value $q$, there are two approaches one might choose. A Wald test based directly on $\theta - q = 0$ would use a statistic based on the variance of this nonlinear function. An alternative approach would be to analyze the linear restriction $\beta - q(1 - \gamma) = 0$, which is an equivalent, but linear, restriction. The Wald statistics for these two tests could be different and might lead to different inferences. These two shortcomings have been widely viewed as compelling arguments against use of the Wald test. But, in its favor, the Wald test does not rely on a strong distributional assumption, as do the likelihood ratio and Lagrange multiplier tests. The recent econometrics literature is replete with applications that are based on distribution free estimation procedures, such as the GMM method. As such, in recent years, the Wald test has enjoyed a redemption of sorts.
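    The lack of invariance of the Wald statistic can be illustrated with a small numerical sketch. The estimates and covariance matrix below are hypothetical numbers chosen for illustration only; the variance of each restriction is computed with the delta method in (17-23), specialized to a single restriction on two parameters.

```python
# Hypothetical estimates and covariance matrix (illustrative numbers only).
b, g = 0.5, 0.6                        # estimates of beta and gamma
v_bb, v_bg, v_gg = 0.01, 0.002, 0.01   # Est.Asy.Var of (b, g)
q = 1.0                                # hypothesized value of beta / (1 - gamma)


def wald(c, grad):
    """Single-restriction Wald statistic c^2 / (grad' V grad), as in (17-23)."""
    d_b, d_g = grad
    var_c = d_b * d_b * v_bb + 2.0 * d_b * d_g * v_bg + d_g * d_g * v_gg
    return c * c / var_c


# Formulation 1: beta / (1 - gamma) - q = 0, with its nonlinear gradient.
w_nonlinear = wald(b / (1 - g) - q, (1 / (1 - g), b / (1 - g) ** 2))

# Formulation 2: beta - q (1 - gamma) = 0, the equivalent linear restriction.
w_linear = wald(b - q * (1 - g), (1.0, q))

print(w_nonlinear, w_linear)
```

    Both formulations describe the same hypothesis, yet the two statistics differ. With other hypothetical inputs the gap can be large enough to put the statistics on opposite sides of a critical value.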


    17.5.3 THE LAGRANGE MULTIPLIER TEST

    The third test procedure is the Lagrange multiplier (LM) or efficient score (or just score) test. It is based on the restricted model instead of the unrestricted model. Suppose that we maximize the log-likelihood subject to the set of constraints $c(\theta) - q = 0$. Let $\lambda$ be a vector of Lagrange multipliers and define the Lagrangean function

    $$\ln L^*(\theta) = \ln L(\theta) + \lambda'[c(\theta) - q].$$

    The solution to the constrained maximization problem is the root of

    $$\frac{\partial \ln L^*}{\partial\theta} = \frac{\partial \ln L(\theta)}{\partial\theta} + C'\lambda = 0,$$
    $$\frac{\partial \ln L^*}{\partial\lambda} = c(\theta) - q = 0, \qquad (17\text{-}27)$$

    where $C'$ is the transpose of the derivatives matrix in the second line of (17-23). If the restrictions are valid, then imposing them will not lead to a significant difference in the maximized value of the likelihood function. In the first-order conditions, the meaning is that the second term in the derivative vector will be small. In particular, $\lambda$ will be small. We could test this directly, that is, test $H_0\colon \lambda = 0$, which leads to the Lagrange multiplier test. There is an equivalent simpler formulation, however. At the restricted maximum, the derivatives of the log-likelihood function are

    $$\frac{\partial \ln L(\hat\theta_R)}{\partial\hat\theta_R} = -\hat{C}'\hat\lambda = \hat{g}_R. \qquad (17\text{-}28)$$

    If the restrictions are valid, at least within the range of sampling variability, then $\hat{g}_R = 0$. That is, the derivatives of the log-likelihood evaluated at the restricted parameter vector will be approximately zero. The vector of first derivatives of the log-likelihood is the vector of efficient scores. Since the test is based on this vector, it is called the score test as well as the Lagrange multiplier test. The variance of the first derivative vector is the information matrix, which we have used to compute the asymptotic covariance matrix of the MLE. The test statistic is based on reasoning analogous to that underlying the Wald test statistic.

    THEOREM 17.7 Limiting Distribution of the Lagrange Multiplier Statistic
    The Lagrange multiplier test statistic is

    $$LM = \left(\frac{\partial \ln L(\hat\theta_R)}{\partial\hat\theta_R}\right)'[I(\hat\theta_R)]^{-1}\left(\frac{\partial \ln L(\hat\theta_R)}{\partial\hat\theta_R}\right).$$

    Under the null hypothesis, LM has a limiting chi-squared distribution with degrees of freedom equal to the number of restrictions. All terms are computed at the restricted estimator.


    The LM statistic has a useful form. Let $\hat{g}_{iR}$ denote the $i$th term in the gradient of the log-likelihood function. Then,

    $$\hat{g}_R = \sum_{i=1}^{n} \hat{g}_{iR} = \hat{G}_R' i,$$

    where $\hat{G}_R$ is the $n \times K$ matrix with $i$th row equal to $\hat{g}_{iR}'$ and $i$ is a column of 1s. If we use the BHHH (outer product of gradients) estimator in (17-18) to estimate the Hessian, then

    $$[\hat{I}(\hat\theta)]^{-1} = [\hat{G}_R'\hat{G}_R]^{-1}$$

    and

    $$LM = i'\hat{G}_R[\hat{G}_R'\hat{G}_R]^{-1}\hat{G}_R' i.$$

    Now, since $i'i$ equals $n$, $LM = n\left(i'\hat{G}_R[\hat{G}_R'\hat{G}_R]^{-1}\hat{G}_R' i / n\right) = nR_i^2$, which is $n$ times the uncentered squared multiple correlation coefficient in a linear regression of a column of 1s on the derivatives of the log-likelihood function computed at the restricted estimator. We will encounter this result in various forms at several points in the book.
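    The $nR^2$ form of the LM statistic can be verified in the single-parameter case, where the matrix algebra reduces to sums. The sketch below assumes simulated data from the exponential density of Example 17.4 and a hypothetical simple null hypothesis $H_0\colon \beta = \beta_0$, so that the restricted estimate is $\beta_0$ itself; the seed, sample size, and $\beta_0$ are illustrative choices.

```python
import random

# Hypothetical data: y exponential with mean beta + x.
random.seed(2)
x = [random.uniform(8.0, 20.0) for _ in range(40)]
y = [random.expovariate(1.0 / (15.0 + xi)) for xi in x]
n = len(y)

beta_0 = 10.0  # hypothetical restricted value under H0: beta = beta_0

# Per-observation scores g_i evaluated at the restricted estimate.
g = [-1.0 / (beta_0 + xi) + yi / (beta_0 + xi) ** 2 for xi, yi in zip(x, y)]

# LM with the BHHH estimate: i'G (G'G)^{-1} G'i, which is scalar when K = 1.
sum_g = sum(g)
sum_g2 = sum(gi * gi for gi in g)
lm = sum_g ** 2 / sum_g2

# Regression of a column of 1s on g with no constant term.
slope = sum_g / sum_g2
fitted = [slope * gi for gi in g]
r2_uncentered = sum(f * f for f in fitted) / n   # denominator is i'i = n

print(lm, n * r2_uncentered)
```

    The two printed values agree up to rounding error, which is the identity $LM = nR_i^2$. With $K > 1$ the same regression applies, with the columns of $\hat{G}_R$ as regressors.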
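    17.5.4 AN APPLICATION OF THE LIKELIHOOD BASED TEST PROCEDURES

    Consider, again, the data in Example C.1. In Example 17.4, the parameter $\beta$ in the model

    $$f(y_i, x_i, \beta) = \frac{1}{\beta + x_i}\, e^{-y_i/(\beta + x_i)} \qquad (17\text{-}29)$$

    was estimated by maximum likelihood. For convenience, let $\beta_i = 1/(\beta + x_i)$. This exponential density is a restricted form of a more general gamma distribution,

    $$f(y_i, x_i, \beta, \rho) = \frac{\beta_i^{\rho}}{\Gamma(\rho)}\, y_i^{\rho - 1} e^{-y_i \beta_i}. \qquad (17\text{-}30)$$

    The restriction is $\rho = 1$.⁹ We consider testing the hypothesis

    $$H_0\colon \rho = 1 \quad\text{versus}\quad H_1\colon \rho \neq 1$$

    using the various procedures described previously. The log-likelihood and its derivatives are

    $$\ln L(\beta, \rho) = \rho\sum_{i=1}^{n} \ln\beta_i - n \ln\Gamma(\rho) + (\rho - 1)\sum_{i=1}^{n} \ln y_i - \sum_{i=1}^{n} y_i\beta_i,$$
    $$\frac{\partial \ln L}{\partial\beta} = -\rho\sum_{i=1}^{n}\beta_i + \sum_{i=1}^{n} y_i\beta_i^2, \qquad \frac{\partial \ln L}{\partial\rho} = \sum_{i=1}^{n}\ln\beta_i - n\Psi(\rho) + \sum_{i=1}^{n}\ln y_i, \qquad (17\text{-}31)$$
    $$\frac{\partial^2 \ln L}{\partial\beta^2} = \rho\sum_{i=1}^{n}\beta_i^2 - 2\sum_{i=1}^{n} y_i\beta_i^3, \qquad \frac{\partial^2 \ln L}{\partial\rho^2} = -n\Psi'(\rho), \qquad \frac{\partial^2 \ln L}{\partial\beta\,\partial\rho} = -\sum_{i=1}^{n}\beta_i.$$

    ⁹ The gamma function $\Gamma(\cdot)$ and the gamma distribution are described in Sections B.4.5 and E.5.3.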


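    TABLE 17.1 Maximum Likelihood Estimates

    Quantity                                      Unrestricted Estimate (a)   Restricted Estimate
    $\beta$                                       -4.7198 (2.344)             15.6052 (6.794)
    $\rho$                                         3.1517 (0.7943)             1.0000 (0.000)
    $\ln L$                                       -82.91444                   -88.43771
    $\partial\ln L/\partial\beta$                  0.0000                      0.0000
    $\partial\ln L/\partial\rho$                   0.0000                      7.9162
    $\partial^2\ln L/\partial\beta^2$             -0.85628                    -0.021659
    $\partial^2\ln L/\partial\rho^2$              -7.4569                     -32.8987
    $\partial^2\ln L/\partial\beta\,\partial\rho$ -2.2423                     -0.66885

    (a) Estimated asymptotic standard errors based on V are given in parentheses.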

    [Recall that $\Psi(\rho) = d \ln\Gamma(\rho)/d\rho$ and $\Psi'(\rho) = d^2 \ln\Gamma(\rho)/d\rho^2$.] Unrestricted maximum likelihood estimates of $\beta$ and $\rho$ are obtained by equating the two first derivatives to zero. The restricted maximum likelihood estimate of $\beta$ is obtained by equating $\partial\ln L/\partial\beta$ to zero while fixing $\rho$ at one. The results are shown in Table 17.1. Three estimators are available for the asymptotic covariance matrix of the estimators of $\theta = (\beta, \rho)'$. Using the actual Hessian as in (17-17), we compute $V = [-\partial^2 \ln L/\partial\theta\,\partial\theta']^{-1}$ at the maximum likelihood estimates. For this model, it is easy to show that $E[y_i \mid x_i] = \rho(\beta + x_i)$ (either by direct integration or, more simply, by using the result that $E[\partial\ln L/\partial\beta] = 0$ to deduce it). Therefore, we can also use the expected Hessian as in (17-16) to compute $V_E = \{-E[\partial^2 \ln L/\partial\theta\,\partial\theta']\}^{-1}$. Finally, by using the sums of squares and cross products of the first derivatives, we obtain the BHHH estimator in (17-18), $V_B = [\sum_i \hat{g}_i\hat{g}_i']^{-1}$. Results in Table 17.1 are based on $V$.

    The three estimators of the asymptotic covariance matrix produce notably different results:

    $$V = \begin{bmatrix} 5.495 & -1.652 \\ -1.652 & 0.6309 \end{bmatrix}, \qquad V_E = \begin{bmatrix} 4.897 & -1.473 \\ -1.473 & 0.5770 \end{bmatrix}, \qquad V_B = \begin{bmatrix} 13.35 & -4.314 \\ -4.314 & 1.535 \end{bmatrix}.$$

    Given the small sample size, the differences are to be expected. Nonetheless, the striking difference of the BHHH estimator is typical of its erratic performance in small samples.

    • Confidence Interval Test: A 95 percent confidence interval for $\rho$ based on the unrestricted estimates is $3.1517 \pm 1.96\sqrt{0.6309} = [1.5942, 4.7085]$. This interval does not contain $\rho = 1$, so the hypothesis is rejected.
    • Likelihood Ratio Test: The LR statistic is $-2 \ln\lambda = -2[-88.43771 - (-82.91444)] = 11.0465$. The table value for the test, with one degree of freedom, is 3.842. Since the computed value is larger than this critical value, the hypothesis is again rejected.
    • Wald Test: The Wald test is based on the unrestricted estimates. For this restriction, $c(\theta) - q = \rho - 1$, $dc(\hat\rho)/d\hat\rho = 1$, and $\text{Est.Asy.Var}[c(\hat\rho) - q] = \text{Est.Asy.Var}[\hat\rho] = 0.6309$, so $W = (3.1517 - 1)^2/[0.6309] = 7.3384$. The critical value is the same as the previous one. Hence, $H_0$ is once again rejected. Note that the Wald statistic is the square of the corresponding test statistic that would be used in the confidence interval test, $|3.1517 - 1|/\sqrt{0.6309} = 2.70895$.




    • Lagrange Multiplier Test: The Lagrange multiplier test is based on the restricted estimators. The estimated asymptotic covariance matrix of the derivatives used to compute the statistic can be any of the three estimators discussed earlier. The BHHH estimator, $V_B$, is the empirical estimator of the variance of the gradient and is the one usually used in practice. This computation produces

    $$LM = \begin{bmatrix} 0.0000 & 7.9162 \end{bmatrix} \begin{bmatrix} 0.0099438 & 0.26762 \\ 0.26762 & 11.197 \end{bmatrix}^{-1} \begin{bmatrix} 0.0000 \\ 7.9162 \end{bmatrix} = 15.687.$$

    The conclusion is the same as before. Note that the same computation done using $V$ rather than $V_B$ produces a value of 5.1182. As before, we observe substantial small sample variation produced by the different estimators.

    The latter three test statistics have substantially different values. It is possible to reach different conclusions, depending on which one is used. For example, if the test had been carried out at the 1 percent level of significance instead of 5 percent and LM had been computed using $V$, then the critical value from the chi-squared statistic would have been 6.635 and the hypothesis would not have been rejected by the LM test. Asymptotically, all three tests are equivalent. But, in a finite sample such as this one, differences are to be expected.¹⁰ Unfortunately, there is no clear rule for how to proceed in such a case, which highlights the problem of relying on a particular significance level and drawing a firm reject or accept conclusion based on sample evidence.
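    The test statistics in this application can be recomputed from the quantities reported in Table 17.1 and the text. The sketch below uses only those reported numbers (with the restricted-estimate second derivatives entered as the negative Hessian); the helper `quad_form_inv` is a hypothetical name introduced here for the 2 x 2 quadratic form. Small discrepancies in the last digit reflect rounding of the published inputs.

```python
import math

# Values reported in Table 17.1 and in the text.
lnL_u, lnL_r = -82.91444, -88.43771
rho_hat, var_rho = 3.1517, 0.6309

lr = -2.0 * (lnL_r - lnL_u)                 # likelihood ratio statistic
wald = (rho_hat - 1.0) ** 2 / var_rho       # Wald statistic
z = (rho_hat - 1.0) / math.sqrt(var_rho)    # confidence-interval z ratio


def quad_form_inv(gvec, m):
    """Return g' M^{-1} g for a symmetric 2x2 matrix M."""
    (a, b), (_, d) = m
    det = a * d - b * b
    g1, g2 = gvec
    return (d * g1 * g1 - 2.0 * b * g1 * g2 + a * g2 * g2) / det


score_r = (0.0, 7.9162)   # gradient at the restricted estimates

# LM using the BHHH matrix of summed outer products of the scores.
lm_bhhh = quad_form_inv(score_r, ((0.0099438, 0.26762), (0.26762, 11.197)))

# LM using the negative of the actual Hessian at the restricted estimates.
lm_hess = quad_form_inv(score_r, ((0.021659, 0.66885), (0.66885, 32.8987)))

print(lr, wald, z, lm_bhhh, lm_hess)
```

    Comparing each value with the 5 percent and 1 percent critical values for one degree of freedom (3.842 and 6.635) reproduces the conclusions in the text, including the one case in which the Hessian-based LM statistic fails to reject at the 1 percent level.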

    17.6 APPLICATIONS OF MAXIMUM LIKELIHOOD ESTIMATION

    We now examine three applications of the maximum likelihood estimator. The first extends the results of Chapters 2 through 5 to the linear regression model with normally distributed disturbances. In the second application, we fit a nonlinear regression model by maximum likelihood. This application illustrates the effect of transformation of the dependent variable. The third application is a relatively straightforward use of the maximum likelihood technique in a nonlinear model that does not involve the normal distribution. This application illustrates the sorts of extensions of the MLE into settings that depart from the linear model of the preceding chapters and that are typical in econometric analysis.
    17.6.1 THE NORMAL LINEAR REGRESSION MODEL

    The linear regression model is

    $$y_i = x_i'\beta + \varepsilon_i.$$

    The likelihood function for a sample of $n$ independent, identically and normally distributed disturbances is

    $$L = (2\pi\sigma^2)^{-n/2} e^{-\varepsilon'\varepsilon/(2\sigma^2)}. \qquad (17\text{-}32)$$

    ¹⁰ For further discussion of this problem, see Berndt and Savin (1977).


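    The transformation from $\varepsilon_i$ to $y_i$ is $\varepsilon_i = y_i - x_i'\beta$, so the Jacobian for each observation, $|\partial\varepsilon_i/\partial y_i|$, is one.¹¹ Making the transformation, we find that the likelihood function for the $n$ observations on the observed random variable is

    $$L = (2\pi\sigma^2)^{-n/2} e^{(-1/(2\sigma^2))(y - X\beta)'(y - X\beta)}. \qquad (17\text{-}33)$$

    To maximize this function with respect to $\beta$, it will be necessary to maximize the exponent or minimize the familiar sum of squares. Taking logs, we obtain the log-likelihood function for the classical regression model:

    $$\ln L = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}. \qquad (17\text{-}34)$$

    The necessary conditions for maximizing this log-likelihood are

    $$\begin{bmatrix} \dfrac{\partial \ln L}{\partial\beta} \\[2ex] \dfrac{\partial \ln L}{\partial\sigma^2} \end{bmatrix} = \begin{bmatrix} \dfrac{X'(y - X\beta)}{\sigma^2} \\[2ex] \dfrac{-n}{2\sigma^2} + \dfrac{(y - X\beta)'(y - X\beta)}{2\sigma^4} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \qquad (17\text{-}35)$$

    The values that satisfy these equations are

    $$\hat\beta_{ML} = (X'X)^{-1}X'y = b \quad\text{and}\quad \hat\sigma^2_{ML} = \frac{e'e}{n}. \qquad (17\text{-}36)$$

    The slope estimator is the familiar one, whereas the variance estimator differs from the least squares value by the divisor of $n$ instead of $n - K$.¹²

    The Cramér-Rao bound for the variance of an unbiased estimator is the negative inverse of the expectation of

    $$\begin{bmatrix} \dfrac{\partial^2 \ln L}{\partial\beta\,\partial\beta'} & \dfrac{\partial^2 \ln L}{\partial\beta\,\partial\sigma^2} \\[2ex] \dfrac{\partial^2 \ln L}{\partial\sigma^2\,\partial\beta'} & \dfrac{\partial^2 \ln L}{\partial(\sigma^2)^2} \end{bmatrix} = \begin{bmatrix} -\dfrac{X'X}{\sigma^2} & -\dfrac{X'\varepsilon}{\sigma^4} \\[2ex] -\dfrac{\varepsilon'X}{\sigma^4} & \dfrac{n}{2\sigma^4} - \dfrac{\varepsilon'\varepsilon}{\sigma^6} \end{bmatrix}. \qquad (17\text{-}37)$$

    In taking expected values, the off-diagonal term vanishes, leaving

    $$[I(\beta, \sigma^2)]^{-1} = \begin{bmatrix} \sigma^2(X'X)^{-1} & 0 \\ 0' & 2\sigma^4/n \end{bmatrix}. \qquad (17\text{-}38)$$

    The least squares slope estimator is the maximum likelihood estimator for this model. Therefore, it inherits all the desirable asymptotic properties of maximum likelihood estimators.

    We showed earlier that $s^2 = e'e/(n - K)$ is an unbiased estimator of $\sigma^2$. Therefore, the maximum likelihood estimator is biased toward zero:

    $$E[\hat\sigma^2_{ML}] = \frac{n - K}{n}\sigma^2 = \left(1 - \frac{K}{n}\right)\sigma^2 < \sigma^2. \qquad (17\text{-}39)$$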
    ¹¹ See (B-41) in Section B.5. The analysis to follow is conditioned on $X$. To avoid cluttering the notation, we will leave this aspect of the model implicit in the results. As noted earlier, we assume that the data generating process for $X$ does not involve $\beta$ or $\sigma^2$ and that the data are well behaved as discussed in Chapter 5.

    ¹² As a general rule, maximum likelihood estimators do not make corrections for degrees of freedom.
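    The relationship between $\hat\sigma^2_{ML}$ in (17-36) and the degrees-of-freedom corrected $s^2$ can be seen in a small sketch. It assumes simulated data for a regression on a constant and one regressor, so $K = 2$; the coefficients, error variance, seed, and sample size are hypothetical choices.

```python
import random

# Hypothetical data: y = 1 + 2 x + eps, eps ~ N(0, 1.5^2).
random.seed(3)
n, K = 30, 2
x = [random.uniform(0.0, 10.0) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 1.5) for xi in x]

# Least squares (equivalently, ML) coefficients for a constant and one regressor.
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Residual sum of squares e'e and the two variance estimators.
ee = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
sigma2_ml = ee / n          # equation (17-36)
s2 = ee / (n - K)           # unbiased least squares estimator

print(sigma2_ml, s2, sigma2_ml / s2)   # the ratio equals 1 - K/n
```

    The ratio of the two estimators is exactly $1 - K/n$, so the downward bias of the MLE in (17-39) shrinks as $n$ grows with $K$ fixed.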


    Despite its small sample bias the maximum likelihood estimator of 2 has the same desirable asymptotic properties We see in 17 39 that s 2 and 2 differ only by a factor K n which vanishes in large samples It is instructive to formalize the asymptotic equivalence of the two From 17 38 we know that d n ML 2 N 0 2 4 2 It follows zn 1 K K d n ML 2 2 2 n n 1 K K N 0 2 4 2 n n

    But K n and K n vanish as n so the limiting distribution of zn is also N 0 2 4 Since zn n s 2 2 we have shown that the asymptotic distribution of s 2 is the same as that of the maximum likelihood estimator The standard test statistic for assessing the validity of a set of linear restrictions in the linear model R q 0 is the F ratio F J n K e e e e J Rb q Rs 2 X X 1 R 1 Rb q e e n K J

    With normally distributed disturbances, the $F$ test is valid in any sample size. There remains a problem with nonlinear restrictions of the form $c(\beta) = 0$, since the counterpart to $F$, which we will examine here, has validity only asymptotically even with normally distributed disturbances. In this section, we will reconsider the Wald statistic and examine two related statistics, the likelihood ratio statistic and the Lagrange multiplier statistic. These statistics are both based on the likelihood function and, like the Wald statistic, are generally valid only asymptotically.

    No simplicity is gained by restricting ourselves to linear restrictions at this point, so we will consider general hypotheses of the form

    $$H_0: c(\beta) = 0, \qquad H_1: c(\beta) \neq 0.$$

    The Wald statistic for testing this hypothesis and its limiting distribution under $H_0$ would be

    $$W = c(b)'\{C(b)[\hat\sigma^2 (X'X)^{-1}]C(b)'\}^{-1} c(b) \xrightarrow{d} \chi^2[J],$$ (17-40)

    where

    $$C(b) = \partial c(b)/\partial b'.$$ (17-41)

    The likelihood ratio (LR) test is carried out by comparing the values of the log-likelihood function with and without the restrictions imposed. We leave aside for the present how the restricted estimator $b_*$ is computed (except for the linear model, which we saw earlier). The test statistic and its limiting distribution under $H_0$ are

    $$LR = -2[\ln L_* - \ln L] \xrightarrow{d} \chi^2[J].$$ (17-42)

    The log-likelihood for the regression model is given in (17-34). The first-order conditions imply that regardless of how the slopes are computed, the estimator of $\sigma^2$ without


    restrictions on $\beta$ will be $\hat\sigma^2 = (y - Xb)'(y - Xb)/n$ and likewise for a restricted estimator $\hat\sigma_*^2 = (y - Xb_*)'(y - Xb_*)/n = e_*'e_*/n$. The concentrated log-likelihood[13] will be

    $$\ln L_c = -\frac{n}{2}\left[1 + \ln 2\pi + \ln(e'e/n)\right]$$

    and likewise for the restricted case. If we insert these in the definition of LR, then we obtain

    $$LR = n \ln[e_*'e_*/e'e] = n(\ln\hat\sigma_*^2 - \ln\hat\sigma^2) = n \ln(\hat\sigma_*^2/\hat\sigma^2).$$ (17-43)

    The Lagrange multiplier (LM) test is based on the gradient of the log-likelihood function. The principle of the test is that if the hypothesis is valid, then at the restricted estimator, the derivatives of the log-likelihood function should be close to zero. There are two ways to carry out the LM test. The log-likelihood function can be maximized subject to a set of restrictions by using

    $$\ln L_{LM} = -\frac{n}{2}\left[\ln 2\pi + \ln\sigma^2 + \frac{(y - X\beta)'(y - X\beta)/n}{\sigma^2}\right] + \lambda' c(\beta).$$

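    The first-order conditions for a solution are

    $$\begin{bmatrix} \partial \ln L_{LM}/\partial\beta \\ \partial \ln L_{LM}/\partial\sigma^2 \\ \partial \ln L_{LM}/\partial\lambda \end{bmatrix} = \begin{bmatrix} \dfrac{X'(y - X\beta)}{\sigma^2} + C(\beta)'\lambda \\ -\dfrac{n}{2\sigma^2} + \dfrac{(y - X\beta)'(y - X\beta)}{2\sigma^4} \\ c(\beta) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}.$$ (17-44)

    The solutions to these equations give the restricted least squares estimator, $b_*$; the usual variance estimator, now $e_*'e_*/n$; and the Lagrange multipliers. There are now two ways to compute the test statistic. In the setting of the classical linear regression model, when we actually compute the Lagrange multipliers, a convenient way to proceed is to test the hypothesis that the multipliers equal zero. For this model, the solution for $\lambda_*$ is $\lambda_* = [R(X'X)^{-1}R']^{-1}(Rb - q)$. This equation is a linear function of the least squares estimator. If we carry out a Wald test of the hypothesis that $\lambda_*$ equals 0, then the statistic will be

    $$LM = \lambda_*'\{\text{Est.Var}[\lambda_*]\}^{-1}\lambda_* = (Rb - q)'[R s_*^2 (X'X)^{-1} R']^{-1}(Rb - q).$$ (17-45)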

    The disturbance variance estimator, $s_*^2$, based on the restricted slopes is $e_*'e_*/n$. An alternative way to compute the LM statistic often produces interesting results. In most situations, we maximize the log-likelihood function without actually computing the vector of Lagrange multipliers. (The restrictions are usually imposed some other way.) An alternative way to compute the statistic is based on the general result that under the hypothesis being tested,

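    $$E[\partial \ln L/\partial\beta] = E[(1/\sigma^2) X'\varepsilon] = 0$$

    and

    $$\text{Asy.Var}[\partial \ln L/\partial\beta] = -E[\partial^2 \ln L/\partial\beta\,\partial\beta'] = \frac{1}{\sigma^2} X'X,$$ (17-46)

    the inverse of which is $\sigma^2 (X'X)^{-1}$.[14]

    [13] See Section E.6.3.

    [14] This makes use of the fact that the Hessian is block diagonal.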


    We can test the hypothesis that at the restricted estimator, the derivatives are equal to zero. The statistic would be

    $$LM = \frac{e_*'X(X'X)^{-1}X'e_*}{e_*'e_*/n} = nR_*^2.$$ (17-47)

    In this form, the LM statistic is $n$ times the coefficient of determination in a regression of the residuals $e_{i*} = (y_i - x_i'b_*)$ on the full set of regressors. With some manipulation we can show that $W = [n/(n - K)]JF$ and LR and LM are approximately equal to this function of $F$.[15] All three statistics converge to $JF$ as $n$ increases. The linear model is a special case in that the LR statistic is based only on the unrestricted estimator and does not actually require computation of the restricted least squares estimator, although computation of $F$ does involve most of the computation of $b_*$. Since the log function is concave, and $W/n \ge \ln(1 + W/n)$, Godfrey (1988) also shows that $W \ge LR \ge LM$, so for the linear model, we have a firm ranking of the three statistics.

    There is ample evidence that the asymptotic results for these statistics are problematic in small or moderately sized samples. [See, e.g., Davidson and MacKinnon (1993, pp. 456-457).] The true distributions of all three statistics involve the data and the unknown parameters and, as suggested by the algebra, converge to the $F$ distribution from above. The implication is that critical values from the chi-squared distribution are likely to be too small; that is, using the limiting chi-squared distribution in small or moderately sized samples is likely to exaggerate the significance of empirical results. Thus, in applications, the more conservative $F$ statistic (or $t$ for one restriction) is likely to be preferable unless one's data are plentiful.
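    For linear restrictions in the linear model, all three statistics can be written in terms of the restricted and unrestricted sums of squared residuals. The sketch below is illustrative only; the function name `wald_lr_lm` and the numerical inputs are hypothetical. It relies on the identity $(Rb - q)'[R(X'X)^{-1}R']^{-1}(Rb - q) = e_*'e_* - e'e$ implied by the two forms of the $F$ ratio, together with $\hat\sigma^2 = e'e/n$ in (17-40) and $s_*^2 = e_*'e_*/n$ in (17-45).

```python
import math


def wald_lr_lm(ssr_restricted, ssr_unrestricted, n, k, j):
    """F, Wald, LR, and LM for J linear restrictions in the linear model."""
    diff = ssr_restricted - ssr_unrestricted
    f = (diff / j) / (ssr_unrestricted / (n - k))
    wald = n * diff / ssr_unrestricted                    # [n/(n - K)] * J * F
    lr = n * math.log(ssr_restricted / ssr_unrestricted)  # (17-43)
    lm = n * diff / ssr_restricted                        # (17-45), (17-47)
    return f, wald, lr, lm


# Hypothetical sums of squares, for illustration only.
f, w, lr, lm = wald_lr_lm(ssr_restricted=12.0, ssr_unrestricted=10.0,
                          n=50, k=4, j=2)
```

    In this form, $LR = n \ln(1 + W/n)$ and $LM = W/(1 + W/n)$, so the ranking $W \ge LR \ge LM$ follows directly from the concavity of the log function.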
    17.6.2 MAXIMUM LIKELIHOOD ESTIMATION OF NONLINEAR REGRESSION MODELS

    In Chapter 9, we considered nonlinear regression models in which the nonlinearity in the parameters appeared entirely on the right-hand side of the equation. There are models in which parameters appear nonlinearly in functions of the dependent variable as well. Suppose that, in general, the model is

    $$g(y_i, \theta) = h(x_i, \beta) + \varepsilon_i.$$

    One approach to estimation would be least squares, minimizing

    $$S(\theta, \beta) = \sum_{i=1}^n [g(y_i, \theta) - h(x_i, \beta)]^2.$$

    [15] See Godfrey (1988, pp. 49-51).

    There is no reason to expect this nonlinear least squares estimator to be consistent, however, though it is difficult to show this analytically. The problem is that nonlinear least squares ignores the Jacobian of the transformation. Davidson and MacKinnon (1993, p. 244) suggest a qualitative argument, which we can illustrate with an example. Suppose $y$ is positive, $g(y, \theta) = \exp(\theta y)$ and $h(x, \beta) = \beta x$. In this case, an obvious "solution" is


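    $\beta = 0$ and $\theta = -\infty$, which produces a sum of squares of zero. "Estimation" becomes a nonissue. For this type of regression model, however, maximum likelihood estimation is consistent, efficient, and generally not appreciably more difficult than least squares. For normally distributed disturbances, the density of $y_i$ is

    $$f(y_i) = \left|\frac{\partial\varepsilon_i}{\partial y_i}\right| (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{[g(y_i, \theta) - h(x_i, \beta)]^2}{2\sigma^2}\right\}.$$

    The Jacobian of the transformation [see (3-41)] is

    $$J(y_i, \theta) = \left|\frac{\partial\varepsilon_i}{\partial y_i}\right| = \left|\frac{\partial g(y_i, \theta)}{\partial y_i}\right| = J_i.$$

    After collecting terms, the log-likelihood function will be

    $$\ln L = \sum_{i=1}^n -\frac{1}{2}[\ln 2\pi + \ln\sigma^2] + \sum_{i=1}^n \ln J(y_i, \theta) - \frac{\sum_{i=1}^n [g(y_i, \theta) - h(x_i, \beta)]^2}{2\sigma^2}.$$ (17-48)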

    In many cases, including the applications considered here, there is an inconsistency in the model in that the transformation of the dependent variable may rule out some values. Hence, the assumed normality of the disturbances cannot be strictly correct. In the generalized production function, there is a singularity at $y_i = 0$ where the Jacobian becomes infinite. Some research has been done on specific modifications of the model to accommodate the restriction [e.g., Poirier (1978) and Poirier and Melino (1978)], but in practice, the typical application involves data for which the constraint is inconsequential.

    But for the Jacobians, nonlinear least squares would be maximum likelihood. If the Jacobian terms involve $\theta$, however, then least squares is not maximum likelihood. As regards $\sigma^2$, this likelihood function is essentially the same as that for the simpler nonlinear regression model. The maximum likelihood estimator of $\sigma^2$ will be

    $$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n [g(y_i, \hat\theta) - h(x_i, \hat\beta)]^2 = \frac{1}{n}\sum_{i=1}^n e_i^2.$$ (17-49)

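    The likelihood equations for the unknown parameters are

    $$\begin{bmatrix} \dfrac{\partial \ln L}{\partial\beta} \\ \dfrac{\partial \ln L}{\partial\theta} \\ \dfrac{\partial \ln L}{\partial\sigma^2} \end{bmatrix} = \begin{bmatrix} \dfrac{1}{\sigma^2}\displaystyle\sum_{i=1}^n \varepsilon_i \dfrac{\partial h(x_i, \beta)}{\partial\beta} \\ \displaystyle\sum_{i=1}^n \dfrac{1}{J_i}\dfrac{\partial J_i}{\partial\theta} - \dfrac{1}{\sigma^2}\displaystyle\sum_{i=1}^n \varepsilon_i \dfrac{\partial g(y_i, \theta)}{\partial\theta} \\ -\dfrac{n}{2\sigma^2} + \dfrac{1}{2\sigma^4}\displaystyle\sum_{i=1}^n \varepsilon_i^2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}.$$ (17-50)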

    These equations will usually be nonlinear, so a solution must be obtained iteratively. One special case that is common is a model in which $\theta$ is a single parameter. Given a particular value of $\theta$, we would maximize $\ln L$ with respect to $\beta$ by using nonlinear least squares. [It would be simpler yet if, in addition, $h(x_i, \beta)$ were linear so that we could use linear least squares. See the following application.] Therefore, a way to maximize $L$ for all the parameters is to scan over values of $\theta$ for the one that, with the associated least squares estimates of $\beta$ and $\sigma^2$, gives the highest value of $\ln L$. (Of course, this requires that we know roughly what values of $\theta$ to examine.)


    If $\theta$ is a vector of parameters, then direct maximization of $L$ with respect to the full set of parameters may be preferable. (Methods of maximization are discussed in Appendix E.) There is an additional simplification that may be useful. Whatever values are ultimately obtained for the estimates of $\theta$ and $\beta$, the estimate of $\sigma^2$ will be given by (17-49). If we insert this solution in (17-48), then we obtain the concentrated log-likelihood,

    $$\ln L_c = \sum_{i=1}^n \ln J(y_i, \theta) - \frac{n}{2}[1 + \ln(2\pi)] - \frac{n}{2}\ln\left[\frac{1}{n}\sum_{i=1}^n \varepsilon_i^2\right].$$ (17-51)

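    This equation is a function only of $\theta$ and $\beta$. We can maximize it with respect to $\theta$ and $\beta$ and obtain the estimate of $\sigma^2$ as a by-product. (See Section E.6.3 for details.)

    An estimate of the asymptotic covariance matrix of the maximum likelihood estimators can be obtained by inverting the estimated information matrix. It is quite likely, however, that the Berndt et al. (1974) estimator will be much easier to compute. The log of the density for the $i$th observation is the $i$th term in (17-48). The derivatives of $\ln L_i$ with respect to the unknown parameters are

    $$g_i = \begin{bmatrix} \partial \ln L_i/\partial\beta \\ \partial \ln L_i/\partial\theta \\ \partial \ln L_i/\partial\sigma^2 \end{bmatrix} = \begin{bmatrix} (\varepsilon_i/\sigma^2)[\partial h(x_i, \beta)/\partial\beta] \\ (1/J_i)[\partial J_i/\partial\theta] - (\varepsilon_i/\sigma^2)[\partial g(y_i, \theta)/\partial\theta] \\ (1/(2\sigma^2))[\varepsilon_i^2/\sigma^2 - 1] \end{bmatrix}.$$ (17-52)

    The asymptotic covariance matrix for the maximum likelihood estimators is estimated using

    $$\text{Est.Asy.Var}[\text{MLE}] = \left[\sum_{i=1}^n \hat g_i \hat g_i'\right]^{-1} = (\hat G'\hat G)^{-1}.$$ (17-53)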

    Note that the preceding includes a row and a column for $\sigma^2$ in the covariance matrix. In a model that transforms $y$ as well as $x$, the Hessian of the log-likelihood is generally not block diagonal with respect to $\theta$ and $\sigma^2$. When $y$ is transformed, the maximum likelihood estimators of $\theta$ and $\sigma^2$ are positively correlated, because both parameters reflect the scaling of the dependent variable in the model. This result may seem counterintuitive. Consider the difference in the variance estimators that arises when a linear and a loglinear model are estimated. The variance of $\ln y$ around its mean is obviously different from that of $y$ around its mean. By contrast, consider what happens when only the independent variables are transformed, for example, by the Box-Cox transformation. The slope estimators vary accordingly, but in such a way that the variance of $y$ around its conditional mean will stay constant.[16]

    [16] See Seaks and Layson (1983).

    Example 17.5 A Generalized Production Function

    The Cobb-Douglas function has often been used to study production and cost. Among the assumptions of this model is that the average cost of production increases or decreases monotonically with increases in output. This assumption is in direct contrast to the standard textbook treatment of a U-shaped average cost curve as well as to a large amount of empirical evidence. (See Example 7.3 for a well-known application.) To relax this assumption, Zellner


    TABLE 17.2  Generalized Production Function Estimates

                        Maximum Likelihood                 Nonlinear
                   Estimate      SE(1)       SE(2)       Least Squares
    $\beta_1$      2.914822     0.44912     0.12534        2.108925
    $\beta_2$      0.350068     0.10019     0.094354       0.257900
    $\beta_3$      1.092275     0.16070     0.11498        0.878388
    $\theta$       0.106666     0.078702                   0.031634
    $\sigma^2$     0.0427427                               0.0151167
    $\varepsilon'\varepsilon$   1.068567                   0.7655490
    $\ln L$       -8.939044                              -13.621256

    and Revankar (1970) proposed a generalization of the Cobb-Douglas production function.[17] Their model allows economies of scale to vary with output and to increase and then decrease as output rises:

    $$\ln y + \theta y = \ln\gamma + \alpha(1 - \delta)\ln K + \alpha\delta\ln L + \varepsilon.$$

    Note that the right-hand side of their model is intrinsically linear according to the results of Section 7.3.3. The model as a whole, however, is intrinsically nonlinear due to the parametric transformation of $y$ appearing on the left. For Zellner and Revankar's production function, the Jacobian of the transformation from $\varepsilon_i$ to $y_i$ is $\partial\varepsilon_i/\partial y_i = (\theta + 1/y_i)$. Some simplification is achieved by writing this as $(1 + \theta y_i)/y_i$. The log-likelihood is then

    $$\ln L = \sum_{i=1}^n \ln(1 + \theta y_i) - \sum_{i=1}^n \ln y_i - \frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n \varepsilon_i^2,$$

    where $\varepsilon_i = (\ln y_i + \theta y_i - \beta_1 - \beta_2 \ln \text{capital}_i - \beta_3 \ln \text{labor}_i)$. Estimation of this model is straightforward. For a given value of $\theta$, $\beta$ and $\sigma^2$ are estimated by linear least squares. Therefore, to estimate the full set of parameters, we could scan over the range of zero to one for $\theta$. The value of $\theta$ that, with its associated least squares estimates of $\beta$ and $\sigma^2$, maximizes the log-likelihood function provides the maximum likelihood estimate. This procedure was used by Zellner and Revankar. The results given in Table 17.2 were obtained by maximizing the log-likelihood function directly, instead. The statewide data on output, capital, labor, and number of establishments in the transportation industry used in Zellner and Revankar's study are given in Appendix Table F9.2 and Example 16.6. For this application, $y$ = value added per firm, $K$ = capital per firm, and $L$ = labor per firm.

    Maximum likelihood and nonlinear least squares estimates are shown in Table 17.2. The asymptotic standard errors for the maximum likelihood estimates are labeled SE(1). These are computed using the BHHH form of the asymptotic covariance matrix. The second set, SE(2), are computed treating the estimate of $\theta$ as fixed; they are the usual linear least squares results using $(\ln y + \theta y)$ as the dependent variable in a linear regression. Clearly, these results would be very misleading. The final column of Table 17.2 lists the simple nonlinear least squares estimates. No standard errors are given, because there is no appropriate formula for computing the asymptotic covariance matrix. The sum of squares does not provide an appropriate method for computing the pseudoregressors for the parameters in the transformation. The last two rows of the table display the sum of squares and the log-likelihood function evaluated at the parameter estimates. As expected, the log-likelihood is much larger at the maximum likelihood estimates. In contrast, the nonlinear least squares estimates lead to a much lower sum of squares; least squares is still least squares.
    [17] An alternative approach is to model costs directly with a flexible functional form such as the translog model. This approach is examined in detail in Chapter 14.
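    The scanning procedure described in this example can be sketched as follows. This is a minimal illustration under stated assumptions, not the computation behind Table 17.2: the helper names (`ols`, `concentrated_loglik`, `scan_theta`) and the eight data points are hypothetical, and the criterion is the concentrated log-likelihood (17-51) with the Jacobian term $\sum \ln[(1 + \theta y_i)/y_i]$ included.

```python
import math


def ols(X, y):
    """Least squares coefficients from the normal equations (small K only)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(row[p] * row[q] for row in X) for q in range(k)]
         + [sum(row[p] * yi for row, yi in zip(X, y))]
         for p in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(k):
            if r != c:
                m = a[r][c] / a[c][c]
                a[r] = [u - m * v for u, v in zip(a[r], a[c])]
    return [a[p][k] / a[p][p] for p in range(k)]


def concentrated_loglik(theta, y, X):
    """Concentrated log-likelihood for ln y + theta*y = x'beta + eps."""
    n = len(y)
    g = [math.log(yi) + theta * yi for yi in y]  # transformed dependent variable
    b = ols(X, g)                                # beta given theta
    ssr = sum((gi - sum(bj * xj for bj, xj in zip(b, row))) ** 2
              for gi, row in zip(g, X))
    jac = sum(math.log(1.0 + theta * yi) - math.log(yi) for yi in y)
    value = jac - 0.5 * n * (1.0 + math.log(2.0 * math.pi) + math.log(ssr / n))
    return value, b


def scan_theta(y, X, grid):
    """Return the grid value of theta with the highest concentrated log-likelihood."""
    return max(grid, key=lambda t: concentrated_loglik(t, y, X)[0])


# Hypothetical data (not the Zellner-Revankar sample), for illustration only.
capital = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
labor = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
y = [1.2, 1.5, 2.9, 3.1, 4.8, 4.6, 6.9, 6.5]
X = [[1.0, math.log(kv), math.log(lv)] for kv, lv in zip(capital, labor)]
grid = [i / 100 for i in range(101)]
theta_hat = scan_theta(y, X, grid)
```

    Dropping the `jac` term from the criterion would turn the scan into nonlinear least squares, which is exactly the estimator the text warns against.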


    Example 17.6 An LM Test for Log-Linearity

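    A natural generalization of the Box-Cox regression model (Section 9.3.2) is

    $$y^{(\lambda)} = \beta' x^{(\lambda)} + \varepsilon,$$ (17-54)

    where $z^{(\lambda)} = (z^\lambda - 1)/\lambda$. This form includes the linear ($\lambda = 1$) and loglinear ($\lambda = 0$) models as special cases. The Jacobian of the transformation is $|d\varepsilon/dy| = y^{\lambda - 1}$. The log-likelihood function for the model with normally distributed disturbances is

    $$\ln L = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 + (\lambda - 1)\sum_{i=1}^n \ln y_i - \frac{1}{2\sigma^2}\sum_{i=1}^n \left(y_i^{(\lambda)} - \beta' x_i^{(\lambda)}\right)^2.$$ (17-55)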

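    The MLEs of $\lambda$ and $\beta$ are computed by maximizing this function. The estimator of $\sigma^2$ is the mean squared residual as usual. We can use a one-dimensional grid search over $\lambda$; for a given value of $\lambda$, the MLE of $\beta$ is least squares using the transformed data. It must be remembered, however, that the criterion function includes the Jacobian term. We will use the BHHH estimator of the asymptotic covariance matrix for the maximum likelihood estimator. The derivatives of the log-likelihood are

    $$\begin{bmatrix} \partial \ln L/\partial\beta \\ \partial \ln L/\partial\lambda \\ \partial \ln L/\partial\sigma^2 \end{bmatrix} = \sum_{i=1}^n \begin{bmatrix} \dfrac{\varepsilon_i x_i^{(\lambda)}}{\sigma^2} \\ \ln y_i - \dfrac{\varepsilon_i}{\sigma^2}\left[\dfrac{\partial y_i^{(\lambda)}}{\partial\lambda} - \displaystyle\sum_{k=1}^K \beta_k \dfrac{\partial x_{ik}^{(\lambda)}}{\partial\lambda}\right] \\ \dfrac{1}{2\sigma^2}\left(\dfrac{\varepsilon_i^2}{\sigma^2} - 1\right) \end{bmatrix} = \sum_{i=1}^n g_i,$$ (17-56)

    where

    $$\frac{\partial [z^{(\lambda)}]}{\partial\lambda} = \frac{\lambda z^\lambda \ln z - (z^\lambda - 1)}{\lambda^2} = \frac{1}{\lambda}\left(z^\lambda \ln z - z^{(\lambda)}\right).$$ (17-57)

    (See Exercise 6 in Chapter 9.) The estimator of the asymptotic covariance matrix for the maximum likelihood estimator is given in (17-53).

    The Box-Cox model provides a framework for a specification test of linearity versus loglinearity. To assemble this result, consider first the basic model

    $$y = f(x, \beta_1, \beta_2, \lambda) + \varepsilon = \beta_1 + \beta_2 x^{(\lambda)} + \varepsilon.$$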
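    The pseudoregressors are $x_1^* = 1$, $x_2^* = x^{(\lambda)}$, $x_3^* = \beta_2(\partial x^{(\lambda)}/\partial\lambda)$ as given above. We now consider a Lagrange multiplier test of the hypothesis that $\lambda$ equals zero. The test is carried out by first regressing $y$ on a constant and $\ln x$ (i.e., the regressor evaluated at $\lambda = 0$) and then computing $nR_*^2$ in the regression of the residuals from this first regression on $x_1^*$, $x_2^*$, and $x_3^*$, also evaluated at $\lambda = 0$. The first and second of these are 1 and $\ln x$. To obtain the third, we require $x_3^*|_{\lambda = 0} = \beta_2 \lim_{\lambda \to 0}(\partial x^{(\lambda)}/\partial\lambda)$. Applying L'Hôpital's rule to the right-hand side of (17-57), differentiate numerator and denominator with respect to $\lambda$. This produces

    $$\lim_{\lambda \to 0} \frac{\partial x^{(\lambda)}}{\partial\lambda} = \lim_{\lambda \to 0}\left[x^\lambda (\ln x)^2 - \frac{\partial x^{(\lambda)}}{\partial\lambda}\right] = \frac{1}{2}\lim_{\lambda \to 0} x^\lambda (\ln x)^2 = \frac{1}{2}(\ln x)^2.$$

    Therefore, $\lim_{\lambda \to 0} x_3^* = \beta_2\left[\tfrac{1}{2}\ln^2 x\right]$. The Lagrange multiplier test is carried out in two steps. First, we regress $y$ on a constant and $\ln x$ and compute the residuals. Second, we regress these residuals on a constant, $\ln x$, and $b_2\left(\tfrac{1}{2}\ln^2 x\right)$, where $b_2$ is the coefficient on $\ln x$ in the first regression. The Lagrange multiplier statistic is $nR^2$ from the second regression. To generalize this procedure to several regressors, we would use the logs of all the regressors at the first step. Then, the additional regressor for the second regression would be

    $$x_\lambda^* = \sum_{k=1}^K b_k\left(\tfrac{1}{2}\ln^2 x_k\right),$$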


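    where the sum is taken over all the variables that are transformed in the original model and the $b_k$'s are the least squares coefficients in the first regression.

    By extending this process to the model of (17-54), we can devise a bona fide test of log-linearity (against the more general model, not linearity). [See Davidson and MacKinnon (1985).] (A test of linearity can be conducted using $\lambda = 1$, instead.) Computing the various terms at $\lambda = 0$ again, we have

    $$\hat\varepsilon_i = \ln y_i - \hat\beta_1 - \hat\beta_2 \ln x_i,$$

    where, as before, $\hat\beta_1$ and $\hat\beta_2$ are computed by the least squares regression of $\ln y$ on a constant and $\ln x$. Let $\hat\varepsilon_i^* = \tfrac{1}{2}\ln^2 y_i - \hat\beta_2\left(\tfrac{1}{2}\ln^2 x_i\right)$. Then

    $$\hat g_i = \begin{bmatrix} \hat\varepsilon_i/\hat\sigma^2 \\ (\ln x_i)\hat\varepsilon_i/\hat\sigma^2 \\ \ln y_i - \hat\varepsilon_i \hat\varepsilon_i^*/\hat\sigma^2 \\ (\hat\varepsilon_i^2/\hat\sigma^2 - 1)/(2\hat\sigma^2) \end{bmatrix}.$$

    If there are $K$ regressors in the model, then the second component in $\hat g_i$ will be a vector containing the logs of the variables, whereas $\hat\varepsilon_i^*$ in the third becomes

    $$\hat\varepsilon_i^* = \tfrac{1}{2}\ln^2 y_i - \sum_{k=1}^K \hat\beta_k\left(\tfrac{1}{2}\ln^2 x_{ik}\right).$$

    Using the Berndt et al. estimator given in (17-53), we can now construct the Lagrange multiplier statistic as

    $$LM = \chi^2[1] = \left[\sum_{i=1}^n \hat g_i\right]'\left[\sum_{i=1}^n \hat g_i \hat g_i'\right]^{-1}\left[\sum_{i=1}^n \hat g_i\right] = i'G(G'G)^{-1}G'i,$$

    where $G$ is the $n \times (K + 2)$ matrix whose columns are $g_1$ through $g_{K+2}$ and $i$ is a column of 1s. The usefulness of this approach for either of the models we have examined is that in testing the hypothesis, it is not necessary to compute the nonlinear, unrestricted, Box-Cox regression.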
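    The two-step $nR^2$ computation described in this example, for the single-regressor model $y = \beta_1 + \beta_2 x^{(\lambda)} + \varepsilon$ and the hypothesis $\lambda = 0$, can be sketched as below. The helper names (`ols`, `fit`, `lm_loglinearity`) and the data are hypothetical and serve only to illustrate the mechanics; no Box-Cox estimation is needed at any point.

```python
import math


def ols(X, y):
    """Least squares coefficients from the normal equations (small K only)."""
    k = len(X[0])
    a = [[sum(row[p] * row[q] for row in X) for q in range(k)]
         + [sum(row[p] * yi for row, yi in zip(X, y))]
         for p in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[piv] = a[piv], a[c]
        for r in range(k):
            if r != c:
                m = a[r][c] / a[c][c]
                a[r] = [u - m * v for u, v in zip(a[r], a[c])]
    return [a[p][k] / a[p][p] for p in range(k)]


def fit(X, y):
    """Return (coefficients, residuals, R-squared) for a regression with a constant."""
    b = ols(X, y)
    resid = [yi - sum(bj * xj for bj, xj in zip(b, row)) for yi, row in zip(y, X)]
    ybar = sum(y) / len(y)
    sst = sum((yi - ybar) ** 2 for yi in y)
    return b, resid, 1.0 - sum(e * e for e in resid) / sst


def lm_loglinearity(y, x):
    """n * R^2 statistic for H0: lambda = 0 (one transformed regressor)."""
    lx = [math.log(v) for v in x]
    b, resid, _ = fit([[1.0, l] for l in lx], y)           # step 1: y on 1, ln x
    X2 = [[1.0, l, b[1] * 0.5 * l * l] for l in lx]        # 1, ln x, b2 * (1/2) ln^2 x
    _, _, r2 = fit(X2, resid)                              # step 2: residuals on X2
    return len(y) * r2


# Hypothetical data, roughly linear in x, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.1, 2.3, 3.2, 3.9, 5.2, 5.8, 7.1, 7.9]
lm_stat = lm_loglinearity(y, x)
```

    The resulting statistic would be compared with a chi-squared critical value with one degree of freedom.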
    17.6.3 NONNORMAL DISTURBANCES: THE STOCHASTIC FRONTIER MODEL

    This final application will examine a regressionlike model in which the disturbances do not have a normal distribution. The model developed here also presents a convenient platform on which to illustrate the use of the invariance property of maximum likelihood estimators to simplify the estimation of the model.

    A lengthy literature commencing with theoretical work by Knight (1933), Debreu (1951), and Farrell (1957) and the pioneering empirical study by Aigner, Lovell, and Schmidt (1977) has been directed at models of production that specifically account for the textbook proposition that a production function is a theoretical ideal.[18]

    [18] A survey by Greene (1997b) appears in Pesaran and Schmidt (1997). Kumbhakar and Lovell (2000) is a comprehensive reference on the subject.

    If $y = f(x)$ defines a production relationship between inputs, $x$, and an output, $y$, then for any given $x$, the observed value of $y$ must be less than or equal to $f(x)$. The implication for an empirical regression model is that in a formulation such as $y = h(x, \beta) + u$, $u$ must be negative. Since the theoretical production function is an ideal, the frontier of efficient


    production, any nonzero disturbance must be interpreted as the result of inefficiency. A strictly orthodox interpretation embedded in a Cobb-Douglas production model might produce an empirical frontier production model such as

    $$\ln y = \beta_1 + \sum_k \beta_k \ln x_k - u, \qquad u \ge 0.$$

    The gamma model described in Example 5.1 was an application. One-sided disturbances such as this one present a particularly difficult estimation problem. The primary theoretical problem is that any measurement error in $\ln y$ must be embedded in the disturbance. The practical problem is that the entire estimated function becomes a slave to any single errantly measured data point.

    Aigner, Lovell, and Schmidt proposed instead a formulation within which observed deviations from the production function could arise from two sources: (1) productive inefficiency as we have defined it above and that would necessarily be negative; and (2) idiosyncratic effects that are specific to the firm and that could enter the model with either sign. The end result was what they labeled the "stochastic frontier":

    $$\ln y = \beta_1 + \sum_k \beta_k \ln x_k - u + v, \qquad u \ge 0, \quad v \sim N[0, \sigma_v^2],$$

    $$\phantom{\ln y} = \beta_1 + \sum_k \beta_k \ln x_k + \varepsilon.$$

    The frontier for any particular firm is $h(x, \beta) + v$, hence the name stochastic frontier. The inefficiency term is $u$, a random variable of particular interest in this setting. Since the data are in log terms, $u$ is a measure of the percentage by which the particular observation fails to achieve the frontier, ideal production rate.

    To complete the specification, they suggested two possible distributions for the inefficiency term, the absolute value of a normally distributed variable and an exponentially distributed variable. The density functions for these two compound distributions are given by Aigner, Lovell, and Schmidt; let $\varepsilon = v - u$, $\lambda = \sigma_u/\sigma_v$, $\sigma = (\sigma_u^2 + \sigma_v^2)^{1/2}$, and $\Phi(z)$ = the probability to the left of $z$ in the standard normal distribution [see Sections B.4.1 and E.5.6]. For the half-normal model,

    $$\ln h(\varepsilon_i \mid \beta, \lambda, \sigma) = -\ln\sigma + \frac{1}{2}\ln\frac{2}{\pi} - \frac{1}{2}\left(\frac{\varepsilon_i}{\sigma}\right)^2 + \ln\Phi\left(\frac{-\varepsilon_i\lambda}{\sigma}\right),$$

    whereas for the exponential model

    $$\ln h(\varepsilon_i \mid \beta, \theta, \sigma_v) = \ln\theta + \frac{1}{2}\theta^2\sigma_v^2 + \theta\varepsilon_i + \ln\Phi\left(-\frac{\varepsilon_i}{\sigma_v} - \theta\sigma_v\right).$$

    Both these distributions are asymmetric. We thus have a regression model with a nonnormal distribution specified for the disturbance. The disturbance, $\varepsilon$, has a nonzero mean as well; $E[\varepsilon] = -\sigma_u(2/\pi)^{1/2}$ for the half-normal model and $-1/\theta$ for the exponential model. Figure 17.3 illustrates the density for the half-normal model with $\sigma = 1$ and $\lambda = 2$. By writing $\beta_0 = \beta_1 + E[\varepsilon]$ and $\varepsilon^* = \varepsilon - E[\varepsilon]$, we obtain a more conventional formulation

    $$\ln y = \beta_0 + \sum_k \beta_k \ln x_k + \varepsilon^*,$$

    which does have a disturbance with a zero mean but an asymmetric, nonnormal distribution. The asymmetry of the distribution of $\varepsilon^*$ does not negate our basic results for least squares in this classical regression model. This model satisfies the assumptions of the


    [FIGURE 17.3 Density for the Disturbance in the Stochastic Frontier Model. Panel title: Probability Density for the Stochastic Frontier; vertical axis: Density.]

    Gauss-Markov theorem, so least squares is unbiased and consistent (save for the constant term), and efficient among linear unbiased estimators. In this model, however, the maximum likelihood estimator is not linear, and it is more efficient than least squares.

    We will work through maximum likelihood estimation of the half-normal model in detail to illustrate the technique. The log-likelihood is

    $$\ln L = -n\ln\sigma + \frac{n}{2}\ln\frac{2}{\pi} - \frac{1}{2}\sum_{i=1}^n \left(\frac{\varepsilon_i}{\sigma}\right)^2 + \sum_{i=1}^n \ln\Phi\left(\frac{-\varepsilon_i\lambda}{\sigma}\right).$$
    This is not a particularly difficult log-likelihood to maximize numerically. Nonetheless, it is instructive to make use of a convenience that we noted earlier. Recall that maximum likelihood estimators are invariant to one-to-one transformation. If we let $\theta = 1/\sigma$ and $\gamma = (1/\sigma)\beta$, the log-likelihood function becomes

    $$\ln L = n\ln\theta + \frac{n}{2}\ln\frac{2}{\pi} - \frac{1}{2}\sum_{i=1}^n (\theta y_i - \gamma' x_i)^2 + \sum_{i=1}^n \ln\Phi[-\lambda(\theta y_i - \gamma' x_i)].$$
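    The invariance argument can be checked numerically: evaluating the half-normal log-likelihood at $(\beta, \sigma, \lambda)$ and at the transformed point $(\gamma, \theta, \lambda) = (\beta/\sigma, 1/\sigma, \lambda)$ should give the same value. The sketch below is illustrative only; the function names and the three data points are hypothetical, and the normal CDF is computed from the standard library's complementary error function.

```python
import math


def norm_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def loglik_original(beta, sigma, lam, y, X):
    """Half-normal frontier log-likelihood in (beta, sigma, lambda)."""
    total = 0.0
    for yi, xi in zip(y, X):
        eps = yi - sum(b * x for b, x in zip(beta, xi))
        total += (-math.log(sigma) + 0.5 * math.log(2.0 / math.pi)
                  - 0.5 * (eps / sigma) ** 2
                  + math.log(norm_cdf(-eps * lam / sigma)))
    return total


def loglik_transformed(gamma, theta, lam, y, X):
    """Same log-likelihood in (gamma, theta, lambda), theta = 1/sigma."""
    total = 0.0
    for yi, xi in zip(y, X):
        a = theta * yi - sum(g * x for g, x in zip(gamma, xi))  # eps_i / sigma
        total += (math.log(theta) + 0.5 * math.log(2.0 / math.pi)
                  - 0.5 * a * a + math.log(norm_cdf(-lam * a)))
    return total


# Hypothetical data and parameter values, for illustration only.
y = [1.4, 1.9, 2.6]
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
beta, sigma, lam = [1.0, 0.5], 0.3, 1.2
gamma, theta = [b / sigma for b in beta], 1.0 / sigma
ll_a = loglik_original(beta, sigma, lam, y, X)
ll_b = loglik_transformed(gamma, theta, lam, y, X)
```

    Because the two functions agree at corresponding points, maximizing either one leads to the same fitted model; the transformed version is used only because its derivatives are simpler.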

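    As you could verify by trying the derivations, this transformation brings a dramatic simplification in the manipulation of the log-likelihood and its derivatives. We will make repeated use of the functions

    $$\alpha_i = \varepsilon_i/\sigma = \theta y_i - \gamma' x_i,$$

    $$\delta(y_i, x_i, \lambda, \theta, \gamma) = \frac{\phi(-\lambda\alpha_i)}{\Phi(-\lambda\alpha_i)} = \delta_i,$$

    $$\Delta_i = -\delta_i(-\lambda\alpha_i + \delta_i).$$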


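    The second of these is the derivative of the function in the final term in $\log L$. The third is the derivative of $\delta_i$ with respect to its argument; $\Delta_i < 0$ for all values of $\lambda\alpha_i$. It will also be convenient to define the $(K + 1) \times 1$ column vectors $z_i = (x_i', -y_i)'$ and $t_i = (0', 1/\theta)'$. The likelihood equations are

    $$\frac{\partial \ln L}{\partial(\gamma', \theta)'} = \sum_{i=1}^n t_i + \sum_{i=1}^n \alpha_i z_i + \lambda\sum_{i=1}^n \delta_i z_i = 0,$$

    $$\frac{\partial \ln L}{\partial\lambda} = -\sum_{i=1}^n \delta_i\alpha_i = 0,$$

    and the second derivatives are

    $$H(\gamma, \theta, \lambda) = \sum_{i=1}^n \left\{ \begin{bmatrix} (\lambda^2\Delta_i - 1) z_i z_i' & (\delta_i - \lambda\alpha_i\Delta_i) z_i \\ (\delta_i - \lambda\alpha_i\Delta_i) z_i' & \alpha_i^2\Delta_i \end{bmatrix} - \begin{bmatrix} t_i t_i' & 0 \\ 0' & 0 \end{bmatrix} \right\}.$$

    The estimator of the asymptotic covariance matrix for the directly estimated parameters is

    $$\text{Est.Asy.Var}[\hat\gamma', \hat\theta, \hat\lambda]' = \{-H[\hat\gamma', \hat\theta, \hat\lambda]\}^{-1}.$$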



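    There are two sets of transformations of the parameters in our formulation. In order to recover estimates of the original structural parameters $\sigma = 1/\theta$ and $\beta = \gamma/\theta$, we need only transform the MLEs. Since these transformations are one to one, the MLEs of $\sigma$ and $\beta$ are $1/\hat\theta$ and $\hat\gamma/\hat\theta$. To compute an asymptotic covariance matrix for these estimators, we will use the delta method, which will use the derivative matrix

    $$G = \begin{bmatrix} \partial\hat\beta/\partial\hat\gamma' & \partial\hat\beta/\partial\hat\theta & \partial\hat\beta/\partial\hat\lambda \\ \partial\hat\sigma/\partial\hat\gamma' & \partial\hat\sigma/\partial\hat\theta & \partial\hat\sigma/\partial\hat\lambda \\ \partial\hat\lambda/\partial\hat\gamma' & \partial\hat\lambda/\partial\hat\theta & \partial\hat\lambda/\partial\hat\lambda \end{bmatrix} = \begin{bmatrix} (1/\hat\theta)I & -(1/\hat\theta^2)\hat\gamma & 0 \\ 0' & -(1/\hat\theta^2) & 0 \\ 0' & 0 & 1 \end{bmatrix}.$$

    Then, for the recovered parameters, we use

    $$\text{Est.Asy.Var}[\hat\beta', \hat\sigma, \hat\lambda]' = G \times \{-H[\hat\gamma', \hat\theta, \hat\lambda]\}^{-1} \times G'.$$

    For the half-normal model, we would also rely on the invariance of maximum likelihood estimators to recover estimates of the deeper variance parameters, $\sigma_v^2 = \sigma^2/(1 + \lambda^2)$ and $\sigma_u^2 = \sigma^2\lambda^2/(1 + \lambda^2)$.

    The stochastic frontier model is a bit different from those we have analyzed previously in that the disturbance is the central focus of the analysis rather than the catchall for the unknown and unknowable factors omitted from the equation. Ideally, we would like to estimate $u_i$ for each firm in the sample to compare them on the basis of their productive efficiency. (The parameters of the production function are usually of secondary interest in these studies.) Unfortunately, the data do not permit a direct estimate, since with estimates of $\beta$ in hand, we are only able to compute a direct estimate of $\varepsilon = y - x'\beta$. Jondrow et al. (1982), however, have derived a useful approximation that is now the standard measure in these settings,

    $$E[u \mid \varepsilon] = \frac{\sigma\lambda}{1 + \lambda^2}\left[\frac{\phi(z)}{1 - \Phi(z)} - z\right], \qquad z = \frac{\varepsilon\lambda}{\sigma},$$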


    TABLE 17.3  Estimated Stochastic Frontier Functions

                        Least Squares                  Half-Normal Model               Exponential Model
                              Standard                       Standard                        Standard
    Coefficient    Estimate   Error     t Ratio   Estimate   Error     t Ratio   Estimate   Error     t Ratio
    Constant        1.844     0.234     7.896      2.081     0.422     4.933      2.069     0.290     7.135
    $\beta_k$       0.245     0.107     2.297      0.259     0.144     1.800      0.262     0.120     2.184
    $\beta_l$       0.805     0.126     6.373      0.780     0.170     4.595      0.770     0.138     5.581
    $\sigma$        0.236                          0.282     0.087     3.237
    $\sigma_u$                                     0.222                          0.136
    $\sigma_v$                                     0.190                          0.171     0.054     3.170
    $\lambda$                                      1.265     1.620     0.781
    $\theta$                                                                      7.398     3.931     1.882
    log L           2.2537                         2.4695                         2.8605

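    for the half-normal model, and

    $$E[u \mid \varepsilon] = z + \sigma_v\frac{\phi(z/\sigma_v)}{\Phi(z/\sigma_v)}, \qquad z = -\varepsilon - \theta\sigma_v^2,$$

    for the exponential model. These values can be computed using the maximum likelihood estimates of the structural parameters in the model. In addition, a structural parameter of interest is the proportion of the total variance of $\varepsilon$ that is due to the inefficiency term. For the half-normal model, $\text{Var}[\varepsilon] = \text{Var}[u] + \text{Var}[v] = (1 - 2/\pi)\sigma_u^2 + \sigma_v^2$, whereas for the exponential model, the counterpart is $1/\theta^2 + \sigma_v^2$.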
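    Both conditional means are simple to evaluate once the structural parameters have been estimated. The following is a minimal sketch, not the computation behind Table 17.4: the function names are hypothetical, the parameter values in the usage lines are round numbers chosen for illustration, and the sign convention $z = -\varepsilon - \theta\sigma_v^2$ in the exponential case assumes $\varepsilon = v - u$ as defined in this section.

```python
import math


def norm_pdf(z):
    """Standard normal density, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def inefficiency_half_normal(eps, sigma, lam):
    """E[u | eps] for the half-normal model, with z = eps * lambda / sigma."""
    z = eps * lam / sigma
    hazard = norm_pdf(z) / norm_cdf(-z)  # phi(z) / (1 - Phi(z))
    return sigma * lam / (1.0 + lam * lam) * (hazard - z)


def inefficiency_exponential(eps, theta, sigma_v):
    """E[u | eps] for the exponential model, assuming eps = v - u."""
    z = -eps - theta * sigma_v ** 2
    return z + sigma_v * norm_pdf(z / sigma_v) / norm_cdf(z / sigma_v)


# Illustrative values only: a residual of zero with sigma = lambda = 1.
u_at_zero = inefficiency_half_normal(0.0, sigma=1.0, lam=1.0)
u_low = inefficiency_exponential(0.2, theta=7.0, sigma_v=0.17)
u_high = inefficiency_exponential(-0.2, theta=7.0, sigma_v=0.17)
```

    In both models the estimate is positive and falls as the residual rises, so observations far below the fitted frontier are assigned the largest inefficiency.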
    Example 17.7 Stochastic Frontier Model

    Appendix Table F9.2 lists 25 statewide observations used by Zellner and Revankar (1970) to study production in the transportation equipment manufacturing industry. We have used these data to estimate the stochastic frontier models. Results are shown in Table 17.3.[19] The Jondrow et al. (1982) estimates of the inefficiency terms are listed in Table 17.4. The estimates of the parameters of the production function, $\beta_1$, $\beta_2$, and $\beta_3$, are fairly similar, but the variance parameters, $\sigma_u$ and $\sigma_v$, appear to be quite different. Some of the parameter difference is illusory, however. The variance components for the half-normal model are $(1 - 2/\pi)\sigma_u^2 = 0.0179$ and $\sigma_v^2 = 0.0361$, whereas those for the exponential model are $1/\theta^2 = 0.0183$ and $\sigma_v^2 = 0.0293$. In each case, about one-third of the total variance of $\varepsilon$ is accounted for by the variance of $u$.
    17.6.4 CONDITIONAL MOMENT TESTS OF SPECIFICATION

    A spate of studies has shown how to use conditional moment restrictions for specification testing as well as estimation.[20]

    [19] $N$ is the number of establishments in the state. Zellner and Revankar used per establishment data in their study. The stochastic frontier model has the intriguing property that if the least squares residuals are skewed in the positive direction, then least squares with $\lambda = 0$ maximizes the log-likelihood. This property, in fact, characterizes the data above when scaled by $N$. Since that leaves a not particularly interesting example and it does not occur when the data are not normalized, for purposes of this illustration we have used the unscaled data to produce Table 17.3. We do note that this result is a common, vexing occurrence in practice.

    [20] See, for example, Pagan and Vella (1989).

    The logic of the conditional moment (CM) based specification test is as follows. The model specification implies that certain moment restrictions will hold in the population from which the data were drawn. If the specification


    TABLE 17.4  Estimated Inefficiencies

    State          Half-Normal  Exponential    State            Half-Normal  Exponential
    Alabama          0.2011       0.1459       Maryland           0.1353       0.0925
    California       0.1448       0.0972       Massachusetts      0.1564       0.1093
    Connecticut      0.1903       0.1348       Michigan           0.1581       0.1076
    Florida          0.5175       0.5903       Missouri           0.1029       0.0704
    Georgia          0.1040       0.0714       New Jersey         0.0958       0.0659
    Illinois         0.1213       0.0830       New York           0.2779       0.2225
    Indiana          0.2113       0.1545       Ohio               0.2291       0.1698
    Iowa             0.2493       0.2007       Pennsylvania       0.1501       0.1030
    Kansas           0.1010       0.0686       Texas              0.2030       0.1455
    Kentucky         0.0563       0.0415       Virginia           0.1400       0.0968
    Louisiana        0.2033       0.1507       Washington         0.1105       0.0753
    Maine            0.2226       0.1725       West Virginia      0.1556       0.1124
    Wisconsin        0.1407       0.0971

    is correct, then the sample data should mimic the implied relationships. For example, in the classical regression model, the assumption of homoscedasticity implies that the disturbance variance is independent of the regressors. As such,

    $$E\{x_i[(y_i - \beta' x_i)^2 - \sigma^2]\} = E[x_i(\varepsilon_i^2 - \sigma^2)] = 0.$$

    If, on the other hand, the regression is heteroscedastic in a way that depends on $x_i$, then this covariance will not be zero. If the hypothesis of homoscedasticity is correct, then we would expect the sample counterpart to the moment condition,

    $$\bar r = \frac{1}{n}\sum_{i=1}^n x_i(e_i^2 - s^2),$$

    where $e_i$ is the OLS residual, to be close to zero. This computation appears in Breusch and Pagan's LM test for homoscedasticity. (See Section 11.4.3.)

    The practical problems to be solved are (1) to formulate suitable moment conditions that do correspond to the hypothesis test, which is usually straightforward; (2) to devise the appropriate sample counterpart; and (3) to devise a suitable measure of closeness to zero of the sample moment estimator. The last of these will be in the framework of the Wald statistics that we have examined at various points in this book. So the problem will be to devise the appropriate covariance matrix for the sample moments.

    Consider a general case in which the moment condition is written in terms of variables in the model $[y_i, x_i, z_i]$ and parameters $\theta$ (as in the linear regression model). The sample moment can be written

    $$\bar r = \frac{1}{n}\sum_{i=1}^n r_i(y_i, x_i, z_i, \hat\theta) = \frac{1}{n}\sum_{i=1}^n \hat r_i.$$ (17-58)

    The hypothesis is that based on the true E ri 0 Under the null hypothesis that E ri 0 and assuming that plim and that a central limit theorem Theorem D 18 or D 19 applies to n r so that d nr N 0

    Greene 50240

    book

    June 26 2002

    15 8

    CHAPTER 17 Maximum Likelihood Estimation

    507

    for some covariance matrix statistic

    that we have yet to estimate it follows that the Wald
    1 d nr r 2 J

    17 59

    where the degrees of freedom J is the number of moment restrictions being tested and $\hat\Sigma$ is an estimate of $\Sigma$. Thus, the statistic can be referred to the chi-squared table.

    It remains to determine the estimator of $\Sigma$. The full derivation of $\Sigma$ is fairly complicated. [See Pagan and Vella (1989, pp. S32-S33).] But when the vector of parameter estimators is a maximum likelihood estimator, as it would be for the least squares estimator with normally distributed disturbances and for most of the other estimators we consider, a surprisingly simple estimator can be used. Suppose that the parameter vector used to compute the moments above is obtained by solving the equations

    $$\frac{1}{n}\sum_{i=1}^{n} g(y_i, x_i, z_i, \hat\theta) = \frac{1}{n}\sum_{i=1}^{n} \hat g_i = 0, \tag{17-60}$$

    where $\hat\theta$ is the estimated parameter vector [e.g., $(\hat\beta, \hat\sigma)$ in the linear model]. For the linear regression model, that would be the normal equations

    $$\frac{1}{n}X'e = \frac{1}{n}\sum_{i=1}^{n} x_i(y_i - x_i'b) = 0.$$

    Let the matrix G be the n × K matrix with ith row equal to $\hat g_i'$. In a maximum likelihood problem, G is the matrix of derivatives of the individual terms in the log-likelihood function with respect to the parameters. This is the G used to compute the BHHH estimator of the information matrix. [See (17-18).] Let R be the n × J matrix whose ith row is $\hat r_i'$. Pagan and Vella show that for maximum likelihood estimators, $\Sigma$ can be estimated using²¹

    $$S = \frac{1}{n}\left[R'R - R'G(G'G)^{-1}G'R\right]. \tag{17-61}$$

    This equation looks like an involved matrix computation, but it is simple with any regression program. Each element of S is the mean square or cross-product of the least squares residuals in a linear regression of a column of R on the variables in G.²² Therefore, the operational version of the statistic is

    $$C = n\,\bar r'S^{-1}\bar r = i'R\left[R'R - R'G(G'G)^{-1}G'R\right]^{-1}R'i, \tag{17-62}$$

    where i is an n × 1 column of ones, so that $\bar r = R'i/n$. The statistic, once again, is referred to the appropriate critical value in the chi-squared table. This result provides a joint test that all the moment conditions are satisfied simultaneously. An individual test of just one of the moment restrictions in isolation can be computed even more easily than a joint test. For testing one of the J conditions, say the $\ell$th one, the test can be carried out by a simple t test of whether the constant term is zero in a linear regression of the $\ell$th column of R on a constant term and all the columns of G. In fact, the test statistic in (17-62) could also be obtained by stacking the J columns of R and treating the J equations as a seemingly unrelated regressions model with (i, G) as the (identical) regressors in each equation and then testing the joint hypothesis that all the constant terms are zero. (See Section 14.2.3.)

    21 It might be tempting just to use $(1/n)R'R$. This idea would be incorrect, because S accounts for R being a function of the estimated parameter vector that is converging to its probability limit at the same rate as the sample moments are converging to theirs.

    22 If the estimator is not an MLE, then estimation of $\Sigma$ is more involved but also straightforward using basic matrix algebra. The advantage of (17-62) is that it involves simple sums of variables that have already been computed to obtain $\hat\theta$ and $\bar r$. Note, as well, that if $\theta$ has been estimated by maximum likelihood, then the term $(G'G)^{-1}$ is the BHHH estimator of the asymptotic covariance matrix of $\hat\theta$. If it were more convenient, then this estimator could be replaced with any other appropriate estimator of Asy. Var[$\hat\theta$].
    Example 17.8  Testing for Heteroscedasticity in the Linear Regression Model

    Suppose that the linear model is specified as

    $$y_i = \beta_1 + \beta_2 x_i + \beta_3 z_i + \varepsilon_i.$$

    To test whether

    $$E[z_i^2(\varepsilon_i^2 - \sigma^2)] = 0,$$

    we linearly regress $z_i^2(e_i^2 - s^2)$ on a constant, $e_i$, $x_ie_i$, and $z_ie_i$. A standard t test of whether the constant term in this regression is zero carries out the test. To test the joint hypothesis that there is no heteroscedasticity with respect to both x and z, we would regress both $x_i^2(e_i^2 - s^2)$ and $z_i^2(e_i^2 - s^2)$ on $[1, e_i, x_ie_i, z_ie_i]$ and collect the two columns of residuals in V. Then $S = (1/n)V'V$. The moment vector would be

    $$\bar r = \frac{1}{n}\sum_{i=1}^{n}\begin{bmatrix} x_i^2 \\ z_i^2 \end{bmatrix}(e_i^2 - s^2).$$

    The test statistic would now be

    $$C = n\,\bar r'S^{-1}\bar r = n\,\bar r'\left[\frac{1}{n}V'V\right]^{-1}\bar r.$$

    We will examine other conditional moment tests using this method in Section 22.3.4, where we study the specification of the censored regression model.
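The computations in (17-61) and (17-62) reduce to a few matrix products. The following is a minimal sketch in Python with NumPy, not taken from the text: the function name `cm_test`, the simulated homoscedastic data, and the seed are assumptions made purely for illustration. It builds R from the two moments of Example 17.8 and G from the terms of the least squares normal equations (the scale factor $1/s^2$ in the scores is dropped because it does not affect the projection), then evaluates C in both forms.

```python
import numpy as np

def cm_test(R, G):
    """Conditional moment statistic C = n * rbar' S^{-1} rbar, where
    S = (1/n)[R'R - R'G(G'G)^{-1}G'R] as in (17-61).
    R is n x J (moment terms), G is n x K (score terms)."""
    n = R.shape[0]
    rbar = R.mean(axis=0)
    # Residuals from least squares regressions of each column of R on G.
    coef, *_ = np.linalg.lstsq(G, R, rcond=None)
    V = R - G @ coef
    S = V.T @ V / n
    C = n * rbar @ np.linalg.solve(S, rbar)
    return C, rbar, S

# Hypothetical data for y = b1 + b2*x + b3*z + eps with constant variance.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
z = rng.normal(size=n)
X = np.column_stack([np.ones(n), x, z])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ b                      # OLS residuals
s2 = e @ e / n

R = np.column_stack([x**2, z**2]) * (e**2 - s2)[:, None]
G = X * e[:, None]                 # columns: e_i, x_i*e_i, z_i*e_i

C, rbar, S = cm_test(R, G)

# Second form in (17-62): i'R [R'R - R'G(G'G)^{-1}G'R]^{-1} R'i.
i = np.ones(n)
M = R.T @ R - R.T @ G @ np.linalg.solve(G.T @ G, G.T @ R)
C_alt = i @ R @ np.linalg.solve(M, R.T @ i)
```

Under these assumptions `C` and `C_alt` agree up to rounding, and either would be compared with the chi-squared critical value with J = 2 degrees of freedom.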

    17.7  TWO-STEP MAXIMUM LIKELIHOOD ESTIMATION

    The applied literature contains a large and increasing number of models in which one model is embedded in another, which produces what are broadly known as "two-step" estimation problems. Consider an (admittedly contrived) example in which we have the following.

    Model 1. Expected number of children = $E[y_1 \mid x_1, \theta_1]$.
    Model 2. Decision to enroll in job training = $y_2$, a function of $(x_2, \theta_2, E[y_1 \mid x_1, \theta_1])$.

    There are two parameter vectors, $\theta_1$ and $\theta_2$. The first appears in the second model, although not the reverse. In such a situation, there are two ways to proceed. Full information maximum likelihood (FIML) estimation would involve forming the joint distribution $f(y_1, y_2 \mid x_1, x_2, \theta_1, \theta_2)$ of the two random variables and then maximizing the full log-likelihood function,

    $$\ln L = \sum_{i=1}^{n} \ln f(y_{i1}, y_{i2} \mid x_{i1}, x_{i2}, \theta_1, \theta_2).$$

    A second, or two-step, limited information maximum likelihood (LIML) procedure for this kind of model could be done by estimating the parameters of model 1, since it does not involve $\theta_2$, and then maximizing a conditional log-likelihood function using the estimates from Step 1:

    $$\ln \hat L = \sum_{i=1}^{n} \ln f[y_{i2} \mid x_{i2}, \theta_2, (x_{i1}, \hat\theta_1)].$$

    There are at least two reasons one might proceed in this fashion. First, it may be straightforward to formulate the two separate log-likelihoods, but very complicated to derive the joint distribution. This situation frequently arises when the two variables being modeled are from different kinds of populations, such as one discrete and one continuous (which is a very common case in this framework). The second reason is that maximizing the separate log-likelihoods may be fairly straightforward, but maximizing the joint log-likelihood may be numerically complicated or difficult.²³ We will consider a few examples. Although we will encounter FIML problems at various points later in the book, for now we will present some basic results for two-step estimation. Proofs of the results given here can be found in an important reference on the subject, Murphy and Topel (1985).

    Suppose, then, that our model consists of the two marginal distributions, $f_1(y_1 \mid x_1, \theta_1)$ and $f_2(y_2 \mid x_1, x_2, \theta_1, \theta_2)$. Estimation proceeds in two steps.

    1. Estimate $\theta_1$ by maximum likelihood in Model 1. Let $\hat V_1$ be n times any of the estimators of the asymptotic covariance matrix of this estimator that were discussed in Section 17.4.6, so that $(1/n)\hat V_1$ estimates the asymptotic covariance matrix of $\hat\theta_1$.
    2. Estimate $\theta_2$ by maximum likelihood in Model 2, with $\hat\theta_1$ inserted in place of $\theta_1$ as if it were known. Let $\hat V_2$ be n times any appropriate estimator of the asymptotic covariance matrix of $\hat\theta_2$, so that $(1/n)\hat V_2$ is the uncorrected estimator.

    The argument for consistency of $\hat\theta_2$ is essentially that if $\theta_1$ were known, then all our results for MLEs would apply for estimation of $\theta_2$, and since plim $\hat\theta_1 = \theta_1$, asymptotically, this line of reasoning is correct. But the same line of reasoning is not sufficient to justify using $(1/n)\hat V_2$ as the estimator of the asymptotic covariance matrix of $\hat\theta_2$. Some correction is necessary to account for an estimate of $\theta_1$ being used in estimation of $\theta_2$. The essential result is the following.

    23 There is a third possible motivation. If either model is misspecified, then the FIML estimates of both models will be inconsistent. But if only the second is misspecified, at least the first will be estimated consistently. Of course, this result is only "half a loaf," but it may be better than none.

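    THEOREM 17.8  Asymptotic Distribution of the Two-Step MLE [Murphy and Topel (1985)]
    If the standard regularity conditions are met for both log-likelihood functions, then the second-step maximum likelihood estimator of $\theta_2$ is consistent and asymptotically normally distributed with asymptotic covariance matrix

    $$V_2^* = \frac{1}{n}\left[V_2 + V_2\left[CV_1C' - RV_1C' - CV_1R'\right]V_2\right],$$

    where

    $$V_1 = \text{Asy. Var}[\sqrt{n}(\hat\theta_1 - \theta_1)] \text{ based on } \ln L_1,$$

    $$V_2 = \text{Asy. Var}[\sqrt{n}(\hat\theta_2 - \theta_2)] \text{ based on } \ln L_2 \mid \theta_1,$$

    $$C = E\left[\frac{1}{n}\left(\frac{\partial \ln L_2}{\partial \theta_2}\right)\left(\frac{\partial \ln L_2}{\partial \theta_1'}\right)\right], \qquad R = E\left[\frac{1}{n}\left(\frac{\partial \ln L_2}{\partial \theta_2}\right)\left(\frac{\partial \ln L_1}{\partial \theta_1'}\right)\right].$$

    The correction of the asymptotic covariance matrix at the second step requires some additional computation. Matrices $V_1$ and $V_2$ are estimated by the respective uncorrected covariance matrices. Typically, the BHHH estimators,

    $$\hat V_1 = \left[\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial \ln f_{i1}}{\partial \hat\theta_1}\right)\left(\frac{\partial \ln f_{i1}}{\partial \hat\theta_1'}\right)\right]^{-1} \quad\text{and}\quad \hat V_2 = \left[\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial \ln f_{i2}}{\partial \hat\theta_2}\right)\left(\frac{\partial \ln f_{i2}}{\partial \hat\theta_2'}\right)\right]^{-1},$$

    are used. The matrices R and C are obtained by summing the individual observations on the cross products of the derivatives. These are estimated with

    $$\hat C = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial \ln f_{i2}}{\partial \hat\theta_2}\right)\left(\frac{\partial \ln f_{i2}}{\partial \hat\theta_1'}\right) \quad\text{and}\quad \hat R = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial \ln f_{i2}}{\partial \hat\theta_2}\right)\left(\frac{\partial \ln f_{i1}}{\partial \hat\theta_1'}\right).$$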
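Once the observation-level derivatives are in hand, the corrected matrix in Theorem 17.8 is a few lines of matrix algebra. The sketch below (Python with NumPy) is illustrative only. The function name `murphy_topel` and the argument names are assumptions, not notation from the text, and the arrays in the usage lines are random placeholders rather than scores from a fitted model. Each argument holds one row per observation: `g1` contains the derivatives of $\ln f_{i1}$ with respect to $\theta_1$, `g2` the derivatives of $\ln f_{i2}$ with respect to $\theta_2$, and `g21` the derivatives of $\ln f_{i2}$ with respect to $\theta_1$, all evaluated at the estimates.

```python
import numpy as np

def murphy_topel(g1, g2, g21):
    """Return (corrected, uncorrected) estimates of Asy.Var[theta2_hat].
    g1: n x K1, g2: n x K2, g21: n x K1 arrays of per-observation scores."""
    n = g1.shape[0]
    V1 = np.linalg.inv(g1.T @ g1 / n)    # BHHH estimator, step 1
    V2 = np.linalg.inv(g2.T @ g2 / n)    # BHHH estimator, step 2
    C = g2.T @ g21 / n                   # K2 x K1
    R = g2.T @ g1 / n                    # K2 x K1
    middle = C @ V1 @ C.T - R @ V1 @ C.T - C @ V1 @ R.T
    V2_star = (V2 + V2 @ middle @ V2) / n
    return V2_star, V2 / n

# Placeholder score arrays, used only to exercise the function.
rng = np.random.default_rng(1)
g1 = rng.normal(size=(200, 2))
g2 = rng.normal(size=(200, 3))
g21 = rng.normal(size=(200, 2))

V_corr, V_unc = murphy_topel(g1, g2, g21)

# If the second-step likelihood does not depend on theta1, C = 0 and
# the correction vanishes.
V_same, _ = murphy_topel(g1, g2, np.zeros_like(g21))
```

Returning the uncorrected matrix alongside the corrected one makes it easy to see how much the first-step estimation error changes the standard errors in a given application.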

    Example 17.9  Two-Step ML Estimation

    Continuing the example discussed at the beginning of this section, we suppose that $y_{i2}$ is a binary indicator of the choice whether to enroll in the program ($y_{i2} = 1$) or not ($y_{i2} = 0$) and that the probabilities of the two outcomes are

    $$\text{Prob}[y_{i2} = 1 \mid x_{i1}, x_{i2}] = \frac{e^{x_{i2}'\beta + \gamma E[y_{i1} \mid x_{i1}]}}{1 + e^{x_{i2}'\beta + \gamma E[y_{i1} \mid x_{i1}]}}$$

    and Prob[$y_{i2} = 0 \mid x_{i1}, x_{i2}$] = 1 − Prob[$y_{i2} = 1 \mid x_{i1}, x_{i2}$], where $x_{i2}$ is a set of covariates that might influence the decision, such as marital status or age, and $x_{i1}$ are determinants of family size. This setup is a logit model. We will develop this model more fully in Chapter 21. The expected value of $y_{i1}$ appears in the probability. (Remark: The expected, rather than the actual, value was chosen deliberately. Otherwise, the models would differ substantially. In our case, we might view the difference as that between an ex ante decision and an ex post one.) Suppose that the number of children can be described by a Poisson distribution (see Section B.4.8) dependent on some variables $x_{i1}$ such as education, age, and so on. Then

    $$\text{Prob}[y_{i1} = j \mid x_{i1}] = \frac{e^{-\lambda_i}\lambda_i^{j}}{j!}, \quad j = 0, 1, \ldots,$$

    and suppose, as is customary, that $E[y_{i1}] = \lambda_i = \exp(x_{i1}'\alpha)$. The models involve $\theta = [\alpha, \beta, \gamma]$, where $\theta_1 = \alpha$. In fact, it is unclear what the joint distribution of $y_1$ and $y_2$ might be, but two-step estimation is straightforward. For model 1, the log-likelihood and its first derivatives are

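    $$\ln L_1 = \sum_{i=1}^{n} \ln f_1(y_{i1} \mid x_{i1}, \alpha) = \sum_{i=1}^{n}\left[-\lambda_i + y_{i1}\ln\lambda_i - \ln y_{i1}!\right] = \sum_{i=1}^{n}\left[-\exp(x_{i1}'\alpha) + y_{i1}(x_{i1}'\alpha) - \ln y_{i1}!\right],$$

    $$\frac{\partial \ln L_1}{\partial \alpha} = \sum_{i=1}^{n}(y_{i1} - \lambda_i)x_{i1} = \sum_{i=1}^{n} u_i x_{i1}.$$

    Computation of the estimates is developed in Chapter 21. Any of the three estimators of $V_1$ is also easy to compute, but the BHHH estimator is most convenient, so we use

    $$\hat V_1 = \left[\frac{1}{n}\sum_{i=1}^{n}\hat u_i^2 x_{i1}x_{i1}'\right]^{-1}.$$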



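    (In this and the succeeding summations, we are actually estimating expectations of the various matrices.)

    We can write the density function for the second model as

    $$f_2(y_{i2} \mid x_{i1}, x_{i2}, \beta, \gamma, \alpha) = P_i^{\,y_{i2}} \times (1 - P_i)^{1 - y_{i2}},$$

    where $P_i$ = Prob[$y_{i2} = 1 \mid x_{i1}, x_{i2}$] as given earlier. Then

    $$\ln L_2 = \sum_{i=1}^{n} y_{i2}\ln P_i + (1 - y_{i2})\ln(1 - P_i).$$

    For convenience, let $\hat x_{i2}^* = [x_{i2}', \exp(x_{i1}'\hat\alpha)]'$ and recall that $\theta_2 = [\beta', \gamma]'$. Then

    $$\ln \hat L_2 = \sum_{i=1}^{n} y_{i2}\left[\hat x_{i2}^{*\prime}\theta_2 - \ln(1 + \exp(\hat x_{i2}^{*\prime}\theta_2))\right] + (1 - y_{i2})\left[-\ln(1 + \exp(\hat x_{i2}^{*\prime}\theta_2))\right].$$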

    So, at the second step, we create the additional variable, append it to $x_{i2}$, and estimate the logit model as if $\alpha$ (and this additional variable) were actually observed instead of estimated. The maximum likelihood estimates of $[\beta, \gamma]$ are obtained by maximizing this function. (See Chapter 21.) After a bit of manipulation, we find the convenient result that

    $$\frac{\partial \ln \hat L_2}{\partial \theta_2} = \sum_{i=1}^{n}(y_{i2} - P_i)\hat x_{i2}^* = \sum_{i=1}^{n} v_i \hat x_{i2}^*.$$

    Once again, any of the three estimators could be used for estimating the asymptotic covariance matrix, but the BHHH estimator is convenient, so we use

    $$\hat V_2 = \left[\frac{1}{n}\sum_{i=1}^{n}\hat v_i^2 \hat x_{i2}^* \hat x_{i2}^{*\prime}\right]^{-1}.$$

    For the final step, we must correct the asymptotic covariance matrix using $\hat C$ and $\hat R$. What remains to derive (the few lines are left for the reader) is

    $$\frac{\partial \ln L_2}{\partial \alpha} = \sum_{i=1}^{n} v_i[\gamma \exp(x_{i1}'\alpha)]x_{i1}.$$

    So, using our estimates,

    $$\hat C = \frac{1}{n}\sum_{i=1}^{n}\hat v_i^2[\hat\gamma \exp(x_{i1}'\hat\alpha)]\hat x_{i2}^* x_{i1}' \quad\text{and}\quad \hat R = \frac{1}{n}\sum_{i=1}^{n}\hat u_i \hat v_i \hat x_{i2}^* x_{i1}'.$$

    We can now compute the correction.
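The derivative of $\ln L_2$ with respect to $\alpha$ follows from the chain rule. As a sketch in the notation of this example, write the logit index as $w_i = x_{i2}'\beta + \gamma\lambda_i$ with $\lambda_i = \exp(x_{i1}'\alpha)$; the symbol $w_i$ is introduced here only for the derivation and does not appear in the text.

```latex
\begin{aligned}
\ln L_2 &= \sum_{i=1}^{n}\bigl[y_{i2}\,w_i - \ln(1 + e^{w_i})\bigr], \\
\frac{\partial \ln L_2}{\partial w_i} &= y_{i2} - \frac{e^{w_i}}{1 + e^{w_i}} = y_{i2} - P_i = v_i, \\
\frac{\partial w_i}{\partial \alpha} &= \gamma\,\frac{\partial \lambda_i}{\partial \alpha} = \gamma \exp(x_{i1}'\alpha)\,x_{i1}, \\
\frac{\partial \ln L_2}{\partial \alpha} &= \sum_{i=1}^{n} v_i\,[\gamma \exp(x_{i1}'\alpha)]\,x_{i1}.
\end{aligned}
```

The ith term of $\hat C$ is the outer product of the $\theta_2$ score, $\hat v_i\hat x_{i2}^*$, with this $\alpha$ score, which is where the factor $\hat v_i^2$ comes from.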

    In many applications, the covariance of the two gradients, R, converges to zero. When the first- and second-step estimates are based on different samples, R is exactly zero. For example, in our application above, $\hat R = \frac{1}{n}\sum_{i=1}^{n}\hat u_i\hat v_i\hat x_{i2}^*x_{i1}'$. The two "residuals," u and v, may well be uncorrelated. This assumption must be checked on a model-by-model basis, but in such an instance, the third and fourth terms in $V_2^*$ vanish asymptotically and what remains is the simpler alternative,

    $$V_2^{**} = \frac{1}{n}\left[V_2 + V_2CV_1C'V_2\right].$$

    We will examine some additional applications of this technique (including an empirical implementation of the preceding example) later in the book. Perhaps the most common application of two-step maximum likelihood estimation in the current literature, especially in regression analysis, involves inserting a prediction of one variable into a function that describes the behavior of another.

    17.8  MAXIMUM SIMULATED LIKELIHOOD ESTIMATION

    The technique of maximum simulated likelihood (MSL) is essentially a classical sampling theory counterpart to the hierarchical Bayesian estimator we considered in Section 16.2.4. Since the celebrated paper of Berry, Levinsohn, and Pakes (1995), and a related literature advocated by McFadden and Train (2000), maximum simulated likelihood estimation has been used in a large and growing number of studies based on log-likelihoods that involve integrals that are expectations.²⁴ In this section, we will lay out some general results for MSL estimation by developing a particular application, the random parameters model. This general modeling framework has been used in the majority of the received applications. We will then continue the application to the discrete choice model for panel data that we began in Section 16.2.4.

    The density of $y_{it}$ when the parameter vector is $\beta_i$ is $f(y_{it} \mid x_{it}, \beta_i)$. The parameter vector $\beta_i$ is randomly distributed over individuals according to

    $$\beta_i = \beta + \Delta z_i + v_i,$$

    where $\beta + \Delta z_i$ is the mean of the distribution, which depends on time-invariant individual characteristics as well as parameters yet to be estimated, and the random variation comes from the individual heterogeneity, $v_i$. This random vector is assumed to have mean zero and covariance matrix $\Sigma$. The conditional density of the parameters is denoted

    $$g(\beta_i \mid z_i, \beta, \Delta, \Sigma) = g(v_i + \beta + \Delta z_i, \Sigma),$$

    24 A major reference for this set of techniques is Gourieroux and Monfort (1996).

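    where $g(\cdot)$ is the underlying marginal density of the heterogeneity. For the T observations in group i, the joint conditional density is

    $$f(y_i \mid X_i, \beta_i) = \prod_{t=1}^{T} f(y_{it} \mid x_{it}, \beta_i).$$

    The unconditional density for $y_i$ is obtained by integrating over $\beta_i$,

    $$f(y_i \mid X_i, z_i, \beta, \Delta, \Sigma) = E_{\beta_i}[f(y_i \mid X_i, \beta_i)] = \int_{\beta_i} f(y_i \mid X_i, \beta_i)\,g(\beta_i \mid z_i, \beta, \Delta, \Sigma)\,d\beta_i.$$

    Collecting terms, and making the transformation from $\beta_i$ to $v_i$, the true log-likelihood would be

    $$\ln L = \sum_{i=1}^{n}\ln\left\{\int_{v_i}\left[\prod_{t=1}^{T} f(y_{it} \mid x_{it}, \beta + \Delta z_i + v_i)\right]g(v_i \mid \Sigma)\,dv_i\right\} = \sum_{i=1}^{n}\ln\left\{\int_{v_i} f(y_i \mid X_i, \beta + \Delta z_i + v_i)\,g(v_i \mid \Sigma)\,dv_i\right\}.$$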

    Each of the n terms involves an expectation over $v_i$. The end result of the integration is a function of $(\beta, \Delta, \Sigma)$, which is then maximized.

    As in the previous applications, it will not be possible to maximize the log-likelihood in this form because there is no closed form for the integral. We have considered two approaches to maximizing such a log-likelihood. In the latent class formulation, it is assumed that the parameter vector takes one of a discrete set of values, and the log-likelihood is maximized over this discrete distribution as well as the structural parameters. (See Section 16.2.3.) The hierarchical Bayes procedure used Markov Chain Monte Carlo methods to sample from the joint posterior distribution of the underlying parameters and used the empirical mean of the sample of draws as the estimator. We now consider a third approach to estimating the parameters of a model of this form, maximum simulated likelihood estimation.

    The terms in the log-likelihood are each of the form

    $$\ln L_i = \ln E_{v_i}[f(y_i \mid X_i, \beta + \Delta z_i + v_i)].$$

    As noted, we do not have a closed form for this function, so we cannot compute it directly. Suppose we could sample randomly from the distribution of $v_i$. If an appropriate law of large numbers can be applied, then

    $$\lim_{R \to \infty}\frac{1}{R}\sum_{r=1}^{R} f(y_i \mid X_i, \beta + \Delta z_i + v_{ir}) = E_{v_i}[f(y_i \mid X_i, \beta + \Delta z_i + v_i)],$$

    where $v_{ir}$ is the rth random draw from the distribution. This suggests a strategy for computing the log-likelihood. We can substitute this approximation to the expectation into the log-likelihood function. With sufficient random draws, the approximation can be made as close to the true function as desired. [The theory for this approach is discussed in Gourieroux and Monfort (1996), Bhat (1999), and Train (1999, 2002). Practical details on applications of the method are given in Greene (2001).] A detail to add concerns how to sample from the distribution of $v_i$. There are many possibilities, but for now, we consider the simplest case, the multivariate normal distribution. Write $\Sigma$ in the Cholesky form $\Sigma = LL'$, where L is a lower triangular matrix. Now, let $u_{ir}$ be a vector of K independent draws from the standard normal distribution. Then a draw from the multivariate distribution with covariance matrix $\Sigma$ is simply $v_{ir} = Lu_{ir}$. The simulated log-likelihood is

    $$\ln L_S = \sum_{i=1}^{n}\ln\left\{\frac{1}{R}\sum_{r=1}^{R}\left[\prod_{t=1}^{T} f(y_{it} \mid x_{it}, \beta + \Delta z_i + Lu_{ir})\right]\right\}.$$

    The resulting function is maximized with respect to $\beta$, $\Delta$, and L. This is obviously not a simple calculation, but it is feasible, and much easier than trying to manipulate the integrals directly. In fact, for most problems to which this method has been applied, the computations are surprisingly simple. The intricate part is obtaining the function and its derivatives. But the functions are usually index function models that involve $x_{it}'\beta_i$, which greatly simplifies the derivations.

    Inference in this setting does not involve any new results. The estimated asymptotic covariance matrix for the estimated parameters is computed by manipulating the derivatives of the simulated log-likelihood. The Wald and likelihood ratio statistics are also computed the way they would usually be. As before, we are interested in estimating person-specific parameters. A prior estimate might simply use $\hat\beta + \hat\Delta z_i$, but this would not use all the information in the sample. A posterior estimate would compute

    $$\hat E_{v_i}[\beta_i \mid \beta, \Delta, z_i, \Sigma] = \frac{\sum_{r=1}^{R}\hat\beta_{ir}\,f(y_i \mid X_i, \hat\beta_{ir})}{\sum_{r=1}^{R} f(y_i \mid X_i, \hat\beta_{ir})}, \qquad \hat\beta_{ir} = \hat\beta + \hat\Delta z_i + \hat L u_{ir}.$$

    Mechanical details on computing the MSLE are omitted. The interested reader is referred to Gourieroux and Monfort (1996), Train (2000, 2002), and Greene (2001, 2002) for details.
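To make the simulated log-likelihood concrete, the following is a minimal sketch in Python with NumPy for a probit with $\beta_i = \beta + Lu_i$ (that is, with no $z_i$ term). It is not the estimation code behind any result in the text: the function names, array layout, simulated data, and parameter values are all assumptions made for illustration, and the function is only evaluated, not maximized. The standard normal draws `U` are generated once and held fixed, which is what a numerical optimizer would need across iterations.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_cdf(a):
    """Standard normal CDF, elementwise."""
    return 0.5 * (1.0 + _erf(a / math.sqrt(2.0)))

def simulated_loglik(beta, L, y, X, U):
    """Simulated log-likelihood and posterior-type estimates of beta_i.
    y: (n, T) array of 0/1 outcomes; X: (n, T, K) regressors;
    U: (n, R, K) standard normal draws; L: (K, K) lower triangular."""
    V = U @ L.T                                     # v_ir = L u_ir
    beta_ir = beta + V                              # (n, R, K)
    index = np.einsum('ntk,nrk->nrt', X, beta_ir)   # x_it' beta_ir
    q = 2.0 * y - 1.0
    probs = norm_cdf(q[:, None, :] * index)         # (n, R, T)
    f_ir = probs.prod(axis=2)                       # product over t
    sim_lik = f_ir.mean(axis=1)                     # average over r
    # Likelihood-weighted average of the draws for each individual.
    post = (beta_ir * f_ir[:, :, None]).sum(axis=1) / f_ir.sum(axis=1)[:, None]
    return np.log(sim_lik).sum(), post

# Hypothetical panel: n individuals, T periods, K coefficients, R draws.
rng = np.random.default_rng(0)
n, T, K, R = 50, 5, 2, 100
X = rng.normal(size=(n, T, K))
X[:, :, 0] = 1.0                                    # constant term
beta = np.array([0.2, 0.5])
L = np.array([[0.5, 0.0], [0.1, 0.3]])
y = (np.einsum('ntk,k->nt', X, beta) + rng.normal(size=(n, T)) > 0).astype(float)
U = rng.normal(size=(n, R, K))

ll_sim, beta_post = simulated_loglik(beta, L, y, X, U)

# With L = 0 there is no heterogeneity, and the simulated function
# collapses to the ordinary probit log-likelihood.
ll_zero, _ = simulated_loglik(beta, np.zeros((K, K)), y, X, U)
ll_probit = np.log(norm_cdf((2.0 * y - 1.0) * np.einsum('ntk,k->nt', X, beta))).sum()
```

The check with L = 0 is a useful sanity test before handing the function to an optimizer, since any indexing error in the draws would break the equality.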
    Example 17.10  Maximum Simulated Likelihood Estimation of a Binary Choice Model

    We continue Example 16.5, where estimates of a binary choice model for product innovation are obtained. The model is for Prob[$y_{it} = 1 \mid x_{it}, \beta_i$], where

    $y_{it}$ = 1 if firm i realized a product innovation in year t and 0 if not.

    The independent variables in the model are

    $x_{it1}$ = constant,
    $x_{it2}$ = log of sales,
    $x_{it3}$ = relative size = ratio of employment in business unit to employment in the industry,
    $x_{it4}$ = ratio of industry imports to (industry sales + imports),
    $x_{it5}$ = ratio of industry foreign direct investment to (industry sales + imports),
    $x_{it6}$ = productivity = ratio of industry value added to industry employment,
    $x_{it7}$ = dummy variable indicating the firm is in the raw materials sector,
    $x_{it8}$ = dummy variable indicating the firm is in the investment goods sector.

    The sample consists of 1,270 German manufacturing firms observed for five years, 1984-1988. The density that enters the log-likelihood is

    $$f(y_{it} \mid x_{it}, \beta_i) = \text{Prob}[y_{it} \mid x_{it}'\beta_i] = \Phi[(2y_{it} - 1)x_{it}'\beta_i], \quad y_{it} = 0, 1,$$

    where

    $$\beta_i = \beta + v_i, \quad v_i \sim N[0, \Sigma].$$

    To be consistent with Bertschek and Lechner (1998), we did not fit any firm-specific, time-invariant components in the main equation for $\beta_i$.

    Table 17.5 presents the estimated coefficients for the basic probit model in the first column. The estimates of the means, $\beta$, are shown in the second column. There appear to be large differences in the parameter estimates, though this can be misleading since there is large variation across the firms in the posterior estimates. The third column presents the square roots of the implied diagonal elements of $\Sigma$, computed as the diagonal elements of $LL'$. These estimated standard deviations are for the underlying distribution of the parameter in the model; they are not estimates of the standard deviation of the sampling distribution of the estimator. For the mean parameter, that is shown in parentheses in the second column. The fourth column presents the sample means and standard deviations of the 1,270 estimated posterior estimates of the coefficients. The last column repeats the estimates for the latent class model. The agreement in the two sets of estimates is striking in view of the crude approximation given by the latent class model.

    Figures 17.4a and b present kernel density estimators of the firm-specific probabilities computed at the 5-year means for the random parameters model and with the original probit estimates. The estimated probabilities are strikingly similar to the latent class model, and also fairly similar to, though smoother than, the probit estimates.

    TABLE 17.5  Estimated Random Parameters Model

    Variable    Probit          RP Means        RP Std. Devs.   Empirical Distn.   Posterior
    Constant    1.96 (0.23)     3.91 (0.20)     2.70            3.27 (0.57)        3.38 (2.14)
    lnSales     0.18 (0.022)    0.36 (0.019)    0.28            0.32 (0.15)        0.34 (0.09)
    Rel.Size    1.07 (0.14)     6.01 (0.22)     5.99            3.33 (2.25)        2.58 (1.30)
    Import      1.13 (0.15)     1.51 (0.13)     0.84            2.01 (0.58)        1.81 (0.74)
    FDI         2.85 (0.40)     3.81 (0.33)     6.51            3.76 (1.69)        3.63 (1.98)
    Prod        2.34 (0.72)     5.10 (0.73)     13.03           8.15 (8.29)        5.48 (1.78)
    RawMtls     0.28 (0.081)    0.31 (0.075)    1.65            0.18 (0.57)        0.08 (0.37)
    Invest      0.19 (0.039)    0.27 (0.032)    1.42            0.27 (0.38)        0.29 (0.13)
    ln L        -4114.05        -3498.654
    FIGURE 17.4a  Probit Probabilities. [Kernel density estimate for PPR; horizontal axis: PPR, vertical axis: Density.]

    FIGURE 17.4b  Random Parameters Probabilities. [Kernel density estimate for PRI; horizontal axis: PRI, vertical axis: Density.]


    Figure 17.5 shows the kernel density estimate for the firm-specific estimates of the log sales coefficient. The comparison to Figure 16.5 shows a striking difference. The random parameters model produces estimates that are similar in magnitude, but the distributions are actually quite different. Which should be preferred? Only on the basis that the three-point discrete latent class model is an approximation to the continuous variation model, we would prefer the latter.
    FIGURE 17.5a  Random Parameters, sales. [Kernel density estimate for BS; horizontal axis: BS, vertical axis: Density.]

    FIGURE 17.5b  Latent Class Model, sales. [Kernel density estimate for BSALES; horizontal axis: BSALES, vertical axis: Density.]


    17.9  PSEUDO-MAXIMUM LIKELIHOOD ESTIMATION AND ROBUST ASYMPTOTIC COVARIANCE MATRICES

    Maximum likelihood estimation requires complete specification of the distribution of the observed random variable. If the correct distribution is something other than what we assume, then the likelihood function is misspecified and the desirable properties of the MLE might not hold. This section considers a set of results on an estimation approach that is robust to some kinds of model misspecification. For example, we have found that in a model, if the conditional mean function is $E[y \mid x] = x'\beta$, then certain estimators, such as least squares, are "robust" to specifying the wrong distribution of the disturbances. That is, LS is MLE if the disturbances are normally distributed, but we can still claim some desirable properties for LS, including consistency, even if the disturbances are not normally distributed. This section will discuss some results that relate to what happens if we maximize the "wrong" log-likelihood function, and for those cases in which the estimator is consistent despite this, how to compute an appropriate asymptotic covariance matrix for it.²⁵

    Let $f(y_i \mid x_i, \beta)$ be the true probability density for a random variable $y_i$ given a set of covariates $x_i$ and parameter vector $\beta$. The log-likelihood function is $(1/n)\log L(\beta \mid y, X) = (1/n)\sum_{i=1}^{n}\log f(y_i \mid x_i, \beta)$. The MLE, $\hat\beta_{ML}$, is the sample statistic that maximizes this function. (The division of log L by n does not affect the solution.) We maximize the log-likelihood function by equating its derivatives to zero, so the MLE is obtained by solving the set of empirical moment equations

    $$\frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log f(y_i \mid x_i, \hat\beta_{ML})}{\partial \hat\beta_{ML}} = \frac{1}{n}\sum_{i=1}^{n} d_i(\hat\beta_{ML}) = \bar d(\hat\beta_{ML}) = 0.$$

    The population counterpart to the sample moment equation is

    $$E\left[\frac{1}{n}\frac{\partial \log L}{\partial \beta}\right] = E\left[\frac{1}{n}\sum_{i=1}^{n} d_i(\beta)\right] = E[\bar d(\beta)] = 0.$$

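    Using what we know about GMM estimators, if $E[\bar d(\beta)] = 0$, then $\hat\beta_{ML}$ is consistent and asymptotically normally distributed, with asymptotic covariance matrix equal to

    $$V_{ML} = [G(\beta)'G(\beta)]^{-1}G(\beta)'\{\text{Var}[\bar d(\beta)]\}G(\beta)[G(\beta)'G(\beta)]^{-1},$$

    where $G(\beta)$ = plim $\partial \bar d(\beta)/\partial \beta'$. Since $\bar d(\beta)$ is the derivative vector, $G(\beta)$ is 1/n times the expected Hessian of log L; that is, $(1/n)E[H(\beta)] = \bar H(\beta)$. As we saw earlier, Var[$\partial \log L/\partial \beta$] = $-E[H(\beta)]$. Collecting all seven appearances of $(1/n)E[H(\beta)]$, we obtain the familiar result $V_{ML} = \{-E[H(\beta)]\}^{-1}$. [All the ns cancel and Var[$\bar d$] = $-(1/n)\bar H(\beta)$.] Note that this result depends crucially on the result Var[$\partial \log L/\partial \beta$] = $-E[H(\beta)]$.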
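The cancellation can be written out in three lines. The following is a sketch of the algebra under the assumptions already stated in this section, namely that $\bar H(\beta)$ is symmetric and nonsingular and that the information matrix equality holds:

```latex
\begin{aligned}
G(\beta) &= \bar H(\beta) = \tfrac{1}{n}E[H(\beta)], \\
\operatorname{Var}[\bar d(\beta)] &= \tfrac{1}{n^2}\operatorname{Var}\!\left[\frac{\partial \log L}{\partial \beta}\right]
  = -\tfrac{1}{n^2}E[H(\beta)] = -\tfrac{1}{n}\bar H(\beta), \\
V_{ML} &= [\bar H'\bar H]^{-1}\bar H'\left\{-\tfrac{1}{n}\bar H\right\}\bar H[\bar H'\bar H]^{-1}
  = \bar H^{-1}\left\{-\tfrac{1}{n}\bar H\right\}\bar H^{-1}
  = -\tfrac{1}{n}\bar H^{-1} = \{-E[H(\beta)]\}^{-1}.
\end{aligned}
```

If the second line fails, as it may when the density is misspecified, the middle term no longer cancels against the outer ones, which is the case taken up in the rest of this section.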
    25 The following will sketch a set of results related to this estimation problem. The important references on this subject are White (1982a); Gourieroux, Monfort, and Trognon (1984); Huber (1967); and Amemiya (1985). A recent work with a large amount of discussion on the subject is Mittelhammer et al. (2000). The derivations in these works are complex, and we will only attempt to provide an intuitive introduction to the topic.


    The maximum likelihood estimator is obtained by maximizing the function $\bar h_n(y, X, \beta) = (1/n)\sum_{i=1}^{n}\log f(y_i, x_i, \beta)$. This function converges to its expectation as $n \to \infty$. Since this function is the log-likelihood for the sample, it is also the case (not proven here) that as $n \to \infty$, it attains its unique maximum at the true parameter vector, $\beta$. (We used this result in proving the consistency of the maximum likelihood estimator.) Since plim $\bar h_n(y, X, \beta) = E[\bar h_n(y, X, \beta)]$, it follows (by interchanging differentiation and the expectation operation) that plim $\partial \bar h_n(y, X, \beta)/\partial \beta = E[\partial \bar h_n(y, X, \beta)/\partial \beta]$. But, if this function achieves its maximum at $\beta$, then it must be the case that plim $\partial \bar h_n(y, X, \beta)/\partial \beta = 0$.

    An estimator that is obtained by maximizing a criterion function is called an M estimator [Huber (1967)] or an extremum estimator [Amemiya (1985)]. Suppose that we obtain an estimator by maximizing some other function, $M_n(y, X, \beta)$, that, although not the log-likelihood function, also attains its unique maximum at the true $\beta$ as $n \to \infty$. Then the preceding argument might produce a consistent estimator with a known asymptotic distribution. For example, the log-likelihood for a linear regression model with normally distributed disturbances with different variances, $\sigma^2\omega_i$, is

    $$\bar h_n(y, X, \beta) = \frac{1}{n}\sum_{i=1}^{n}\left\{-\frac{1}{2}\left[\log(2\pi\sigma^2\omega_i) + \frac{(y_i - x_i'\beta)^2}{\sigma^2\omega_i}\right]\right\}.$$



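    By maximizing this function, we obtain the maximum likelihood estimator. But we also examined another estimator, simple least squares, which maximizes $M_n(y, X, \beta) = -(1/n)\sum_{i=1}^{n}(y_i - x_i'\beta)^2$. As we showed earlier, least squares is consistent and asymptotically normally distributed even with this extension, so it qualifies as an M estimator of the sort we are considering here.

    Now consider the general case. Suppose that we estimate $\beta$ by maximizing a criterion function

    $$M_n(y \mid X, \beta) = \frac{1}{n}\sum_{i=1}^{n}\log g(y_i \mid x_i, \beta).$$

    Suppose as well that plim $M_n(y, X, \beta) = E[M_n(y, X, \beta)]$ and that as $n \to \infty$, $E[M_n(y, X, \beta)]$ attains its unique maximum at $\beta$. Then, by the argument we used above for the MLE, plim $\partial M_n(y, X, \beta)/\partial \beta = E[\partial M_n(y, X, \beta)/\partial \beta] = 0$. Once again, we have a set of moment equations for estimation. Let $\hat\beta_E$ be the estimator that maximizes $M_n(y, X, \beta)$. Then the estimator is defined by

    $$\frac{\partial M_n(y, X, \hat\beta_E)}{\partial \hat\beta_E} = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log g(y_i \mid x_i, \hat\beta_E)}{\partial \hat\beta_E} = \bar m(\hat\beta_E) = 0.$$

    Thus, $\hat\beta_E$ is a GMM estimator. Using the notation of our earlier discussion, $G(\hat\beta_E)$ is the symmetric Hessian of $E[M_n(y, X, \beta)]$, which we will denote $(1/n)E[H_M(\hat\beta_E)] = \bar H_M(\hat\beta_E)$. Proceeding as we did above to obtain $V_{ML}$, we find that the appropriate asymptotic covariance matrix for the extremum estimator would be

    $$V_E = [\bar H_M(\beta)]^{-1}\left(\frac{1}{n}\Phi\right)[\bar H_M(\beta)]^{-1},$$

    where $\Phi$ = Var[$\partial \log g(y_i \mid x_i, \beta)/\partial \beta$], and, as before, the asymptotic distribution is normal.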


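    The Hessian in $V_E$ can easily be estimated by using its empirical counterpart,

    $$\text{Est.}[\bar H_M(\hat\beta_E)] = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log g(y_i \mid x_i, \hat\beta_E)}{\partial \hat\beta_E\,\partial \hat\beta_E'}.$$

    But $\Phi$ remains to be specified, and it is unlikely that we would know what function to use. The important difference is that in this case, the variance of the first derivatives vector need not equal the Hessian, so $V_E$ does not simplify. We can, however, consistently estimate $\Phi$ by using the sample variance of the first derivatives,

    $$\hat\Phi = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{\partial \log g(y_i \mid x_i, \hat\beta)}{\partial \hat\beta}\right]\left[\frac{\partial \log g(y_i \mid x_i, \hat\beta)}{\partial \hat\beta'}\right].$$

    If this were the maximum likelihood estimator, then $\hat\Phi$ would be the BHHH estimator that we have used at several points. For example, for the least squares estimator in the heteroscedastic linear regression model, the criterion is $M_n(y, X, \beta) = -(1/n)\sum_{i=1}^{n}(y_i - x_i'\beta)^2$, the solution is b, $G(b) = (-2/n)X'X$, and

    $$\hat\Phi = \frac{1}{n}\sum_{i=1}^{n}[2x_i(y_i - x_i'b)][2x_i(y_i - x_i'b)]' = \frac{4}{n}\sum_{i=1}^{n} e_i^2 x_i x_i'.$$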

    Collecting terms the 4s cancel and we are left precisely with the White estimator of 11 13 At this point we consider the motivation for all this weighty theory One disadvantage of maximum likelihood estimation is its requirement that the density of the observed random variable s be fully speci ed The preceding discussion suggests that in some situations we can make somewhat fewer assumptions about the distribution than a full speci cation would require The extremum estimator is robust to some kinds of speci cation errors One useful result to emerge from this derivation is an estimator for the asymptotic covariance matrix of the extremum estimator that is robust at least to some misspeci cation In particular if we obtain E by maximizing a criterion function that satis es the other assumptions then the appropriate estimator of the asymptotic covariance matrix is 1 Est V E H E 1 E H E 1 n
    1 If E is the true MLE then V E simpli es to H E In the current literature this estimator has been called the sandwich estimator There is a trend in the current literature to compute this estimator routinely regardless of the likelihood function It is worth noting that if the log likelihood is not speci ed correctly then the parameter estimators are likely to be inconsistent save for the cases such as those noted below so robust estimation of the asymptotic covariance matrix may be misdirected effort But if the likelihood function is correct then the sandwich estimator is unnecessary This method is not a general patch for misspeci ed models Not every likelihood function quali es as a consistent extremum estimator for the parameters of interest in the model One might wonder at this point how likely it is that the conditions needed for all this to work will be met There are applications in the literature in which this machinery has been used that probably do not meet these conditions such as the tobit model of Chapter 22 We have seen one important case Least squares in the generalized


    regression model passes the test. Another important application is models of "individual heterogeneity" in cross-section data. Evidence suggests that simple models often overlook unobserved sources of variation across individuals in cross sections, such as unmeasurable "family effects" in studies of earnings or employment. Suppose that the correct model for a variable is $h(y_i \mid x_i, v_i, \beta, \theta)$, where $v_i$ is a random term that is not observed and $\theta$ is a parameter of the distribution of $v$. The correct log-likelihood function is

    $$\sum_i \log f(y_i \mid x_i, \beta, \theta) = \sum_i \log \int_v h(y_i \mid x_i, v_i, \beta, \theta)\, f(v_i)\, dv_i.$$

    Suppose that we maximize some other pseudo-log-likelihood function, $\sum_i \log g(y_i \mid x_i, \beta)$, and then use the sandwich estimator to estimate the asymptotic covariance matrix of $\hat{\beta}$. Does this produce a consistent estimator of the true parameter vector? Surprisingly, sometimes it does, even though it has ignored the nuisance parameter, $\theta$. We saw one case, using OLS in the GR model with heteroscedastic disturbances. Inappropriately fitting a Poisson model when the negative binomial model is correct (see Section 21.9.3) is another case. For some specifications, using the wrong likelihood function in the probit model with proportions data (Section 21.4.6) is a third. [These two examples are suggested, with several others, by Gourieroux, Monfort, and Trognon (1984).] We do emphasize once again that the sandwich estimator, in and of itself, is not necessarily of any virtue if the likelihood function is misspecified and the other conditions for the M estimator are not met.
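    The sandwich formula can be made concrete with a short numerical sketch. The function names below are hypothetical, and the Poisson pseudo-likelihood with conditional mean $\exp(x_i'\beta)$ is an assumed illustration, not a calculation taken from the text. The code averages the Hessian and the outer product of the scores and then forms $(1/n)\bar{H}^{-1}\hat{\Phi}\bar{H}^{-1}$.

```python
import numpy as np

def sandwich_covariance(hessian_bar, opg_bar, n):
    """Return (1/n) * H^{-1} Phi H^{-1} for an averaged Hessian H
    and an averaged outer product of the scores Phi."""
    h_inv = np.linalg.inv(hessian_bar)
    return h_inv @ opg_bar @ h_inv / n

def poisson_pseudo_mle_pieces(y, X, beta):
    """Averaged Hessian and score outer product for a Poisson
    log likelihood with mean exp(x'beta) (illustrative assumption)."""
    mu = np.exp(X @ beta)
    scores = (y - mu)[:, None] * X            # one score row per observation
    hessian_bar = -(X * mu[:, None]).T @ X / len(y)
    opg_bar = scores.T @ scores / len(y)
    return hessian_bar, opg_bar
```

    When the information matrix equality holds, so that $\hat{\Phi} = -\bar{H}$, the function returns $(1/n)(-\bar{H})^{-1}$, which is the usual MLE covariance estimator described above.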

    17.10 SUMMARY AND CONCLUSIONS

    This chapter has presented the theory and several applications of maximum likelihood estimation, which is the most frequently used estimation technique in econometrics after least squares. The maximum likelihood estimators are consistent, asymptotically normally distributed, and efficient among estimators that have these properties. The drawback to the technique is that it requires a fully parametric, detailed specification of the data generating process. As such, it is vulnerable to misspecification problems. The next chapter considers GMM estimation techniques, which are less parametric but more robust to variation in the underlying data generating process.

    Key Terms and Concepts
    Asymptotic efficiency, Asymptotic normality, Asymptotic variance, BHHH estimator, Box-Cox model, Concentrated log likelihood, Conditional moment restrictions, Consistency, Cramer-Rao lower bound, Efficient score, Estimable parameters, Full information maximum likelihood, Identification, Information matrix, Information matrix equality, Invariance, Jacobian, Lagrange multiplier test, Likelihood equation, Likelihood function, Likelihood inequality, Likelihood ratio test, Limited information maximum likelihood, Maximum likelihood estimator, Nonlinear least squares, Outer product of gradients estimator, Regularity conditions, Score test, Stochastic frontier, Two step maximum likelihood, Wald statistic, Wald test


    Exercises

    1. Assume that the distribution of $x$ is $f(x) = 1/\theta$, $0 \le x \le \theta$. In random sampling from this distribution, prove that the sample maximum is a consistent estimator of $\theta$. Note: You can prove that the maximum is the maximum likelihood estimator of $\theta$. But the usual properties do not apply here. Why not? [Hint: Attempt to verify that the expected first derivative of the log-likelihood with respect to $\theta$ is zero.]

    2. In random sampling from the exponential distribution $f(x) = (1/\theta)e^{-x/\theta}$, $x \ge 0$, $\theta > 0$, find the maximum likelihood estimator of $\theta$ and obtain the asymptotic distribution of this estimator.

    3. Mixture distribution. Suppose that the joint distribution of the two random variables $x$ and $y$ is

    $$f(x, y) = \frac{\theta e^{-(\beta+\theta)y}(\beta y)^x}{x!}, \quad \beta, \theta > 0,\; y \ge 0,\; x = 0, 1, 2, \ldots$$

    a. Find the maximum likelihood estimators of $\beta$ and $\theta$ and their asymptotic joint distribution.
    b. Find the maximum likelihood estimator of $\theta/(\beta + \theta)$ and its asymptotic distribution.
    c. Prove that $f(x)$ is of the form $f(x) = \gamma(1-\gamma)^x$, $x = 0, 1, 2, \ldots$, and find the maximum likelihood estimator of $\gamma$ and its asymptotic distribution.
    d. Prove that $f(y \mid x)$ is of the form $f(y \mid x) = \lambda e^{-\lambda y}(\lambda y)^x / x!$, $y \ge 0$, $\lambda > 0$. Prove that $f(y \mid x)$ integrates to 1. Find the maximum likelihood estimator of $\lambda$ and its asymptotic distribution. (Hint: In the conditional distribution, just carry the $x$s along as constants.)
    e. Prove that $f(y) = \theta e^{-\theta y}$, $y \ge 0$, $\theta > 0$. Find the maximum likelihood estimator of $\theta$ and its asymptotic variance.
    f. Prove that $f(x \mid y) = e^{-\beta y}(\beta y)^x / x!$, $x = 0, 1, 2, \ldots$, $\beta > 0$. Based on this distribution, what is the maximum likelihood estimator of $\beta$?

    4. Suppose that $x$ has the Weibull distribution

    $$f(x) = \alpha\beta x^{\beta-1} e^{-\alpha x^{\beta}}, \quad x \ge 0,\; \alpha, \beta > 0.$$

    a. Obtain the log-likelihood function for a random sample of $n$ observations.
    b. Obtain the likelihood equations for maximum likelihood estimation of $\alpha$ and $\beta$. Note that the first provides an explicit solution for $\alpha$ in terms of the data and $\beta$. But, after inserting this in the second, we obtain only an implicit solution for $\beta$. How would you obtain the maximum likelihood estimators?


    c. Obtain the second derivatives matrix of the log-likelihood with respect to $\alpha$ and $\beta$. The exact expectations of the elements involving $\beta$ involve the derivatives of the gamma function and are quite messy analytically. Of course, your exact result provides an empirical estimator. How would you estimate the asymptotic covariance matrix for your estimators in Part b?
    d. Prove that $\alpha\beta\,\text{Cov}[\ln x, x^{\beta}] = 1$. (Hint: The expected first derivatives of the log-likelihood function are zero.)

    5. The following data were generated by the Weibull distribution of Exercise 4:

    1.3043   1.0878   0.33453  0.49254  1.9461
    1.1227   1.2742   0.47615  2.0296   1.4019
    3.6454   1.2797   0.32556  0.15344  0.96080
    0.29965  1.2357   2.0070   0.26423  0.96381

    a. Obtain the maximum likelihood estimates of $\alpha$ and $\beta$, and estimate the asymptotic covariance matrix for the estimates.
    b. Carry out a Wald test of the hypothesis that $\beta = 1$.
    c. Obtain the maximum likelihood estimate of $\alpha$ under the hypothesis that $\beta = 1$.
    d. Using the results of Parts a and c, carry out a likelihood ratio test of the hypothesis that $\beta = 1$.
    e. Carry out a Lagrange multiplier test of the hypothesis that $\beta = 1$.

    6. (Limited Information Maximum Likelihood Estimation). Consider a bivariate distribution for $x$ and $y$ that is a function of two parameters, $\alpha$ and $\beta$. The joint density is $f(x, y \mid \alpha, \beta)$. We consider maximum likelihood estimation of the two parameters. The full information maximum likelihood estimator is the now familiar maximum likelihood estimator of the two parameters. Now, suppose that we can factor the joint distribution as done in Exercise 3, but in this case, we have $f(x, y \mid \alpha, \beta) = f(y \mid x, \alpha, \beta)\, f(x \mid \alpha)$. That is, the conditional density for $y$ is a function of both parameters, but the marginal distribution for $x$ involves only $\alpha$.

    a. Write down the general form for the log-likelihood function using the joint density.
    b. Since the joint density equals the product of the conditional times the marginal, the log-likelihood function can be written equivalently in terms of the factored density. Write this down, in general terms.
    c. The parameter $\alpha$ can be estimated by itself using only the data on $x$ and the log-likelihood formed using the marginal density for $x$. It can also be estimated with $\beta$ by using the full log-likelihood function and data on both $y$ and $x$. Show this.
    d. Show that the first estimator in Part c has a larger asymptotic variance than the second one. This is the difference between a limited information maximum likelihood estimator and a full information maximum likelihood estimator.
    e. Show that if $\partial^2 \ln f(y \mid x, \alpha, \beta)/\partial\alpha\,\partial\beta = 0$, then the result in Part d is no longer true.

    7. Show that the likelihood inequality in Theorem 17.3 holds for the Poisson distribution used in Section 17.3 by showing that $E[(1/n)\ln L(\theta \mid y)]$ is uniquely maximized at $\theta = \theta_0$. Hint: First show that the expectation is $-\theta + \theta_0 \ln\theta - E_0[\ln y_i!]$.

    8. Show that the likelihood inequality in Theorem 17.3 holds for the normal distribution.

    9. For random sampling from the classical regression model in (17-3), reparameterize the likelihood function in terms of $\eta = 1/\sigma$ and $\delta = (1/\sigma)\beta$. Find the maximum


    likelihood estimators of $\eta$ and $\delta$ and obtain the asymptotic covariance matrix of the estimators of these parameters.

    10. Section 14.3.1 presents estimates of a Cobb-Douglas cost function using Nerlove's 1955 data on the U.S. electric power industry. Christensen and Greene's 1976 update of this study used 1970 data for this industry. The Christensen and Greene data are given in Table F5.2. These data have provided a standard test data set for estimating different forms of production and cost functions, including the stochastic frontier model examined in Example 17.5. It has been suggested that one explanation for the apparent finding of economies of scale in these data is that the smaller firms were inefficient for other reasons. The stochastic frontier might allow one to disentangle these effects. Use these data to fit a frontier cost function which includes a quadratic term in log output in addition to the linear term and the factor prices. Then examine the estimated Jondrow et al. residuals to see if they do indeed vary negatively with output, as suggested. (This will require either some programming on your part or specialized software. The stochastic frontier model is provided as an option in TSP and LIMDEP. Or, the likelihood function can be programmed fairly easily for RATS or GAUSS.) Note: for a cost frontier as opposed to a production frontier, it is necessary to reverse the sign on the argument in the $\Phi$ function.

    11. Consider sampling from a multivariate normal distribution with mean vector $\mu = (\mu_1, \mu_2, \ldots, \mu_M)$ and covariance matrix $\sigma^2 I$. The log-likelihood function is

    $$\ln L = -\frac{nM}{2}\ln(2\pi) - \frac{nM}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)'(y_i - \mu).$$

    Show that the maximum likelihood estimates of the parameters are $\hat{\mu}_m = \bar{y}_m$ and

    $$\hat{\sigma}^2_{ML} = \frac{\sum_{i=1}^{n}\sum_{m=1}^{M}(y_{im} - \bar{y}_m)^2}{nM} = \frac{1}{M}\sum_{m=1}^{M}\left[\frac{1}{n}\sum_{i=1}^{n}(y_{im} - \bar{y}_m)^2\right] = \frac{1}{M}\sum_{m=1}^{M}\hat{\sigma}^2_m.$$

    Derive the second derivatives matrix and show that the asymptotic covariance matrix for the maximum likelihood estimators is

    $$\left\{-E\left[\frac{\partial^2 \ln L}{\partial\theta\,\partial\theta'}\right]\right\}^{-1} = \begin{bmatrix} \sigma^2 I/n & 0 \\ 0' & 2\sigma^4/(nM) \end{bmatrix}.$$

    Suppose that we wished to test the hypothesis that the means of the $M$ distributions were all equal to a particular value $\mu^0$. Show that the Wald statistic would be

    $$W = (\bar{y} - \mu^0 i)'\left(\frac{\hat{\sigma}^2}{n} I\right)^{-1}(\bar{y} - \mu^0 i) = \frac{n}{s^2}(\bar{y} - \mu^0 i)'(\bar{y} - \mu^0 i),$$

    where $\bar{y}$ is the vector of sample means.


    18 THE GENERALIZED METHOD OF MOMENTS
    18.1 INTRODUCTION

    The maximum likelihood estimator is fully efficient among consistent and asymptotically normally distributed estimators, in the context of the specified parametric model. The possible shortcoming in this result is that to attain that efficiency, it is necessary to make possibly strong, restrictive assumptions about the distribution, or data generating process. The generalized method of moments (GMM) estimators discussed in this chapter move away from parametric assumptions, toward estimators which are robust to some variations in the underlying data generating process. This chapter will present a number of fairly general results on parameter estimation. We begin with perhaps the oldest formalized theory of estimation, the classical theory of the method of moments. This body of results dates to the pioneering work of Fisher (1925). The use of sample moments as the building blocks of estimating equations is fundamental in econometrics. GMM is an extension of this technique which, as will be clear shortly, encompasses nearly all the familiar estimators discussed in this book. Section 18.2 will introduce the estimation framework with the method of moments. Formalities of the GMM estimator are presented in Section 18.3. Section 18.4 discusses hypothesis testing based on moment equations. A major application, dynamic panel data models, is described in Section 18.5.
    Example 18.1 Euler Equations and Life Cycle Consumption

    One of the most often cited applications of the GMM principle for estimating econometric models is Hall's (1978) permanent income model of consumption. The original form of the model (with some small changes in notation) posits a hypothesis about the optimizing behavior of a consumer over the life cycle. Consumers are hypothesized to act according to the model:

    $$\text{Maximize } E_t\left[\sum_{\tau=0}^{T-t}\left(\frac{1}{1+\delta}\right)^{\tau} U(c_{t+\tau}) \,\Big|\, \Omega_t\right] \quad \text{subject to} \quad \sum_{\tau=0}^{T-t}\left(\frac{1}{1+r}\right)^{\tau}(c_{t+\tau} - w_{t+\tau}) = A_t.$$

    The information available at time $t$ is denoted $\Omega_t$, so that $E_t$ denotes the expectation formed at time $t$ based on information set $\Omega_t$. The maximand is the expected discounted stream of future consumption from time $t$ until the end of life at time $T$. The individual's subjective rate of time preference is $\beta = 1/(1+\delta)$. The real rate of interest, $r \ge \delta$, is assumed to be constant. The utility function $U(c_t)$ is assumed to be strictly concave and time separable (as shown in the model). One period's consumption is $c_t$. The intertemporal budget constraint states that the present discounted excess of $c_t$ over earnings, $w_t$, over the lifetime equals total assets $A_t$ not including human capital. In this model, it is claimed that the only source of uncertainty is $w_t$. No assumption is made about the stochastic properties of $w_t$ except that there exists an expected future earnings, $E_t[w_{t+\tau} \mid \Omega_t]$. Successive values are not assumed to be independent and $w_t$ is not assumed to be stationary.


    Hall's major "theorem" in the paper is the solution to the optimization problem, which states

    $$E_t[U'(c_{t+1}) \mid \Omega_t] = \frac{1+\delta}{1+r}\, U'(c_t).$$

    For our purposes, the major conclusion of the paper is "Corollary 1," which states "No information available in time $t$ apart from the level of consumption, $c_t$, helps predict future consumption, $c_{t+1}$, in the sense of affecting the expected value of marginal utility. In particular, income or wealth in periods $t$ or earlier are irrelevant once $c_t$ is known." We can use this as the basis of a model that can be placed in the GMM framework. In order to proceed, it is necessary to assume a form of the utility function. A common (convenient) form of the utility function is $U(c_t) = c_t^{1-\alpha}/(1-\alpha)$, which is monotonic, $U' = c_t^{-\alpha} > 0$, and concave, $U''/U' = -\alpha/c_t < 0$. Inserting this form into the solution, rearranging the terms, and reparameterizing it for convenience, we have

    $$E_t\left[(1+r)\left(\frac{1}{1+\delta}\right)\left(\frac{c_{t+1}}{c_t}\right)^{-\alpha} - 1 \,\Big|\, \Omega_t\right] = E_t\left[\beta(1+r)R_{t+1}^{\lambda} - 1 \,\Big|\, \Omega_t\right] = 0,$$

    where $R_{t+1} = c_{t+1}/c_t$ and $\lambda = -\alpha$. Hall assumed that $r$ was constant over time. Other applications of this modeling framework [e.g., Hansen and Singleton (1982)] have modified the framework so as to involve a forecasted interest rate, $r_{t+1}$. How one proceeds from here depends on what is in the information set. The unconditional mean does not identify the two parameters. The corollary states that the only relevant information in the information set is $c_t$. Given the form of the model, the more natural instrument might be $R_t$. This assumption exactly identifies the two parameters in the model:

    $$E_t\left[\left(\beta(1+r_{t+1})R_{t+1}^{\lambda} - 1\right)\begin{pmatrix} 1 \\ R_t \end{pmatrix}\right] = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.$$

    As stated, the model has no testable implications. These two moment equations would exactly identify the two unknown parameters. Hall hypothesized several models involving income and consumption which would overidentify and thus place restrictions on the model.
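    A minimal sketch of the two sample moment conditions may help fix ideas. The function name, the use of a single constant interest rate `r`, and the alignment of the consumption series are all assumptions made for illustration; they are not taken from Hall's paper or from the text. The code computes the sample analogs of the two orthogonality conditions, using a constant and $R_t$ as instruments.

```python
import numpy as np

def euler_moments(beta, lam, consumption, r):
    """Sample analogs of E[(beta*(1+r)*R_{t+1}**lam - 1) * (1, R_t)'] = 0,
    where R_t = c_t / c_{t-1}. All argument names are hypothetical."""
    c = np.asarray(consumption, dtype=float)
    R = c[1:] / c[:-1]               # consumption growth ratios
    R_t, R_next = R[:-1], R[1:]      # align the instrument R_t with R_{t+1}
    resid = beta * (1.0 + r) * R_next**lam - 1.0
    return np.array([resid.mean(), (resid * R_t).mean()])
```

    An exactly identified estimator would choose `beta` and `lam` so that both elements of the returned vector are zero. For example, with constant consumption and `beta = 1/(1 + r)`, both moments vanish for any `lam`.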

    18.2 CONSISTENT ESTIMATION: THE METHOD OF MOMENTS

    Sample statistics such as the mean and variance can be treated as simple descriptive measures. In our discussion of estimation in Appendix C, however, we argued that, in general, sample statistics each have a counterpart in the population, for example, the correspondence between the sample mean and the population expected value. The natural (perhaps obvious) next step in the analysis is to use this analogy to justify using the sample "moments" as estimators of these population parameters. What remains to establish is whether this approach is the best, or even a good way to use the sample data to infer the characteristics of the population. The basis of the method of moments is as follows: In random sampling, under generally benign assumptions, a sample statistic will converge in probability to some constant. For example, with i.i.d. random sampling, $\bar{m}'_2 = (1/n)\sum_{i=1}^{n} y_i^2$ will converge in mean square to the variance plus the square of the mean of the distribution of $y_i$. This constant will, in turn, be a function of the unknown parameters of the distribution. To estimate $K$ parameters, $\theta_1, \ldots, \theta_K$, we can compute $K$ such statistics, $\bar{m}_1, \ldots, \bar{m}_K$, whose probability limits are known functions of the parameters. These $K$ moments are equated


    to the $K$ functions, and the functions are inverted to express the parameters as functions of the moments. The moments will be consistent by virtue of a law of large numbers (Theorems D.4-D.9). They will be asymptotically normally distributed by virtue of the Lindeberg-Levy Central Limit Theorem (D.18). The derived parameter estimators will inherit consistency by virtue of the Slutsky Theorem (D.12) and asymptotic normality by virtue of the delta method (Theorem D.21). This section will develop this technique in some detail, partly to present it in its own right and partly as a prelude to the discussion of the generalized method of moments, or GMM, estimation technique, which is treated in Section 18.3.
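    18.2.1 RANDOM SAMPLING AND ESTIMATING THE PARAMETERS OF DISTRIBUTIONS

    Consider independent, identically distributed random sampling from a distribution $f(y \mid \theta_1, \ldots, \theta_K)$ with finite moments up to $E[y^{2K}]$. The sample consists of $n$ observations, $y_1, \ldots, y_n$. The $k$th "raw" or uncentered moment is

    $$\bar{m}'_k = \frac{1}{n}\sum_{i=1}^{n} y_i^k.$$

    By Theorem D.1,

    $$E[\bar{m}'_k] = \mu'_k = E[y_i^k] \quad \text{and} \quad \text{Var}[\bar{m}'_k] = \frac{1}{n}\text{Var}[y_i^k] = \frac{1}{n}\left(\mu'_{2k} - \mu'^{\,2}_k\right).$$

    By convention, $\mu'_1 = E[y_i] = \mu$. By the Khinchine Theorem, D.5, $\text{plim}\, \bar{m}'_k = \mu'_k = E[y_i^k]$. Finally, by the Lindeberg-Levy Central Limit Theorem,

    $$\sqrt{n}\,(\bar{m}'_k - \mu'_k) \xrightarrow{d} N[0, \mu'_{2k} - \mu'^{\,2}_k].$$

    In general, $\mu'_k$ will be a function of the underlying parameters. By computing $K$ raw moments and equating them to these functions, we obtain $K$ equations that can (in principle) be solved to provide estimates of the $K$ unknown parameters.

    Example 18.2 Method of Moments Estimator for $N[\mu, \sigma^2]$

    In random sampling from $N[\mu, \sigma^2]$,

    $$\text{plim}\, \frac{1}{n}\sum_{i=1}^{n} y_i = \text{plim}\, \bar{m}'_1 = E[y_i] = \mu$$

    and

    $$\text{plim}\, \frac{1}{n}\sum_{i=1}^{n} y_i^2 = \text{plim}\, \bar{m}'_2 = \text{Var}[y_i] + \mu^2 = \sigma^2 + \mu^2.$$

    Equating the right- and left-hand sides of the probability limits gives moment estimators $\hat{\mu} = \bar{m}'_1 = \bar{y}$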


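    and

    $$\hat{\sigma}^2 = \bar{m}'_2 - \bar{m}'^{\,2}_1 = \frac{1}{n}\sum_{i=1}^{n} y_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} y_i\right)^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2.$$

    Note that $\hat{\sigma}^2$ is biased, although both estimators are consistent.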
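    The two moment equations of Example 18.2 can be written out in a few lines. The sketch below uses a hypothetical function name and computes the first two raw moments and then inverts them for the mean and the variance.

```python
import numpy as np

def normal_method_of_moments(y):
    """Method of moments estimates (mu_hat, sigma2_hat) from the first
    two raw sample moments, as in Example 18.2."""
    y = np.asarray(y, dtype=float)
    m1 = y.mean()            # first raw moment
    m2 = (y**2).mean()       # second raw moment
    return m1, m2 - m1**2    # sigma2_hat uses divisor n, so it is biased
```

    The variance estimate coincides with the sample variance computed with divisor $n$ rather than $n - 1$, which is the source of the bias noted above.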

    Although the moments based on powers of $y$ provide a natural source of information about the parameters, other functions of the data may also be useful. Let $m_k(\cdot)$ be a continuous and differentiable function not involving the sample size $n$, and let

    $$\bar{m}_k = \frac{1}{n}\sum_{i=1}^{n} m_k(y_i), \quad k = 1, 2, \ldots, K.$$

    These are also "moments" of the data. It follows from Theorem D.4 and the corollary, (D-5), that

    $$\text{plim}\, \bar{m}_k = E[m_k(y_i)] = \mu_k(\theta_1, \ldots, \theta_K).$$

    We assume that $\mu_k(\cdot)$ involves some of or all the parameters of the distribution. With $K$ parameters to be estimated, the $K$ moment equations,

    $$\bar{m}_1 - \mu_1(\theta_1, \ldots, \theta_K) = 0,$$
    $$\bar{m}_2 - \mu_2(\theta_1, \ldots, \theta_K) = 0,$$
    $$\cdots$$
    $$\bar{m}_K - \mu_K(\theta_1, \ldots, \theta_K) = 0,$$

    provide $K$ equations in $K$ unknowns, $\theta_1, \ldots, \theta_K$. If the equations are continuous and functionally independent, then method of moments estimators can be obtained by solving the system of equations for

    $$\hat{\theta}_k = \hat{\theta}_k[\bar{m}_1, \ldots, \bar{m}_K].$$

    As suggested, there may be more than one set of moments that one can use for estimating the parameters, or there may be more moment equations available than are necessary.
    Example 18.3 Inverse Gaussian (Wald) Distribution

    The inverse Gaussian distribution is used to model survival times, or elapsed times from some beginning time until some kind of transition takes place. The standard form of the density for this random variable is

    $$f(y) = \sqrt{\frac{\lambda}{2\pi y^3}}\, \exp\left[-\frac{\lambda(y - \mu)^2}{2\mu^2 y}\right], \quad y > 0,\; \lambda > 0,\; \mu > 0.$$

    The mean is $\mu$ while the variance is $\mu^3/\lambda$. The efficient maximum likelihood estimators of the two parameters are based on $(1/n)\sum_{i=1}^{n} y_i$ and $(1/n)\sum_{i=1}^{n}(1/y_i)$. Since the mean and variance are simple functions of the underlying parameters, we can also use the sample mean and sample variance as moment estimators of these functions. Thus, an alternative pair of method of moments estimators for the parameters of the Wald distribution can be based on $(1/n)\sum_{i=1}^{n} y_i$ and $(1/n)\sum_{i=1}^{n} y_i^2$. The precise formulas for these two pairs of estimators are left as an exercise.

    Example 18.4 Mixtures of Normal Distributions

    Quandt and Ramsey (1978) analyzed the problem of estimating the parameters of a mixture of normal distributions. Suppose that each observation in a random sample is drawn from one of two different normal distributions. The probability that the observation is drawn from the first distribution, $N[\mu_1, \sigma_1^2]$, is $\lambda$, and the probability that it is drawn from the second is $(1 - \lambda)$. The density for the observed $y$ is

    $$f(y) = \lambda N[\mu_1, \sigma_1^2] + (1-\lambda) N[\mu_2, \sigma_2^2], \quad 0 \le \lambda \le 1,$$
    $$\phantom{f(y)} = \frac{\lambda}{(2\pi\sigma_1^2)^{1/2}}\, e^{-\frac{1}{2}[(y-\mu_1)/\sigma_1]^2} + \frac{1-\lambda}{(2\pi\sigma_2^2)^{1/2}}\, e^{-\frac{1}{2}[(y-\mu_2)/\sigma_2]^2}.$$

    The sample mean and second through fifth central moments,

    $$\bar{m}_k = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^k, \quad k = 2, 3, 4, 5,$$

    provide five equations in five unknowns that can be solved (via a ninth-order polynomial) for consistent estimators of the five parameters. Because $\bar{y}$ converges in probability to $E[y_i] = \mu$, the theorems given earlier for $\bar{m}'_k$ as an estimator of $\mu'_k$ apply as well to $\bar{m}_k$ as an estimator of $\mu_k = E[(y_i - \mu)^k]$. For the mixed normal distribution, the mean and variance are

    $$\mu = E[y_i] = \lambda\mu_1 + (1-\lambda)\mu_2$$

    and

    $$\sigma^2 = \text{Var}[y_i] = \lambda\sigma_1^2 + (1-\lambda)\sigma_2^2 + \lambda(1-\lambda)(\mu_1 - \mu_2)^2,$$

    which suggests how complicated the familiar method of moments is likely to become. An alternative method of estimation proposed by the authors is based on

    $$E[e^{t y_i}] = \lambda e^{t\mu_1 + t^2\sigma_1^2/2} + (1-\lambda) e^{t\mu_2 + t^2\sigma_2^2/2} = \Lambda_t,$$

    where $t$ is any value not necessarily an integer. Quandt and Ramsey (1978) suggest choosing five values of $t$ that are not too close together and using the statistics

    $$\bar{M}_t = \frac{1}{n}\sum_{i=1}^{n} e^{t y_i}$$

    to estimate the parameters. The moment equations are $\bar{M}_t - \Lambda_t(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \lambda) = 0$. They label this procedure the method of moment-generating functions. (See Section B.6 for definition of the moment generating function.)
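    The moment equations of the method of moment-generating functions can be sketched as follows. The function names and the ordering of the parameter vector are assumptions made for this illustration; the code only evaluates the five equations and leaves the choice of a nonlinear solver open.

```python
import numpy as np

def mixture_mgf(t, mu1, mu2, sig1_sq, sig2_sq, lam):
    """Population MGF of the two-component normal mixture at t."""
    first = lam * np.exp(t * mu1 + 0.5 * t**2 * sig1_sq)
    second = (1.0 - lam) * np.exp(t * mu2 + 0.5 * t**2 * sig2_sq)
    return first + second

def empirical_mgf(y, t):
    """Sample statistic (1/n) * sum_i exp(t * y_i)."""
    return np.mean(np.exp(t * np.asarray(y, dtype=float)))

def mgf_moment_equations(theta, y, t_values):
    """Sample moment minus population moment, one entry per value of t.
    theta = (mu1, mu2, sig1_sq, sig2_sq, lam) is an assumed ordering."""
    mu1, mu2, sig1_sq, sig2_sq, lam = theta
    return np.array([
        empirical_mgf(y, t) - mixture_mgf(t, mu1, mu2, sig1_sq, sig2_sq, lam)
        for t in t_values
    ])
```

    With five distinct values in `t_values`, the returned vector has five elements, and an estimator would set all of them to zero, for instance by passing the function to a root finder.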

    In most cases, method of moments estimators are not efficient. The exception is in random sampling from exponential families of distributions.


    DEFINITION 18.1 Exponential Family
    An exponential (parametric) family of distributions is one whose log-likelihood is of the form

    $$\ln L(\theta \mid \text{data}) = a(\text{data}) + b(\theta) + \sum_{k=1}^{K} c_k(\text{data})\, s_k(\theta),$$

    where $a(\cdot)$, $b(\cdot)$, $c(\cdot)$, and $s(\cdot)$ are functions. The members of the "family" are distinguished by the different parameter values.

    If the log-likelihood function is of this form, then the functions $c_k(\cdot)$ are called sufficient statistics (see footnote 1). When sufficient statistics exist, method of moments estimator(s) can be functions of them. In this case, the method of moments estimators will also be the maximum likelihood estimators, so, of course, they will be efficient, at least asymptotically. We emphasize, in this case, the probability distribution is fully specified. Since the normal distribution is an exponential family with sufficient statistics $\bar{m}'_1$ and $\bar{m}'_2$, the estimators described in Example 18.2 are fully efficient. (They are the maximum likelihood estimators.) The mixed normal distribution is not an exponential family. We leave it as an exercise to show that the Wald distribution in Example 18.3 is an exponential family. You should be able to show that the sufficient statistics are the ones that are suggested in Example 18.3 as the bases for the MLEs of $\mu$ and $\lambda$.
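    Example 18.5 Gamma Distribution

    The gamma distribution (see Section C.4.5) is

    $$f(y) = \frac{\lambda^P}{\Gamma(P)}\, e^{-\lambda y} y^{P-1}, \quad y \ge 0,\; P > 0,\; \lambda > 0.$$

    The log-likelihood function for this distribution is

    $$\frac{1}{n}\ln L = [P\ln\lambda - \ln\Gamma(P)] - \lambda\,\frac{1}{n}\sum_{i=1}^{n} y_i + (P-1)\,\frac{1}{n}\sum_{i=1}^{n}\ln y_i.$$

    This function is an exponential family with $a(\text{data}) = 0$, $b(\theta) = n[P\ln\lambda - \ln\Gamma(P)]$, and two sufficient statistics, $\frac{1}{n}\sum_{i=1}^{n} y_i$ and $\frac{1}{n}\sum_{i=1}^{n}\ln y_i$. The method of moments estimators based on $\frac{1}{n}\sum_{i=1}^{n} y_i$ and $\frac{1}{n}\sum_{i=1}^{n}\ln y_i$ would be the maximum likelihood estimators. But, we also have

    $$\text{plim}\, \frac{1}{n}\sum_{i=1}^{n}\begin{bmatrix} y_i \\ y_i^2 \\ \ln y_i \\ 1/y_i \end{bmatrix} = \begin{bmatrix} P/\lambda \\ P(P+1)/\lambda^2 \\ \Psi(P) - \ln\lambda \\ \lambda/(P-1) \end{bmatrix}.$$

    (The functions $\Gamma(P)$ and $\Psi(P) = d\ln\Gamma(P)/dP$ are discussed in Section E.5.3.) Any two of these can be used to estimate $\lambda$ and $P$.

    Footnote 1: Stuart and Ord (1989, pp. 1-29) give a discussion of sufficient statistics and exponential families of distributions. A result that we will use in Chapter 21 is that if the statistics $c_k(\text{data})$ are sufficient statistics, then the conditional density $f[y_1, \ldots, y_n \mid c_k(\text{data}), k = 1, \ldots, K]$ is not a function of the parameters.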


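    For the income data in Example C.1, the four moments listed above are

    $$(\bar{m}'_1, \bar{m}'_2, \bar{m}'_*, \bar{m}'_{-1}) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i,\; y_i^2,\; \ln y_i,\; \frac{1}{y_i}\right) = [31.278,\; 1453.96,\; 3.22139,\; 0.050014].$$

    The method of moments estimators of $\theta = (P, \lambda)$ based on the six possible pairs of these moments are as follows:

    (P, lambda) estimates   paired with m'_1         paired with m'_2         paired with m'_-1
    m'_2                    2.05682, 0.065759
    m'_-1                   2.77198, 0.0886239       2.60905, 0.0800475
    m'_*                    2.4106,  0.0770702       2.26450, 0.071304        3.03580, 0.1018202

    The maximum likelihood estimates are $\hat{\theta}(\bar{m}'_1, \bar{m}'_*) = (2.4106, 0.0770702)$.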
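    Two of the six pairs have closed-form solutions, which makes them easy to check by hand. The sketch below, with hypothetical function names, inverts the probability limits $P/\lambda$, $P(P+1)/\lambda^2$, and $\lambda/(P-1)$ given above. The pairs involving $\ln y_i$ require the digamma function and an iterative solver, so they are not attempted here.

```python
def gamma_mom_from_m1_m2(m1, m2):
    """Solve P/lam = m1 and P(P+1)/lam**2 = m2 for (P, lam)."""
    variance = m2 - m1**2          # equals P / lam**2
    lam = m1 / variance
    return m1 * lam, lam

def gamma_mom_from_m1_minv(m1, m_inv):
    """Solve P/lam = m1 and lam/(P-1) = m_inv for (P, lam)."""
    ratio = m1 * m_inv             # equals P / (P - 1)
    P = ratio / (ratio - 1.0)
    return P, P / m1

# Moments reported for the income data in the text.
P_a, lam_a = gamma_mom_from_m1_m2(31.278, 1453.96)
P_b, lam_b = gamma_mom_from_m1_minv(31.278, 0.050014)
```

    Using the reported moments, the two functions reproduce the entries (2.05682, 0.065759) and (2.77198, 0.0886239) of the table up to rounding.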
    18.2.2 ASYMPTOTIC PROPERTIES OF THE METHOD OF MOMENTS ESTIMATOR

    In a few cases, we can obtain the exact distribution of the method of moments estimator. For example, in sampling from the normal distribution, $\hat{\mu}$ has mean $\mu$ and variance $\sigma^2/n$ and is normally distributed, while $\hat{\sigma}^2$ has mean $[(n-1)/n]\sigma^2$ and variance $[(n-1)/n]^2\, 2\sigma^4/(n-1)$ and is exactly distributed as a multiple of a chi-squared variate with $(n-1)$ degrees of freedom. If sampling is not from the normal distribution, the exact variance of the sample mean will still be $\text{Var}[y]/n$, whereas an asymptotic variance for the moment estimator of the population variance could be based on the leading term in (D-27), in Example D.10, but the precise distribution may be intractable. There are cases in which no explicit expression is available for the variance of the underlying sample moment. For instance, in Example 18.4, the underlying sample statistic is

    $$\bar{M}_t = \frac{1}{n}\sum_{i=1}^{n} e^{t y_i} = \frac{1}{n}\sum_{i=1}^{n} M_{it}.$$

    The exact variance of $\bar{M}_t$ is known only if $t$ is an integer. But if sampling is random, since $\bar{M}_t$ is a sample mean, we can estimate its variance with $1/n$ times the sample variance of the observations on $M_{it}$. We can also construct an estimator of the covariance of $\bar{M}_t$ and $\bar{M}_s$:

    $$\text{Est.Asy.Cov}[\bar{M}_t, \bar{M}_s] = \frac{1}{n}\left\{\frac{1}{n}\sum_{i=1}^{n}\left[(e^{t y_i} - \bar{M}_t)(e^{s y_i} - \bar{M}_s)\right]\right\}.$$

    In general, when the moments are computed as

    $$\bar{m}_k = \frac{1}{n}\sum_{i=1}^{n} m_k(y_i), \quad k = 1, \ldots, K,$$

    where $y_i$ is an observation on a vector of variables, an appropriate estimator of the asymptotic covariance matrix of $[\bar{m}_1, \ldots, \bar{m}_K]$ can be computed using

    $$\frac{1}{n}F_{jk} = \frac{1}{n}\left\{\frac{1}{n}\sum_{i=1}^{n}\left[(m_j(y_i) - \bar{m}_j)(m_k(y_i) - \bar{m}_k)\right]\right\}, \quad j, k = 1, \ldots, K.$$


    (One might divide the inner sum by $n - 1$ rather than $n$. Asymptotically it is the same.) This estimator provides the asymptotic covariance matrix for the moments used in computing the estimated parameters. Under our assumption of iid random sampling from a distribution with finite moments up to $2K$, $F$ will converge in probability to the appropriate covariance matrix of the normalized vector of moments, $\Phi = \text{Asy.Var}[\sqrt{n}\,\bar{m}_n(\theta)]$. Finally, under our assumptions of random sampling, though the precise distribution is likely to be unknown, we can appeal to the Lindeberg-Levy central limit theorem (D.18) to obtain an asymptotic approximation. To formalize the remainder of this derivation, refer back to the moment equations, which we will now write

    $$\bar{m}_{n,k}(\theta_1, \theta_2, \ldots, \theta_K) = 0, \quad k = 1, \ldots, K.$$

    The subscript $n$ indicates the dependence on a data set of $n$ observations. We have also combined the sample statistic (sum) and function of parameters, $\mu(\theta_1, \ldots, \theta_K)$, in this general form of the moment equation. Let $\bar{G}_n(\theta)$ be the $K \times K$ matrix whose $k$th row is the vector of partial derivatives

    $$\bar{G}'_{n,k} = \frac{\partial \bar{m}_{n,k}}{\partial\theta'}.$$

    Now, expand the set of solved moment equations around the true values of the parameters $\theta_0$ in a linear Taylor series. The linear approximation is

    $$0 \approx [\bar{m}_n(\theta_0)] + \bar{G}_n(\theta_0)(\hat{\theta} - \theta_0).$$

    Therefore,

    $$\sqrt{n}\,(\hat{\theta} - \theta_0) \approx -[\bar{G}_n(\theta_0)]^{-1}\sqrt{n}\,[\bar{m}_n(\theta_0)]. \qquad (18\text{-}1)$$

    (We have treated this as an approximation because we are not dealing formally with the higher order term in the Taylor series. We will make this explicit in the treatment of the GMM estimator below.) The arguments needed to characterize the large sample behavior of the estimator, $\hat{\theta}$, are discussed in Appendix D. We have from Theorem D.18 (the Central Limit Theorem) that $\sqrt{n}\,\bar{m}_n(\theta_0)$ has a limiting normal distribution with mean vector 0 and covariance matrix equal to $\Phi$. Assuming that the functions in the moment equation are continuous and functionally independent, we can expect $\bar{G}_n(\theta_0)$ to converge to a nonsingular matrix of constants, $\Gamma(\theta_0)$. Under general conditions, the limiting distribution of the right-hand side of (18-1) will be that of a linear function of a normally distributed vector. Jumping to the conclusion, we expect the asymptotic distribution of $\hat{\theta}$ to be normal with mean vector $\theta_0$ and covariance matrix $(1/n)\{-[\Gamma(\theta_0)]^{-1}\}\Phi\{-[\Gamma'(\theta_0)]^{-1}\}$. Thus, the asymptotic covariance matrix for the method of moments estimator may be estimated with

    $$\text{Est.Asy.Var}[\hat{\theta}] = \frac{1}{n}\left[\bar{G}'_n(\hat{\theta})\, F^{-1}\, \bar{G}_n(\hat{\theta})\right]^{-1}.$$

    Example 18.5 (Continued)

    Using the estimates $\hat{\theta}(\bar{m}'_1, \bar{m}'_*) = (2.4106, 0.0770702)$,

    $$\hat{G} = \begin{bmatrix} -1/\hat{\lambda} & \hat{P}/\hat{\lambda}^2 \\ -\hat{\Psi}' & 1/\hat{\lambda} \end{bmatrix} = \begin{bmatrix} -12.97515 & 405.8353 \\ -0.51241 & 12.97515 \end{bmatrix}.$$


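    [The function $\Psi'$ is $d^2\ln\Gamma(P)/dP^2 = (\Gamma\Gamma'' - \Gamma'^2)/\Gamma^2$. With $\hat{P} = 2.4106$, $\hat{\Gamma} = 1.250832$, $\hat{\Psi} = 0.658347$, and $\hat{\Psi}' = 0.512408$ (see footnote 2).] The matrix $F$ is the sample covariance matrix of $y$ and $\ln y$ (using 1/19 as the divisor); scaled by $1/n$, it is

    $$\frac{1}{n}F = \begin{bmatrix} 25.034 & 0.7155 \\ 0.7155 & 0.023873 \end{bmatrix}.$$

    The product is

    $$\frac{1}{n}\left[\hat{G}' F^{-1} \hat{G}\right]^{-1} = \begin{bmatrix} 0.38978 & 0.014605 \\ 0.014605 & 0.00068747 \end{bmatrix}.$$

    For the maximum likelihood estimator, the estimate of the asymptotic covariance matrix based on the expected (and actual) Hessian is

    $$\frac{1}{n}[-\bar{H}]^{-1} = \frac{1}{n}\begin{bmatrix} \hat{\Psi}' & -1/\hat{\lambda} \\ -1/\hat{\lambda} & \hat{P}/\hat{\lambda}^2 \end{bmatrix}^{-1} = \begin{bmatrix} 0.51203 & 0.01637 \\ 0.01637 & 0.00064654 \end{bmatrix}.$$

    The Hessian has the same elements as $G$ because we chose to use the sufficient statistics for the moment estimators, so the moment equations that we differentiated are, apart from a sign change, also the derivatives of the log-likelihood. The estimates of the two variances are 0.51203 and 0.00064654, respectively, which agree reasonably well with the estimates above. The difference would be due to sampling variability in a finite sample and the presence of $F$ in the first variance estimator.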
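    The two covariance matrices in the continued example can be recomputed from the reported inputs. This is a sketch under two assumptions that are not stated explicitly in the text: the sample size is taken to be $n = 20$ (inferred from the 1/19 divisor), and the displayed 2 by 2 matrix of variances and covariances is treated as $F/n$, which is the reading that reproduces the reported product. Because $\hat{G}$ is square, $[\hat{G}'F^{-1}\hat{G}]^{-1} = \hat{G}^{-1}F(\hat{G}^{-1})'$.

```python
import numpy as np

n = 20  # assumed sample size, inferred from the 1/19 divisor

# Derivatives of the two moment equations, as reported in the example.
G = np.array([[-12.97515, 405.8353],
              [-0.51241, 12.97515]])

# Reported covariance matrix of y and ln y, read here as F / n (assumption).
F_over_n = np.array([[25.034, 0.7155],
                     [0.7155, 0.023873]])

G_inv = np.linalg.inv(G)
V_mom = G_inv @ F_over_n @ G_inv.T      # (1/n) [G' F^{-1} G]^{-1} for square G

# MLE counterpart: (1/n) times the inverse of the negative average Hessian.
neg_H = np.array([[0.51241, -12.97515],
                  [-12.97515, 405.8353]])
V_mle = np.linalg.inv(neg_H) / n
```

    The diagonal elements of `V_mom` and `V_mle` are the variance estimates compared in the text; small discrepancies from the printed values reflect rounding of the inputs.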
    18.2.3 SUMMARY: THE METHOD OF MOMENTS

    In the simplest cases, the method of moments is robust to differences in the specification of the data generating process. A sample mean or variance estimates its population counterpart (assuming it exists), regardless of the underlying process. It is this freedom from unnecessary distributional assumptions that has made this method so popular in recent years. However, this comes at a cost. If more is known about the DGP, its specific distribution for example, then the method of moments may not make use of all of the available information. Thus, in Example 18.3, the natural estimators of the parameters of the distribution based on the sample mean and variance turn out to be inefficient. The method of maximum likelihood, which remains the foundation of much work in econometrics, is an alternative approach which utilizes this out of sample information and is, therefore, more efficient.

    18.3 THE GENERALIZED METHOD OF MOMENTS (GMM) ESTIMATOR

    A large proportion of the recent empirical work in econometrics, particularly in macroeconomics and finance, has employed GMM estimators. As we shall see, this broad class of estimators, in fact, includes most of the estimators discussed elsewhere in this book. Before continuing, it will be useful for you to read (or reread) the following sections:

    1. Consistent Estimation: The Method of Moments: Section 18.2,
    2. Correlation Between $x_i$ and $\varepsilon_i$: Instrumental Variables Estimation, Section 5.4,

    Footnote 2: $\Psi$ is the digamma function and $\Psi'$ is the trigamma function. Values for $\Gamma(P)$, $\Psi(P)$, and $\Psi'(P)$ are tabulated in Abramowitz and Stegun (1971). The values given were obtained using the IMSL computer program library.

    Greene 50240

    book

    June 26 2002

    15 6

    534

    CHAPTER 18 The Generalized Method of Moments

    3 4 5 6 7 8

    GMM Estimation in the Generalized Regression Model Sections 10 4 11 3 and 12 6 Nonlinear Regression Models Chapter 9 Optimization Section E 5 Robust Estimation of Asymptotic Covariance Matrices Section 10 3 The Wald Test Theorem 6 1 GMM Estimation of Dynamic Panel Data Models Section 13 6

    The GMM estimation technique is an extension of the method of moments technique described in Section 18.2.[3] In the following, we will extend the generalized method of moments to other models beyond the generalized linear regression, and we will fill in some gaps in the derivation in Section 18.2.
    18.3.1 ESTIMATION BASED ON ORTHOGONALITY CONDITIONS

    Estimation by the method of moments proceeds as follows. The model specified for the random variable $y_i$ implies certain expectations, for example

    $$E[y_i] = \mu,$$

    where $\mu$ is the mean of the distribution of $y_i$. Estimation of $\mu$ then proceeds by forming a sample analog to the population expectation:

    $$E[y_i - \mu] = 0.$$

    The sample counterpart to this expectation is the empirical moment equation,

    $$\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat\mu) = 0.$$

    The estimator is the value of $\hat\mu$ that satisfies the sample moment equation. The example given is, of course, a trivial one. Example 18.5 describes a more elaborate case of sampling from a gamma distribution. The moment conditions used for estimation in that example (taken two at a time from a set of four) include

    $$E[y_i - P/\lambda] = 0 \quad\text{and}\quad E[\ln y_i - \Psi(P) + \ln\lambda] = 0.$$

    These two coincide with the terms in the likelihood equations for this model. Inserting the sample data into the sample analogs produces the moment equations for estimation:

    $$\frac{1}{n}\sum_{i=1}^{n}\left[y_i - \hat P/\hat\lambda\right] = 0$$

    and

    $$\frac{1}{n}\sum_{i=1}^{n}\left[\ln y_i - \Psi(\hat P) + \ln\hat\lambda\right] = 0.$$

    [3] Formal presentations of the results required for this analysis are given by Hansen (1982), Hansen and Singleton (1988), Chamberlain (1987), Cumby, Huizinga, and Obstfeld (1983), Newey (1984, 1985a, 1985b), Davidson and MacKinnon (1993), and McFadden and Newey (1994). Useful summaries of GMM estimation and other developments in econometrics are Pagan and Wickens (1989) and Matyas (1999). An application of some of these techniques that contains useful summaries is Pagan and Vella (1989). Some further discussion can be found in Davidson and MacKinnon (1993). Ruud (2000) provides many of the theoretical details. Hayashi (2000) is another extensive treatment of estimation centered on GMM estimators.

    Example 18.6 Orthogonality Conditions

    Assuming that households are forecasting interest rates as well as earnings, Hall's consumption model with the corollary implies the following orthogonality conditions:

    $$E_t\left[\left(\beta(1 + r_{t+1})R_{t+1}^{\lambda} - 1\right) \times \begin{pmatrix} 1 \\ R_t \end{pmatrix}\right] = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.$$

    Now, consider the apparently different case of the least squares estimator of the parameters in the classical linear regression model. An important assumption of the model is

    $$E[x_i\varepsilon_i] = E[x_i(y_i - x_i'\beta)] = 0.$$

    The sample analog is

    $$\frac{1}{n}\sum_{i=1}^{n} x_i\hat\varepsilon_i = \frac{1}{n}\sum_{i=1}^{n} x_i(y_i - x_i'\hat\beta) = 0.$$

    The estimator of $\beta$ is the one that satisfies these moment equations, which are just the normal equations for the least squares estimator. So, we see that the OLS estimator is a method of moments estimator. For the instrumental variables estimator of Section 5.4, we relied on a large sample analog to the moment condition,

    $$\operatorname{plim}\,\frac{1}{n}\sum_{i=1}^{n} z_i\varepsilon_i = \operatorname{plim}\,\frac{1}{n}\sum_{i=1}^{n} z_i(y_i - x_i'\beta) = 0.$$

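    We resolved the problem of having more instruments than parameters by solving the equations

    $$\left(\frac{1}{n}X'Z\right)\left(\frac{1}{n}Z'Z\right)^{-1}\left(\frac{1}{n}Z'\hat\varepsilon\right) = \frac{1}{n}\hat X'e = \frac{1}{n}\sum_{i=1}^{n}\hat x_i\hat\varepsilon_i = 0,$$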

    where the columns of $\hat X$ are the fitted values in regressions on all the columns of $Z$ (that is, the projections of these columns of $X$ into the column space of $Z$). (See Section 5.4 for further details.) The nonlinear least squares estimator was defined similarly, though in this case, the normal equations are more complicated since the estimator is only implicit. The population orthogonality condition for the nonlinear regression model is $E[x_i^0\varepsilon_i] = 0$. The empirical moment equation is

    $$\frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial E[y_i \mid x_i, \beta]}{\partial\beta}\right)\left(y_i - E[y_i \mid x_i, \beta]\right) = 0.$$
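    The least squares and instrumental variables cases above can be checked numerically. The following is a minimal sketch on simulated data, not part of the original text: the data generating process, the sample size, and all variable names are assumptions made for illustration. It solves the OLS normal equations and the two stage least squares equations and then evaluates the corresponding sample moments, which should vanish up to rounding error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: a constant plus two instruments, one endogenous regressor.
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # n x 3 instruments
u = rng.normal(size=n)                                        # structural disturbance
x = Z[:, 1] + 0.5 * Z[:, 2] + 0.8 * u + rng.normal(size=n)    # regressor correlated with u
X = np.column_stack([np.ones(n), x])                          # n x 2 regressors
y = X @ np.array([1.0, 2.0]) + u

# OLS: solve the normal equations, i.e. the sample moments (1/n) X'(y - Xb) = 0.
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
m_ols = X.T @ (y - X @ b_ols) / n

# 2SLS: project X on Z, then solve (1/n) Xhat'(y - Xb) = 0.
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
b_iv = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)
m_iv = X_hat.T @ (y - X @ b_iv) / n

print("OLS moments: ", m_ols)
print("2SLS moments:", m_iv)
```

    With three instruments and two parameters, the three raw moments $(1/n)Z'e$ cannot all be set to zero at once; the projection reduces them to two equations that can.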

    All the maximum likelihood estimators that we have looked at thus far and will encounter later are obtained by equating the derivatives of a log-likelihood to zero. The scaled log-likelihood function is

    $$\frac{1}{n}\ln L = \frac{1}{n}\sum_{i=1}^{n}\ln f(y_i \mid \theta, x_i),$$

    where $f(\cdot)$ is the density function and $\theta$ is the parameter vector. For densities that satisfy the regularity conditions [see Section 17.4.1],

    $$E\left[\frac{\partial \ln f(y_i \mid \theta, x_i)}{\partial\theta}\right] = 0.$$

    The maximum likelihood estimator is obtained by equating the sample analog to zero:

    $$\frac{1}{n}\frac{\partial \ln L}{\partial\hat\theta} = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial \ln f(y_i \mid x_i, \hat\theta)}{\partial\hat\theta} = 0.$$

    (Dividing by n to make this result comparable with our earlier ones does not change the solution.) The upshot is that nearly all the estimators we have discussed and will encounter later can be construed as method of moments estimators. [Manski's (1992) treatment of analog estimation provides some interesting extensions and methodological discourse.] As we extend this line of reasoning, it will emerge that nearly all the estimators defined in this book can be viewed as method of moments estimators.
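    To see the likelihood equation as a moment equation in a concrete case, consider a small sketch that is not from the original text. It assumes an exponential density, $f(y \mid \theta) = \theta e^{-\theta y}$, whose score is $1/\theta - y$; the sample, the seed, and the function name `mean_score` are hypothetical. The maximum likelihood estimator $\hat\theta = 1/\bar y$ is exactly the value that sets the sample mean of the scores to zero.

```python
import random

random.seed(1)
theta_true = 2.0                      # hypothetical rate parameter
y = [random.expovariate(theta_true) for _ in range(1000)]
n = len(y)

def mean_score(theta):
    """Sample mean of d ln f / d theta for f(y | theta) = theta * exp(-theta * y)."""
    return sum(1.0 / theta - yi for yi in y) / n

theta_mle = n / sum(y)                # closed-form solution of mean_score(theta) = 0

print(theta_mle, mean_score(theta_mle))
```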
    18.3.2 GENERALIZING THE METHOD OF MOMENTS

    The preceding examples all have a common aspect. In each case listed save for the general case of the instrumental variable estimator, there are exactly as many moment equations as there are parameters to be estimated. Thus, each of these is an exactly identified case. There will be a single solution to the moment equations, and at that solution, the equations will be exactly satisfied.[4] But there are cases in which there are more moment equations than parameters, so the system is overdetermined. In Example 18.5, we defined four sample moments,

    $$\bar g = \frac{1}{n}\sum_{i=1}^{n}\left[y_i,\; y_i^2,\; \frac{1}{y_i},\; \ln y_i\right]'$$

    with probability limits $P/\lambda$, $P(P+1)/\lambda^2$, $\lambda/(P-1)$, and $\Psi(P) - \ln\lambda$, respectively. Any pair could be used to estimate the two parameters, but as shown in the earlier example, the six pairs produce six somewhat different estimates of $\theta = (P, \lambda)$.

    In such a case, to use all the information in the sample it is necessary to devise a way to reconcile the conflicting estimates that may emerge from the overdetermined system. More generally, suppose that the model involves $K$ parameters, $\theta = (\theta_1, \theta_2, \ldots, \theta_K)$, and that the theory provides a set of $L > K$ moment conditions,

    $$E[m_l(y_i, x_i, z_i, \theta)] = E[m_{il}(\theta)] = 0,$$

    where $y_i$, $x_i$, and $z_i$ are variables that appear in the model and the subscript $i$ on $m_{il}(\theta)$ indicates the dependence on $(y_i, x_i, z_i)$. Denote the corresponding sample means as

    $$\bar m_l(y, X, Z, \theta) = \frac{1}{n}\sum_{i=1}^{n} m_l(y_i, x_i, z_i, \theta) = \frac{1}{n}\sum_{i=1}^{n} m_{il}(\theta).$$

    Unless the equations are functionally dependent, the system of $L$ equations in $K$ unknown parameters,

    $$\bar m_l(\theta) = \frac{1}{n}\sum_{i=1}^{n} m_l(y_i, x_i, z_i, \theta) = 0, \quad l = 1, \ldots, L,$$

    will not have a unique solution.[5] It will be necessary to reconcile the $\binom{L}{K}$ different sets of estimates that can be produced. One possibility is to minimize a criterion function, such as the sum of squares,[6]

    $$q = \sum_{l=1}^{L}\bar m_l^2 = \bar m(\theta)'\bar m(\theta). \tag{18-2}$$

    [4] That is, of course, if there is any solution. In the regression model with collinearity, there are K parameters but fewer than K independent moment equations.

    It can be shown [see, e.g., Hansen (1982)] that under the assumptions we have made so far, specifically that $\operatorname{plim}\,\bar m(\theta) = E[\bar m(\theta)] = 0$, minimizing $q$ in (18-2) produces a consistent (albeit, as we shall see, possibly inefficient) estimator of $\theta$. We can, in fact, use as the criterion a weighted sum of squares,

    $$q = \bar m(\theta)'W_n\bar m(\theta),$$

    where $W_n$ is any positive definite matrix that may depend on the data but is not a function of $\theta$, such as $I$ in (18-2), to produce a consistent estimator of $\theta$.[7] For example, we might use a diagonal matrix of weights if some information were available about the importance (by some measure) of the different moments. We do make the additional assumption that $\operatorname{plim}\,W_n$ = a positive definite matrix, $W$.

    By the same logic that makes generalized least squares preferable to ordinary least squares, it should be beneficial to use a weighted criterion in which the weights are inversely proportional to the variances of the moments. Let $W$ be a diagonal matrix whose diagonal elements are the reciprocals of the variances of the individual moments,

    $$w_{ll} = \frac{1}{\text{Asy.Var}[\sqrt{n}\,\bar m_l]} = \frac{1}{\phi_{ll}}.$$

    (We have written it in this form to emphasize that the right-hand side involves the variance of a sample mean, which is of order $1/n$.) Then, a weighted least squares procedure would minimize

    $$q = \bar m(\theta)'\Phi^{-1}\bar m(\theta). \tag{18-3}$$

    [5] It may if L is greater than the sample size, n. We assume that L is strictly less than n.

    [6] This approach is one that Quandt and Ramsey (1978) suggested for the problem in Example 18.3.

    [7] In principle, the weighting matrix can be a function of the parameters as well. See Hansen, Heaton, and Yaron (1996) for discussion. Whether this provides any benefit in terms of the asymptotic properties of the estimator seems unlikely. The one payoff the authors do note is that certain estimators become invariant to the sort of normalization that we discussed in Example 17.1. In practical terms, this is likely to be a consideration only in a fairly small class of cases.
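    The weighted sum of squares criterion can be illustrated with a deliberately small overidentified problem. The sketch below is not from the original text and rests on several assumptions: a Poisson sample (so that one parameter $\theta$ satisfies both $E[y] = \theta$ and $E[y^2] = \theta + \theta^2$), a crude grid search in place of a proper optimizer, and hypothetical names throughout. It first minimizes the criterion with $W = I$ and then with a diagonal $W$ built from the reciprocals of the estimated variances of the two moments.

```python
import numpy as np

rng = np.random.default_rng(42)
y = rng.poisson(lam=4.0, size=5000)        # hypothetical sample; true theta = 4
y_bar, y2_bar = y.mean(), (y ** 2).mean()

def m_bar(theta):
    """Two sample moments for one parameter: E[y] = theta and E[y^2] = theta + theta^2."""
    return np.array([y_bar - theta, y2_bar - theta - theta ** 2])

def minimum_distance(W, grid=np.linspace(0.5, 10.0, 9501)):
    """Crude grid search for the theta that minimizes q = m' W m."""
    q = [m_bar(t) @ W @ m_bar(t) for t in grid]
    return float(grid[int(np.argmin(q))])

# Identity weighting: the plain sum of squares criterion.
theta_ols = minimum_distance(np.eye(2))

# Diagonal weighting: reciprocals of the estimated variances of the two moments,
# evaluated at the first-round estimate.
phi_11 = np.mean((y - theta_ols) ** 2)
phi_22 = np.mean((y ** 2 - theta_ols - theta_ols ** 2) ** 2)
theta_wls = minimum_distance(np.diag([1.0 / phi_11, 1.0 / phi_22]))

print(theta_ols, theta_wls)
```

    Both choices of weights give consistent estimates here; they differ only in how the two conflicting moment equations are traded off against each other.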


    In general, the $L$ elements of $\bar m$ are freely correlated. In (18-3), we have used a diagonal $W$ that ignores this correlation. To use generalized least squares, we would define the full matrix,

    $$W = \left\{\text{Asy.Var}[\sqrt{n}\,\bar m]\right\}^{-1} = \Phi^{-1}. \tag{18-4}$$

    The estimators defined by choosing $\theta$ to minimize

    $$q = \bar m(\theta)'W_n\bar m(\theta)$$

    are minimum distance estimators. The general result is that if $W_n$ is a positive definite matrix and if $\operatorname{plim}\,\bar m(\theta) = 0$, then the minimum distance (generalized method of moments, or GMM) estimator of $\theta$ is consistent.[8] Since the OLS criterion in (18-2) uses $I$, this method produces a consistent estimator, as does the weighted least squares estimator and the full GLS estimator. What remains to be decided is the best $W$ to use. Intuition might suggest (correctly) that the one defined in (18-4) would be optimal, once again based on the logic that motivates generalized least squares. This result is the now celebrated one of Hansen (1982).

    The asymptotic covariance matrix of this generalized method of moments estimator is

    $$V_{GMM} = \frac{1}{n}[\Gamma'W\Gamma]^{-1} = \frac{1}{n}[\Gamma'\Phi^{-1}\Gamma]^{-1}, \tag{18-5}$$

    where $\Gamma$ is the matrix of derivatives with $j$th row equal to

    $$\Gamma^j = \operatorname{plim}\,\frac{\partial\bar m_j(\theta)}{\partial\theta'}$$

    and $\Phi = \text{Asy.Var}[\sqrt{n}\,\bar m]$. Finally, by virtue of the central limit theorem applied to the sample moments and the Slutsky theorem applied to this manipulation, we can expect the estimator to be asymptotically normally distributed. We will revisit the asymptotic properties of the estimator in Section 18.3.3.

    Example 18.7 GMM Estimation of the Parameters of a Gamma Distribution

    Referring once again to our earlier results in Example 18.5, we consider how to use all four of our sample moments to estimate the parameters of the gamma distribution.[9] The four moment equations are

    $$E\begin{bmatrix} y_i - P/\lambda \\ y_i^2 - P(P+1)/\lambda^2 \\ \ln y_i - \Psi(P) + \ln\lambda \\ 1/y_i - \lambda/(P-1) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}.$$

    [8] In the most general cases, a number of other subtle conditions must be met so as to assert consistency and the other properties we discuss. For our purposes, the conditions given will suffice. Minimum distance estimators are discussed in Malinvaud (1970), Hansen (1982), and Amemiya (1985).

    [9] We emphasize that this example is constructed only to illustrate the computation of a GMM estimator. The gamma model is fully specified by the likelihood function, and the MLE is fully efficient. We will examine other cases that involve less detailed specifications later in the book.


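    The sample means of these will provide the moment equations for estimation. Let $y_1 = y$, $y_2 = y^2$, $y_3 = \ln y$, and $y_4 = 1/y$. Then

    $$\bar m_1(P, \lambda) = \frac{1}{n}\sum_{i=1}^{n}(y_{i1} - P/\lambda) = \frac{1}{n}\sum_{i=1}^{n}[y_{i1} - \mu_1(P, \lambda)] = \bar y_1 - \mu_1(P, \lambda),$$

    and likewise for $\bar m_2(P, \lambda)$, $\bar m_3(P, \lambda)$, and $\bar m_4(P, \lambda)$. For our initial set of estimates, we will use ordinary least squares. The optimization problem is

    $$\text{Minimize}_{P,\lambda}\ \sum_{l=1}^{4}\bar m_l(P, \lambda)^2 = \sum_{l=1}^{4}[\bar y_l - \mu_l(P, \lambda)]^2 = \bar m(P, \lambda)'\bar m(P, \lambda).$$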

    This estimator will be the minimum distance estimator with $W = I$. This nonlinear optimization problem must be solved iteratively. As starting values for the iterations, we used the maximum likelihood estimates from Example 18.5, $\hat P_{ML} = 2.4106$ and $\hat\lambda_{ML} = 0.0770702$. The least squares values that result from this procedure are $\hat P = 2.0582996$ and $\hat\lambda = 0.06579888$. We can now use these to form our estimate of $W$. GMM estimation usually requires a first-step estimation such as this one to obtain the weighting matrix $W$. With these new estimates in hand, we obtained

    $$\hat\Phi = \frac{1}{20}\sum_{i=1}^{20}\hat m_i\hat m_i', \qquad \hat m_i = \begin{bmatrix} y_{i1} - \hat P/\hat\lambda \\ y_{i2} - \hat P(\hat P+1)/\hat\lambda^2 \\ y_{i3} - \Psi(\hat P) + \ln\hat\lambda \\ y_{i4} - \hat\lambda/(\hat P-1) \end{bmatrix}.$$

    (Note, we could have computed $\hat\Phi$ using the maximum likelihood estimates.) The GMM estimator is now obtained by minimizing

    $$q = \bar m(P, \lambda)'\hat\Phi^{-1}\bar m(P, \lambda).$$

    The two estimates are $\hat P_{GMM} = 3.35894$ and $\hat\lambda_{GMM} = 0.124489$. At these two values, the value of the function is $q = 1.97522$. To obtain an asymptotic covariance matrix for the two estimates, we first recompute $\hat\Phi$ as shown above (the matrix is symmetric, so only the upper triangle is shown):

    $$\frac{1}{20}\hat\Phi = \begin{bmatrix} 24.7051 & 2307.126 & 0.6974 & -0.0283 \\ & 229{,}609.5 & 58.8148 & -2.1423 \\ & & 0.0230 & -0.0011 \\ & & & 0.000065413 \end{bmatrix}.$$

    To complete the computation, we will require the derivatives matrix,

    $$\bar G(\theta) = \begin{bmatrix} \partial\bar m_1/\partial P & \partial\bar m_1/\partial\lambda \\ \partial\bar m_2/\partial P & \partial\bar m_2/\partial\lambda \\ \partial\bar m_3/\partial P & \partial\bar m_3/\partial\lambda \\ \partial\bar m_4/\partial P & \partial\bar m_4/\partial\lambda \end{bmatrix} = \begin{bmatrix} -1/\lambda & P/\lambda^2 \\ -(2P+1)/\lambda^2 & 2P(P+1)/\lambda^3 \\ -\Psi'(P) & 1/\lambda \\ \lambda/(P-1)^2 & -1/(P-1) \end{bmatrix},$$

    $$\hat G = \begin{bmatrix} -8.0328 & 216.74 \\ -498.01 & 15178.2 \\ -0.34635 & 8.0328 \\ 0.022372 & -0.42392 \end{bmatrix}.$$

    Finally,

    $$\frac{1}{20}[\hat G'\hat\Phi^{-1}\hat G]^{-1} = \begin{bmatrix} 0.202201 & 0.0117344 \\ 0.0117344 & 0.000867519 \end{bmatrix}$$

    gives the estimated asymptotic covariance matrix for the estimators. Recall that in Example 18.5, we obtained maximum likelihood estimates of the same parameters. Table 18.1 summarizes.

    TABLE 18.1  Estimates of the Parameters of a Gamma Distribution

    Parameter         Maximum Likelihood    Generalized Method of Moments
    P                 2.4106                3.3589
    Standard Error    0.87683               0.449667
    $\lambda$         0.0770701             0.12449
    Standard Error    0.02707               0.029099

    Looking ahead, we should have expected the GMM estimator to improve the standard errors. The fact that it does for $P$ but not for $\lambda$ might cast some suspicion on the specification of the model. In fact, the data generating process underlying these data is not a gamma population; the values were hand picked by the author. Thus, the findings in Table 18.1 might not be surprising. We will return to this issue in Section 18.4.1.
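    The numerical entries of the derivatives matrix in Example 18.7 can be reproduced from the reported GMM estimates by differentiating the four moments with respect to $P$ and $\lambda$. The sketch below is an illustration, not the author's computation: the helper `trigamma` is a hypothetical stand-in for a library routine for $\Psi'(P)$, built from the recurrence $\Psi'(x) = \Psi'(x+1) + 1/x^2$ and the leading terms of the standard asymptotic series.

```python
def trigamma(x, shift=6):
    """Approximate Psi'(x): shift x upward with the recurrence, then use an asymptotic series."""
    total = 0.0
    for k in range(shift):
        total += 1.0 / (x + k) ** 2
    z = x + shift
    return total + 1.0 / z + 1.0 / (2 * z ** 2) + 1.0 / (6 * z ** 3) - 1.0 / (30 * z ** 5)

P, lam = 3.35894, 0.124489            # GMM estimates reported in the example

# One row per moment; columns are the derivatives with respect to P and lambda.
G = [
    [-1.0 / lam, P / lam ** 2],                            # y - P/lambda
    [-(2 * P + 1) / lam ** 2, 2 * P * (P + 1) / lam ** 3], # y^2 - P(P+1)/lambda^2
    [-trigamma(P), 1.0 / lam],                             # ln y - Psi(P) + ln lambda
    [lam / (P - 1) ** 2, -1.0 / (P - 1)],                  # 1/y - lambda/(P-1)
]

for row in G:
    print(row)
```

    The printed rows can be compared with the eight values given in the example; the signs follow from the derivatives themselves.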
    18.3.3 PROPERTIES OF THE GMM ESTIMATOR

    We will now examine the properties of the GMM estimator in some detail. Since the GMM estimator includes other familiar estimators that we have already encountered, including least squares (linear and nonlinear), instrumental variables, and maximum likelihood, these results will extend to those cases. The discussion given here will only sketch the elements of the formal proofs. The assumptions we make here are somewhat narrower than a fully general treatment might allow, but they are broad enough to include the situations likely to arise in practice. More detailed and rigorous treatments may be found in, for example, Newey and McFadden (1994), White (2001), Hayashi (2000), Mittelhammer et al. (2000), or Davidson (2000). This development will continue the analysis begun in Section 10.4 and add some detail to the formal results of Section 16.5.

    The GMM estimator is based on the set of population orthogonality conditions,

    $$E[m_i(\theta_0)] = 0,$$

    where we denote the true parameter vector by $\theta_0$. The subscript $i$ on $m_i$ indicates dependence on the observed data, $y_i, x_i, z_i$. Averaging this over the sample observations produces the sample moment equation

    $$E[\bar m_n(\theta_0)] = 0,$$

    where

    $$\bar m_n(\theta_0) = \frac{1}{n}\sum_{i=1}^{n} m_i(\theta_0).$$

    This moment is a set of L equations involving the K parameters. We will assume that this expectation exists and that the sample counterpart converges to it. The definitions are cast in terms of the population parameters and are indexed by the sample size. To fix the ideas, consider, once again, the empirical moment equations which define the instrumental variable estimator for a linear or nonlinear regression model.

    Example 18.8 Empirical Moment Equation for Instrumental Variables

    For the IV estimator in the linear or nonlinear regression model, we assume

    $$E[\bar m_n(\beta)] = E\left[\frac{1}{n}\sum_{i=1}^{n} z_i[y_i - h(x_i, \beta)]\right] = 0.$$

    There are $L$ instrumental variables in $z_i$ and $K$ parameters in $\beta$. This statement defines $L$ moment equations, one for each instrumental variable.

    We make the following assumptions about the model and these empirical moments:

    ASSUMPTION 18.1 Convergence of the Empirical Moments: The data generating process is assumed to meet the conditions for a law of large numbers to apply, so that we may assume that the empirical moments converge in probability to their expectation. Appendix D lists several different laws of large numbers that increase in generality. What is required for this assumption is that

    $$\bar m_n(\theta_0) = \frac{1}{n}\sum_{i=1}^{n} m_i(\theta_0) \xrightarrow{p} 0.$$

    The laws of large numbers that we examined in Appendix D accommodate cases of independent observations. Cases of dependent or correlated observations can be gathered under the Ergodic Theorem (12.1). For this more general case, then, we would assume that the sequence of observations $m(\theta)$ constitutes a jointly ($L \times 1$) stationary and ergodic process.

    The empirical moments are assumed to be continuous and continuously differentiable functions of the parameters. For our example above, this would mean that the conditional mean function, $h(x_i, \beta)$, is a continuous function of $\beta$ (though not necessarily of $x_i$). With continuity and differentiability, we also will be able to assume that the derivatives of the moments,

    $$\bar G_n(\theta_0) = \frac{\partial\bar m_n(\theta_0)}{\partial\theta_0'} = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial m_{i,n}(\theta_0)}{\partial\theta_0'},$$

    converge to a probability limit, say $\operatorname{plim}\,\bar G_n(\theta_0) = \bar G(\theta_0)$. For sets of independent observations, the continuity of the functions and the derivatives will allow us to invoke the Slutsky Theorem to obtain this result. For the more general case of sequences of dependent observations, Theorem 12.2, Ergodicity of Functions, will provide a counterpart to the Slutsky Theorem for time series data. In sum, if the moments themselves obey a law of large numbers, then it is reasonable to assume that the derivatives do as well.

    ASSUMPTION 18.2 Identification: For any $n \ge K$, if $\theta_1$ and $\theta_2$ are two different parameter vectors, then there exist data sets such that $\bar m_n(\theta_1) \ne \bar m_n(\theta_2)$. Formally, in Section 16.5.3, identification is defined to imply that the probability limit of the GMM criterion function is uniquely minimized at the true parameters, $\theta_0$.


    Assumption 18.2 is a practical prescription for identification. More formal conditions are discussed in Section 16.5.3. We have examined two violations of this crucial assumption. In the linear regression model, one of the assumptions is full rank of the matrix of exogenous variables (the absence of multicollinearity in X). In our discussion of the maximum likelihood estimator, we encountered a case (Example 17.2) in which a normalization was needed to identify the vector of parameters. [See Hansen et al. (1996) for discussion of this case.] Both of these cases are included in this assumption. The identification condition has three important implications:

    Order Condition: The number of moment conditions is at least as large as the number of parameters; $L \ge K$. This is necessary but not sufficient for identification.

    Rank Condition: The $L \times K$ matrix of derivatives, $\bar G_n(\theta_0)$, will have row rank equal to $K$. (Again, note that the number of rows must equal or exceed the number of columns.)

    Uniqueness: With the continuity assumption, the identification assumption implies that the parameter vector that satisfies the population moment condition is unique. We know that at the true parameter vector, $\operatorname{plim}\,\bar m_n(\theta_0) = 0$. If $\theta_1$ is any parameter vector that satisfies this condition, then $\theta_1$ must equal $\theta_0$.

    Assumptions 18.1 and 18.2 characterize the parameterization of the model. Together they establish that the parameter vector will be estimable. We now make the statistical assumption that will allow us to establish the properties of the GMM estimator.

    ASSUMPTION 18.3 Asymptotic Distribution of Empirical Moments: We assume that the empirical moments obey a central limit theorem. This assumes that the moments have a finite asymptotic covariance matrix, $(1/n)\Phi$, so that

    $$\sqrt{n}\,\bar m_n(\theta_0) \xrightarrow{d} N[0, \Phi].$$

    The underlying requirements on the data for this assumption to hold will vary and will be complicated if the observations comprising the empirical moment are not independent. For samples of independent observations, we assume the conditions underlying the Lindeberg-Feller (D.19) or Liapounov Central Limit Theorem (D.20) will suffice. For the more general case, it is once again necessary to make some assumptions about the data. We have assumed that

    $$E[m_i(\theta_0)] = 0.$$

    If we can go a step further and assume that the functions $m_i(\theta_0)$ are an ergodic, stationary martingale difference series,

    $$E[m_i(\theta_0) \mid m_{i-1}(\theta_0), m_{i-2}(\theta_0), \ldots] = 0,$$

    then we can invoke Theorem 12.3, the Central Limit Theorem for Martingale Difference Series. It will generally be fairly complicated to verify this assumption for nonlinear models, so it will usually be assumed outright. On the other hand, the assumptions are likely to be fairly benign in a typical application. For regression models, the assumption takes the form

    $$E[z_i\varepsilon_i \mid z_{i-1}\varepsilon_{i-1}, \ldots] = 0,$$

    which will often be part of the central structure of the model.


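    With the assumptions in place, we have

    THEOREM 18.1 Asymptotic Distribution of the GMM Estimator: Under the preceding assumptions,

    $$\hat\theta_{GMM} \xrightarrow{p} \theta_0,$$

    $$\hat\theta_{GMM} \overset{a}{\sim} N[\theta_0, V_{GMM}], \tag{18-6}$$

    where $V_{GMM}$ is defined in (18-5).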

    We will now sketch a proof of Theorem 18.1. The GMM estimator is obtained by minimizing the criterion function

    $$q_n(\theta) = \bar m_n(\theta)'W_n\bar m_n(\theta),$$

    where $W_n$ is the weighting matrix used. Consistency of the estimator that minimizes this criterion can be established by the same logic we used for the maximum likelihood estimator. It must first be established that $q_n(\theta)$ converges to a value $q_0(\theta)$. By our assumptions of strict continuity and Assumption 18.1, $q_n(\theta_0)$ converges to 0. (We could apply the Slutsky theorem to obtain this result.) We will assume that $q_n(\theta)$ converges to $q_0(\theta)$ for other points in the parameter space as well. Since $W_n$ is positive definite, for any finite $n$, we know that

    $$0 \le q_n(\hat\theta_{GMM}) \le q_n(\theta_0). \tag{18-7}$$

    That is, in the finite sample, $\hat\theta_{GMM}$ actually minimizes the function, so the sample value of the criterion is not larger at $\hat\theta_{GMM}$ than at any other value, including the true parameters. But, at the true parameter values, $q_n(\theta_0) \xrightarrow{p} 0$. So, if (18-7) is true, then it must follow that $q_n(\hat\theta_{GMM}) \xrightarrow{p} 0$ as well because of the identification assumption, 18.2. As $n \to \infty$, $q_n(\hat\theta_{GMM})$ and $q_n(\theta)$ converge to the same limit. It must be the case, then, that as $n \to \infty$, $\bar m_n(\hat\theta_{GMM}) \to \bar m_n(\theta_0)$, since the function is quadratic and $W$ is positive definite. The identification condition that we assumed earlier now assures that as $n \to \infty$, $\hat\theta_{GMM}$ must equal $\theta_0$. This establishes consistency of the estimator.

    We will now sketch a proof of the asymptotic normality of the estimator. The first order conditions for the GMM estimator are

    $$\frac{\partial q_n(\hat\theta_{GMM})}{\partial\hat\theta_{GMM}} = 2\bar G_n(\hat\theta_{GMM})'W_n\bar m_n(\hat\theta_{GMM}) = 0. \tag{18-8}$$

    (The leading 2 is irrelevant to the solution, so it will be dropped at this point.) The orthogonality equations are assumed to be continuous and continuously differentiable. This allows us to employ the mean value theorem as we expand the empirical moments in a linear Taylor series around the true value, $\theta_0$:

    $$\bar m_n(\hat\theta_{GMM}) = \bar m_n(\theta_0) + \bar G_n(\bar\theta)(\hat\theta_{GMM} - \theta_0), \tag{18-9}$$

    where $\bar\theta$ is a point between $\hat\theta_{GMM}$ and the true parameters, $\theta_0$. Thus, for each element, $\bar\theta_k = w_k\hat\theta_{k,GMM} + (1 - w_k)\theta_{0,k}$ for some $w_k$ such that $0 < w_k < 1$. Insert (18-9) in (18-8) to obtain

    $$\bar G_n(\hat\theta_{GMM})'W_n\bar m_n(\theta_0) + \bar G_n(\hat\theta_{GMM})'W_n\bar G_n(\bar\theta)(\hat\theta_{GMM} - \theta_0) = 0.$$


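    Solve this equation for the estimation error and multiply by $\sqrt{n}$. This produces

    $$\sqrt{n}(\hat\theta_{GMM} - \theta_0) = -[\bar G_n(\hat\theta_{GMM})'W_n\bar G_n(\bar\theta)]^{-1}\bar G_n(\hat\theta_{GMM})'W_n\sqrt{n}\,\bar m_n(\theta_0).$$

    Assuming that they have them, the quantities on the left- and right-hand sides have the same limiting distributions. By the consistency of $\hat\theta_{GMM}$, we know that $\hat\theta_{GMM}$ and $\bar\theta$ both converge to $\theta_0$. By the strict continuity assumed, it must also be the case that

    $$\bar G_n(\bar\theta) \xrightarrow{p} \bar G(\theta_0) \quad\text{and}\quad \bar G_n(\hat\theta_{GMM}) \xrightarrow{p} \bar G(\theta_0).$$

    We have also assumed that the weighting matrix, $W_n$, converges to a matrix of constants, $W$. Collecting terms, we find that the limiting distribution of the vector on the left-hand side must be the same as that on the right-hand side in (18-10),

    $$\sqrt{n}(\hat\theta_{GMM} - \theta_0) \xrightarrow{p} -\left\{[\bar G(\theta_0)'W\bar G(\theta_0)]^{-1}\bar G(\theta_0)'W\right\}\sqrt{n}\,\bar m_n(\theta_0). \tag{18-10}$$

    We now invoke Assumption 18.3. The matrix in curled brackets is a set of constants. The last term has the normal limiting distribution given in Assumption 18.3. The mean and variance of this limiting distribution are zero and $\Phi$, respectively; the leading minus sign does not affect the variance. Collecting terms, we have the result in Theorem 18.1, where

    $$V_{GMM} = \frac{1}{n}[\bar G(\theta_0)'W\bar G(\theta_0)]^{-1}\bar G(\theta_0)'W\Phi W\bar G(\theta_0)[\bar G(\theta_0)'W\bar G(\theta_0)]^{-1}. \tag{18-11}$$

    The final result is a function of the choice of weighting matrix, $W$. If the optimal weighting matrix, $W = \Phi^{-1}$, is used, then the expression collapses to

    $$V_{GMM,optimal} = \frac{1}{n}[\bar G(\theta_0)'\Phi^{-1}\bar G(\theta_0)]^{-1}. \tag{18-12}$$

    Returning to (18-11), there is a special case of interest. If we use least squares or instrumental variables with $W = I$, then

    $$V_{GMM} = \frac{1}{n}(\bar G'\bar G)^{-1}\bar G'\Phi\bar G(\bar G'\bar G)^{-1}.$$

    This equation is essentially (10-23) to (10-24), the White or Newey-West estimator, which returns us to our departure point and provides a neat symmetry to the GMM principle.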
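    The relationship between the general covariance matrix and its optimal special case can be verified numerically. The sketch below is not from the original text; the matrices `G` and `Phi` are randomly generated stand-ins (assumptions) for the derivatives matrix and the asymptotic covariance matrix of the moments. It checks that the sandwich form with $W = \Phi^{-1}$ collapses to the simpler expression, and that the covariance matrix under $W = I$ exceeds the optimal one by a positive semidefinite matrix.

```python
import numpy as np

rng = np.random.default_rng(7)
n, L, K = 200, 4, 2                       # hypothetical dimensions with L > K

G = rng.normal(size=(L, K))               # stand-in for the derivatives matrix
A = rng.normal(size=(L, L))
Phi = A @ A.T + L * np.eye(L)             # stand-in for Asy.Var[sqrt(n) m], positive definite

def v_gmm(W):
    """Sandwich form: (1/n) [G'WG]^-1 G'W Phi W G [G'WG]^-1."""
    bread = np.linalg.inv(G.T @ W @ G)
    return bread @ (G.T @ W @ Phi @ W @ G) @ bread / n

V_identity = v_gmm(np.eye(L))
V_optimal = v_gmm(np.linalg.inv(Phi))
V_collapsed = np.linalg.inv(G.T @ np.linalg.inv(Phi) @ G) / n

print(np.allclose(V_optimal, V_collapsed))
print(np.linalg.eigvalsh(V_identity - V_optimal))
```

    The eigenvalues printed on the last line should be nonnegative up to rounding error, which is the sense in which the weighting matrix $\Phi^{-1}$ is optimal.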
    18.3.4 GMM ESTIMATION OF SOME SPECIFIC ECONOMETRIC MODELS

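    Suppose that the theory specifies a relationship

    $$y_i = h(x_i, \beta) + \varepsilon_i,$$

    where $\beta$ is a $K \times 1$ parameter vector that we wish to estimate. This may not be a regression relationship, since it is possible that

    $$\text{Cov}[\varepsilon_i, h(x_i, \beta)] \ne 0,$$

    or even

    $$\text{Cov}[\varepsilon_i, x_j] \ne 0 \text{ for all } i \text{ and } j.$$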


    Consider, for example, a model that contains lagged dependent variables and autocorrelated disturbances. (See Section 12.9.4.) For the present, we assume that

    $$E[\varepsilon \mid X] \ne 0$$

    and

    $$E[\varepsilon\varepsilon' \mid X] = \sigma^2\Omega = \Sigma,$$

    where $\Sigma$ is symmetric and positive definite but otherwise unrestricted. The disturbances may be heteroscedastic and/or autocorrelated. But for the possibility of correlation between regressors and disturbances, this model would be a generalized, possibly nonlinear, regression model. Suppose that at each observation $i$ we observe a vector of $L$ variables, $z_i$, such that $z_i$ is uncorrelated with $\varepsilon_i$. You will recognize $z_i$ as a set of instrumental variables. The assumptions thus far have implied a set of orthogonality conditions,

    $$E[z_i\varepsilon_i \mid x_i] = 0,$$

    which may be sufficient to identify (if $L = K$) or even overidentify (if $L > K$) the parameters of the model.

    For convenience, define $e(X, \hat\beta)$ as the $n \times 1$ vector with elements $y_i - h(x_i, \hat\beta)$, $i = 1, \ldots, n$, and $Z$ as the $n \times L$ matrix whose $i$th row is $z_i'$. By a straightforward extension of our earlier results, we can produce a GMM estimator of $\beta$. The sample moments will be

    $$\bar m_n(\beta) = \frac{1}{n}\sum_{i=1}^{n} z_i e(x_i, \beta) = \frac{1}{n}Z'e(X, \beta).$$

    The minimum distance estimator will be the $\hat\beta$ that minimizes

    $$q = \bar m_n(\hat\beta)'W\bar m_n(\hat\beta) = \left[\frac{1}{n}e(X, \hat\beta)'Z\right]W\left[\frac{1}{n}Z'e(X, \hat\beta)\right] \tag{18-13}$$

    for some choice of $W$ that we have yet to determine. The criterion given above produces the nonlinear instrumental variable estimator. If we use $W = (Z'Z)^{-1}$, then we have exactly the estimation criterion we used in Section 9.5.1 where we defined the nonlinear instrumental variables estimator. Apparently (18-13) is more general, since we are not limited to this choice of $W$. The linear IV estimator is a special case. For any given choice of $W$, as long as there are enough orthogonality conditions to identify the parameters, estimation by minimizing $q$ is, at least in principle, a straightforward problem in nonlinear optimization. Hansen (1982) showed that the optimal choice of $W$ for this estimator is

    $$W_{GMM} = \left\{\text{Asy.Var}[\sqrt{n}\,\bar m_n(\beta)]\right\}^{-1} = \left\{\text{Asy.Var}\left[\frac{1}{\sqrt{n}}\sum_{i=1}^{n} z_i\varepsilon_i\right]\right\}^{-1} = \left\{\text{Asy.Var}\left[\frac{1}{\sqrt{n}}Z'e(X, \beta)\right]\right\}^{-1}. \tag{18-14}$$


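    For our model, this is

    $$W = \left\{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n}\text{Cov}[z_i\varepsilon_i, z_j\varepsilon_j]\right\}^{-1} = \left\{\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n}\sigma_{ij}z_iz_j'\right\}^{-1} = \left[\frac{Z'\Sigma Z}{n}\right]^{-1}.$$

    If we insert this result in (18-13), we obtain the criterion for the GMM estimator:

    $$q = \left[\frac{1}{n}e(X, \hat\beta)'Z\right]\left[\frac{Z'\Sigma Z}{n}\right]^{-1}\left[\frac{1}{n}Z'e(X, \hat\beta)\right].$$

    There is a possibly difficult detail to be considered. The GMM estimator involves

    $$\frac{1}{n}Z'\Sigma Z = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} z_iz_j'\,\text{Cov}[\varepsilon_i, \varepsilon_j] = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n} z_iz_j'\,\text{Cov}[(y_i - h(x_i, \beta)), (y_j - h(x_j, \beta))].$$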

    The conditions under which such a double sum might converge to a positive definite matrix are sketched in Sections 5.3.2 and 12.4.1. Assuming that they do hold, estimation appears to require that an estimate of $\beta$ be in hand already, even though it is the object of estimation. It may be that a consistent but inefficient estimator of $\beta$ is available. Suppose for the present that one is. If observations are uncorrelated, then the cross observations terms may be omitted, and what is required is

    $$\frac{1}{n}Z'\Sigma Z = \frac{1}{n}\sum_{i=1}^{n} z_iz_i'\,\text{Var}[(y_i - h(x_i, \beta))].$$

    We can use the White (1980) estimator discussed in Sections 11.2.2 and 11.3 for this case:

    $$S_0 = \frac{1}{n}\sum_{i=1}^{n} z_iz_i'(y_i - h(x_i, \hat\beta))^2. \tag{18-15}$$

    If the disturbances are autocorrelated but the process is stationary, then Newey and West's (1987a) estimator is available (assuming that the autocorrelations are sufficiently small at a reasonable lag, $p$):

    $$S = S_0 + \frac{1}{n}\sum_{\ell=1}^{p} w(\ell)\sum_{i=\ell+1}^{n} e_ie_{i-\ell}[z_iz_{i-\ell}' + z_{i-\ell}z_i'] = \sum_{\ell=0}^{p} w(\ell)S_\ell, \tag{18-16}$$

    where

    $$w(\ell) = 1 - \frac{\ell}{p+1}.$$
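    The two estimators in (18-15) and (18-16) translate directly into matrix operations. The sketch below is an illustration under stated assumptions, not the author's code: the function names `white_s0` and `newey_west`, the simulated AR(1) residuals, the instruments, and the lag length of 4 are all hypothetical choices made for the example.

```python
import numpy as np

def white_s0(Z, e):
    """S0 = (1/n) sum_i e_i^2 z_i z_i'."""
    n = Z.shape[0]
    return (Z * (e ** 2)[:, None]).T @ Z / n

def newey_west(Z, e, p):
    """S = S0 + (1/n) sum_{l=1..p} w(l) sum_{i=l+1..n} e_i e_{i-l} (z_i z_{i-l}' + z_{i-l} z_i')."""
    n = Z.shape[0]
    S = white_s0(Z, e)
    for lag in range(1, p + 1):
        w = 1.0 - lag / (p + 1.0)                     # weight w(l) = 1 - l/(p+1)
        ze_now = Z[lag:] * e[lag:, None]              # rows z_i e_i for i = l+1..n
        ze_lag = Z[:-lag] * e[:-lag, None]            # rows z_{i-l} e_{i-l}
        cross = ze_now.T @ ze_lag                     # sum_i e_i e_{i-l} z_i z_{i-l}'
        S += w * (cross + cross.T) / n
    return S

# Hypothetical data: AR(1) residuals and three instruments.
rng = np.random.default_rng(3)
n = 400
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
shocks = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.6 * e[t - 1] + shocks[t]

S0 = white_s0(Z, e)
S = newey_west(Z, e, p=4)                             # lag length chosen for illustration
print(S0)
print(S)
```

    Setting `p=0` in `newey_west` skips the loop and returns $S_0$, which matches the $\ell = 0$ term of the sum in (18-16).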

    The maximum lag length $p$ must be determined in advance. We will require that observations that are far apart in time (that is, for which $|i - \ell|$ is large) must have increasingly smaller covariances for us to establish the convergence results that justify OLS, GLS, and now GMM estimation. The choice of $p$ is a reflection of how far back in time one must go to consider the autocorrelation negligible for purposes of estimating $(1/n)Z'\Sigma Z$. Current practice suggests using the smallest integer greater than or equal to $T^{1/4}$.

    Still left open is the question of where the initial consistent estimator should be obtained. One possibility is to obtain an inefficient but consistent GMM estimator by using $W = I$ in (18-13). That is, use a nonlinear (or linear, if the equation is linear) instrumental variables estimator. This first-step estimator can then be used to construct $W$, which, in turn, can then be used in the GMM estimator. Another possibility is that $\beta$ may be consistently estimable by some straightforward procedure other than GMM.

    Once the GMM estimator has been computed, its asymptotic covariance matrix and asymptotic distribution can be estimated based on (18-11) and (18-12). Recall that

    $$\bar m_n(\beta) = \frac{1}{n}\sum_{i=1}^{n} z_i\varepsilon_i,$$

    which is a sum of $L \times 1$ vectors. The derivative, $\partial\bar m_n(\beta)/\partial\beta'$, is a sum of $L \times K$ matrices, so

    $$\bar G(\beta) = \frac{\partial\bar m(\beta)}{\partial\beta'} = \frac{1}{n}\sum_{i=1}^{n} G_i(\beta) = \frac{1}{n}\sum_{i=1}^{n} z_i\left[\frac{\partial\varepsilon_i}{\partial\beta'}\right]. \tag{18-17}$$

    In the model we are considering here,

    $$\frac{\partial\varepsilon_i}{\partial\beta'} = -\frac{\partial h(x_i, \beta)}{\partial\beta'}.$$

    The derivatives are the pseudoregressors in the linearized regression model that we examined in Section 9.2.3. Using the notation defined there,

    $$\frac{\partial\varepsilon_i}{\partial\beta} = -x_{i0},$$

    so

    $$\bar G(\beta) = \frac{1}{n}\sum_{i=1}^{n} G_i(\beta) = \frac{1}{n}\sum_{i=1}^{n} -z_ix_{i0}' = -\frac{1}{n}Z'X_0. \tag{18-18}$$

    With this matrix in hand, the estimated asymptotic covariance matrix for the GMM estimator is

    $$\text{Est.Asy.Var}[\hat\beta] = \frac{1}{n}\left[\bar G(\hat\beta)'\left(\frac{1}{n}Z'\hat\Sigma Z\right)^{-1}\bar G(\hat\beta)\right]^{-1} = [(X_0'Z)(Z'\hat\Sigma Z)^{-1}(Z'X_0)]^{-1}. \tag{18-19}$$

    (The two minus signs, a $1/n^2$, and an $n^2$ all fall out of the result.) If the $\Sigma$ that appears in (18-19) were $\sigma^2I$, then (18-19) would be precisely the asymptotic covariance matrix that appears in Theorem 5.4 for linear models and Theorem 9.3 for nonlinear models. But there is an interesting distinction between this estimator and the IV estimators discussed earlier. In the earlier cases, when there were more instrumental variables than parameters, we resolved the overidentification by specifically choosing a set of $K$ instruments, the $K$ projections of the columns of $X$ or $X_0$ into the column space of $Z$. Here, in contrast, we do not attempt to resolve the overidentification; we simply use all the instruments and minimize the GMM criterion. Now, you should be able to show that when $\Sigma = \sigma^2I$ and we use this information, when all is said and done, the same parameter estimates will be obtained. But, if we use a weighting matrix that differs from $W = (Z'Z/n)^{-1}$, then they are not.
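    For the linear special case, $h(x_i, \beta) = x_i'\beta$, the minimizer of the GMM criterion has a closed form, so the two-step procedure described above can be sketched in a few lines. This is an illustration only: the simulated heteroscedastic data, the function name `linear_gmm`, and the use of the White-type matrix $S_0$ for the second-step weights are assumptions, not the author's implementation.

```python
import numpy as np

def linear_gmm(y, X, Z, W):
    """Minimize [(1/n) e'Z] W [(1/n) Z'e] with e = y - Xb; the solution is closed form."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))

# Hypothetical heteroscedastic data with L = 3 instruments and K = 2 parameters.
rng = np.random.default_rng(11)
n = 1000
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
u = rng.normal(size=n) * (1.0 + 0.5 * np.abs(Z[:, 1]))       # variance depends on z
x = Z[:, 1] + 0.5 * Z[:, 2] + 0.7 * u + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + u

# Step 1: W = (Z'Z/n)^-1, which reproduces two stage least squares.
W1 = np.linalg.inv(Z.T @ Z / n)
b1 = linear_gmm(y, X, Z, W1)

# Step 2: White-type estimate of (1/n) Z' Sigma Z from step 1 residuals, then reweight.
e1 = y - X @ b1
S0 = (Z * (e1 ** 2)[:, None]).T @ Z / n
b2 = linear_gmm(y, X, Z, np.linalg.inv(S0))

# Estimated asymptotic covariance matrix, [(X'Z)(n S0)^-1 (Z'X)]^-1.
V = np.linalg.inv(X.T @ Z @ np.linalg.inv(n * S0) @ Z.T @ X)

print(b1, b2)
print(np.sqrt(np.diag(V)))
```

    Because the disturbances here are heteroscedastic, the second-step weights differ from $(Z'Z/n)^{-1}$ and the two sets of estimates generally differ, in line with the last sentence of the paragraph above.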


18.4 TESTING HYPOTHESES IN THE GMM FRAMEWORK

The estimation framework developed in the previous section provides the basis for a convenient set of statistics for testing hypotheses. We will consider three groups of tests. The first is a pair of statistics that is used for testing the validity of the restrictions that produce the moment equations. The second is a trio of tests that correspond to the familiar Wald, LM, and LR tests that we have examined at several points in the preceding chapters. The third is a class of tests based on the theoretical underpinnings of the conditional moments that we used earlier to devise the GMM estimator.
18.4.1 TESTING THE VALIDITY OF THE MOMENT RESTRICTIONS

In the exactly identified cases we examined earlier (least squares, instrumental variables, maximum likelihood), the criterion for GMM estimation, $q = \bar{\mathbf{m}}(\boldsymbol\theta)'\mathbf{W}\bar{\mathbf{m}}(\boldsymbol\theta)$, would be exactly zero because we can find a set of estimates for which $\bar{\mathbf{m}}(\boldsymbol\theta)$ is exactly zero. Thus, in the exactly identified case, when there are the same number of moment equations as there are parameters to estimate, the weighting matrix $\mathbf{W}$ is irrelevant to the solution. But if the parameters are overidentified by the moment equations, then these equations imply substantive restrictions. As such, if the hypothesis of the model that led to the moment equations in the first place is incorrect, at least some of the sample moment restrictions will be systematically violated. This conclusion provides the basis for a test of the overidentifying restrictions. By construction, when the optimal weighting matrix is used,

$$nq = \left[\sqrt{n}\,\bar{\mathbf{m}}(\hat{\boldsymbol\theta})\right]'\left\{\text{Est.Asy.Var}\left[\sqrt{n}\,\bar{\mathbf{m}}(\hat{\boldsymbol\theta})\right]\right\}^{-1}\left[\sqrt{n}\,\bar{\mathbf{m}}(\hat{\boldsymbol\theta})\right],$$

so $nq$ is a Wald statistic. Therefore, under the hypothesis of the model,

$$nq \xrightarrow{d} \chi^2[L - K].$$

(For the exactly identified case, there are zero degrees of freedom and $q = 0$.)

Example 18.9 Overidentifying Restrictions

In Hall's consumption model with the corollary, the two orthogonality conditions noted in Example 18.6 exactly identify the two parameters. But his analysis of the model suggests a way to test the specification. The conclusion, "No information available in time $t$ apart from the level of consumption, $c_t$, helps predict future consumption, $c_{t+1}$, in the sense of affecting the expected value of marginal utility. In particular, income or wealth in periods $t$ or earlier are irrelevant once $c_t$ is known," suggests how one might test the model. If lagged values of income ($Y_t$ might equal the ratio of current income to the previous period's income) are added to the set of instruments, then the model is now overidentified by the orthogonality conditions

$$E_t\left[\left(\beta(1 + r_{t+1})R_{t+1}^{\lambda} - 1\right)\begin{pmatrix}1\\ R_t\\ Y_{t-1}\\ Y_{t-2}\end{pmatrix}\right] = \begin{pmatrix}0\\0\\0\\0\end{pmatrix}.$$


A simple test of the overidentifying restrictions would be suggestive of the validity of the model. Rejecting the restrictions casts doubt on the original model. Hall's proposed tests to distinguish the life cycle-permanent income model from other theories of consumption involved adding two lags of income to the information set. His test is more involved than the one suggested above. Hansen and Singleton (1982) operated directly on this form of the model. Other studies, for example Campbell and Mankiw (1989) as well as Hall's, used the model's implications to formulate more conventional instrumental variable regression models.
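The overidentification statistic $nq$ is easy to compute once the two-step estimates are in hand. The sketch below uses simulated data and a linear model; the function name `gmm_j_statistic` and the White-type estimator of $\text{Asy.Var}[\sqrt{n}\,\bar{\mathbf{m}}]$ are our assumptions, not part of Hall's or Hansen and Singleton's analyses. It also confirms numerically that $q = 0$ when the model is exactly identified.

```python
import numpy as np

def gmm_j_statistic(y, X, Z):
    """Return (b, nq, df) for two-step linear GMM; nq tests the overidentifying restrictions."""
    n, K = X.shape
    L = Z.shape[1]
    XZ = X.T @ Z
    # First step with W = I.
    b1 = np.linalg.solve(XZ @ XZ.T, XZ @ (Z.T @ y))
    e1 = y - X @ b1
    Phi = (Z * (e1 ** 2)[:, None]).T @ Z / n     # assumed Est.Asy.Var[sqrt(n) m_bar]
    W = np.linalg.inv(Phi)
    # Second step with the estimated optimal weighting matrix.
    b = np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))
    m_bar = Z.T @ (y - X @ b) / n
    nq = n * m_bar @ W @ m_bar
    return b, nq, L - K

# Simulated data (hypothetical) with valid instruments.
rng = np.random.default_rng(0)
n = 2000
Zx = rng.normal(size=(n, 3))
v = rng.normal(size=n)
e = 0.5 * v + rng.normal(size=n)
x = Zx @ np.array([1.0, 0.5, -0.5]) + v
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), Zx])
y = X @ np.array([1.0, 2.0]) + e

b, nq, df = gmm_j_statistic(y, X, Z)             # L = 4, K = 2
b0, nq0, df0 = gmm_j_statistic(y, X, Z[:, :2])   # L = K = 2, exactly identified
print(f"overidentified:     nq = {nq:.4f}, df = {df}")
print(f"exactly identified: nq = {nq0:.2e}, df = {df0}")
```

The statistic in the first line would be compared with a chi-squared critical value with $L - K = 2$ degrees of freedom.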

The preceding is a specification test, not a test of parametric restrictions. However, there is a symmetry between the moment restrictions and restrictions on the parameter vector. Suppose $\boldsymbol\theta$ is subjected to $J$ restrictions (linear or nonlinear) which restrict the number of free parameters from $K$ to $K - J$. (That is, they reduce the dimensionality of the parameter space from $K$ to $K - J$.) The nature of the GMM estimation problem we have posed is not changed at all by the restrictions. The constrained problem may be stated in terms of

$$q_R = \bar{\mathbf{m}}(\boldsymbol\theta_R)'\mathbf{W}\bar{\mathbf{m}}(\boldsymbol\theta_R).$$

Note that the weighting matrix, $\mathbf{W}$, is unchanged. The precise nature of the solution method may be changed, since the restrictions mandate a constrained optimization. However, the criterion is essentially unchanged. It follows then that

$$nq_R \xrightarrow{d} \chi^2[L - (K - J)].$$

This result suggests a method of testing the restrictions, though the distribution theory is not obvious. The weighted sum of squares with the restrictions imposed, $nq_R$, must be at least as large as the weighted sum of squares obtained without the restrictions, $nq$. The difference is

$$(nq_R - nq) \xrightarrow{d} \chi^2[J]. \qquad (18\text{-}20)$$

The test is attributed to Newey and West (1987b). This provides one method of testing a set of restrictions. (The small-sample properties of this test will be the central focus of the application discussed in Section 18.5.) We now consider several alternatives.

18.4.2 GMM COUNTERPARTS TO THE WALD, LM, AND LR TESTS

Section 17.5 described a trio of testing procedures that can be applied to a hypothesis in the context of maximum likelihood estimation. To reiterate, let the hypothesis to be tested be a set of $J$ possibly nonlinear restrictions on $K$ parameters $\boldsymbol\theta$ in the form $H_0: \mathbf{r}(\boldsymbol\theta) = \mathbf{0}$. Let $\mathbf{c}_1$ be the maximum likelihood estimates of $\boldsymbol\theta$ estimated without the restrictions, and let $\mathbf{c}_0$ denote the restricted maximum likelihood estimates, that is, the estimates obtained while imposing the null hypothesis. The three statistics, which are asymptotically equivalent, are obtained as follows:

$$LR = \text{likelihood ratio} = -2(\ln L_0 - \ln L_1),$$

where $\ln L_j$ = log likelihood function evaluated at $\mathbf{c}_j$, $j = 0, 1$.


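The likelihood ratio statistic requires that both estimates be computed. The Wald statistic is

$$W = \text{Wald} = [\mathbf{r}(\mathbf{c}_1)]'\left\{\text{Est.Asy.Var}[\mathbf{r}(\mathbf{c}_1)]\right\}^{-1}[\mathbf{r}(\mathbf{c}_1)]. \qquad (18\text{-}21)$$

The Wald statistic is the distance measure for the degree to which the unrestricted estimator fails to satisfy the restrictions. The usual estimator for the asymptotic covariance matrix would be

$$\text{Est.Asy.Var}[\mathbf{r}(\mathbf{c}_1)] = \mathbf{A}_1\left\{\text{Est.Asy.Var}[\mathbf{c}_1]\right\}\mathbf{A}_1', \qquad (18\text{-}22)$$

where $\mathbf{A}_1 = \partial\mathbf{r}(\mathbf{c}_1)/\partial\mathbf{c}_1'$ ($\mathbf{A}_1$ is a $J \times K$ matrix).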

The Wald statistic can be computed using only the unrestricted estimate. The LM statistic is

$$LM = \text{Lagrange multiplier} = \mathbf{g}_1(\mathbf{c}_0)'\left\{\text{Est.Asy.Var}[\mathbf{g}_1(\mathbf{c}_0)]\right\}^{-1}\mathbf{g}_1(\mathbf{c}_0), \qquad (18\text{-}23)$$

where $\mathbf{g}_1(\mathbf{c}_0) = \partial\ln L_1(\mathbf{c}_0)/\partial\mathbf{c}_0$, that is, the first derivatives of the unconstrained log-likelihood computed at the restricted estimates. The term $\text{Est.Asy.Var}[\mathbf{g}_1(\mathbf{c}_0)]$ is the inverse of any of the usual estimators of the asymptotic covariance matrix of the maximum likelihood estimators of the parameters, computed using the restricted estimates. The most convenient choice is usually the BHHH estimator. The LM statistic is based on the restricted estimates.

Newey and West (1987b) have devised counterparts to these test statistics for the GMM estimator. The Wald statistic is computed identically, using the results of GMM estimation rather than maximum likelihood.[10] That is, in (18-21), we would use the unrestricted GMM estimator of $\boldsymbol\theta$. The appropriate asymptotic covariance matrix is (18-12). The computation is exactly the same. The counterpart to the LR statistic is the difference in the values of $nq$ in (18-20). It is necessary to use the same weighting matrix, $\mathbf{W}$, in both restricted and unrestricted estimators. Since the unrestricted estimator is consistent under both $H_0$ and $H_1$, a consistent, unrestricted estimator of $\boldsymbol\theta$ is used to compute $\mathbf{W}$. Label this $\hat{\boldsymbol\Phi}_1^{-1} = \left\{\text{Asy.Var}[\sqrt{n}\,\bar{\mathbf{m}}_1(\mathbf{c}_1)]\right\}^{-1}$. In each occurrence, the subscript 1 indicates reference to the unrestricted estimator. Then $q$ is minimized without restrictions to obtain $q_1$ and then subject to the restrictions to obtain $q_0$. The statistic is then $(nq_0 - nq_1)$.[11] Since we are using the same $\mathbf{W}$ in both cases, this statistic is necessarily nonnegative. (This is the statistic discussed in Section 18.4.1.) Finally, the counterpart to the LM statistic would be

$$LM_{GMM} = n\left[\bar{\mathbf{m}}_1(\mathbf{c}_0)'\hat{\boldsymbol\Phi}_1^{-1}\bar{\mathbf{G}}_1(\mathbf{c}_0)\right]\left[\bar{\mathbf{G}}_1(\mathbf{c}_0)'\hat{\boldsymbol\Phi}_1^{-1}\bar{\mathbf{G}}_1(\mathbf{c}_0)\right]^{-1}\left[\bar{\mathbf{G}}_1(\mathbf{c}_0)'\hat{\boldsymbol\Phi}_1^{-1}\bar{\mathbf{m}}_1(\mathbf{c}_0)\right].$$

[10] See Burnside and Eichenbaum (1996) for some small-sample results on this procedure. Newey and McFadden (1994) have shown the asymptotic equivalence of the three procedures.

[11] Newey and West label this test the D test.


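The logic for this LM statistic is the same as that for the MLE. The derivatives of the minimized criterion $q$ in (18-3) are

$$\mathbf{g}_1(\mathbf{c}_0) = \frac{\partial q}{\partial\mathbf{c}_0} = 2\,\bar{\mathbf{G}}_1(\mathbf{c}_0)'\hat{\boldsymbol\Phi}_1^{-1}\bar{\mathbf{m}}(\mathbf{c}_0).$$

The LM statistic, $LM_{GMM}$, is a Wald statistic for testing the hypothesis that this vector equals zero under the restrictions of the null hypothesis. From our earlier results, we would have

$$\text{Est.Asy.Var}[\mathbf{g}_1(\mathbf{c}_0)] = \frac{4}{n}\,\bar{\mathbf{G}}_1(\mathbf{c}_0)'\hat{\boldsymbol\Phi}_1^{-1}\left\{\text{Est.Asy.Var}[\sqrt{n}\,\bar{\mathbf{m}}(\mathbf{c}_0)]\right\}\hat{\boldsymbol\Phi}_1^{-1}\bar{\mathbf{G}}_1(\mathbf{c}_0).$$

The estimated asymptotic variance of $\sqrt{n}\,\bar{\mathbf{m}}(\mathbf{c}_0)$ is $\hat{\boldsymbol\Phi}_1$, so

$$\text{Est.Asy.Var}[\mathbf{g}_1(\mathbf{c}_0)] = \frac{4}{n}\,\bar{\mathbf{G}}_1(\mathbf{c}_0)'\hat{\boldsymbol\Phi}_1^{-1}\bar{\mathbf{G}}_1(\mathbf{c}_0).$$

The Wald statistic would be

$$\begin{aligned}
\text{Wald} &= \mathbf{g}_1(\mathbf{c}_0)'\left\{\text{Est.Asy.Var}[\mathbf{g}_1(\mathbf{c}_0)]\right\}^{-1}\mathbf{g}_1(\mathbf{c}_0)\\
&= n\,\bar{\mathbf{m}}_1(\mathbf{c}_0)'\hat{\boldsymbol\Phi}_1^{-1}\bar{\mathbf{G}}(\mathbf{c}_0)\left\{\bar{\mathbf{G}}(\mathbf{c}_0)'\hat{\boldsymbol\Phi}_1^{-1}\bar{\mathbf{G}}(\mathbf{c}_0)\right\}^{-1}\bar{\mathbf{G}}(\mathbf{c}_0)'\hat{\boldsymbol\Phi}_1^{-1}\bar{\mathbf{m}}_1(\mathbf{c}_0).
\end{aligned} \qquad (18\text{-}24)$$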
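To see how the three statistics relate, it may help to compute them side by side. The sketch below does so for a linear model with one linear restriction, using simulated data and a single weighting matrix held fixed throughout; all variable names are ours, and the White-type weighting matrix is an assumption. In this special case (linear moments, linear restriction, common $\mathbf{W}$) the criterion is exactly quadratic in the parameters, so the Wald, difference, and LM statistics coincide numerically and not just asymptotically.

```python
import numpy as np

# Simulated data (hypothetical). The null hypothesis, that the coefficient
# on w is zero, is true in the data generating process.
rng = np.random.default_rng(1)
n = 1000
Zx = rng.normal(size=(n, 3))
w = rng.normal(size=n)
v = rng.normal(size=n)
e = 0.5 * v + rng.normal(size=n)
x = Zx @ np.array([1.0, 0.5, -0.5]) + v
X = np.column_stack([np.ones(n), x, w])
Z = np.column_stack([np.ones(n), Zx, w])
y = 1.0 + 2.0 * x + e

def m_bar(c):
    """Sample moments (1/n) Z'(y - Xc)."""
    return Z.T @ (y - X @ c) / n

def gmm(Xmat, W):
    """Linear GMM estimator for regressors Xmat and weighting matrix W."""
    XZ = Xmat.T @ Z
    return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))

# One weighting matrix, built from a consistent unrestricted first step.
c_first = gmm(X, np.eye(Z.shape[1]))
u = y - X @ c_first
W = np.linalg.inv((Z * (u ** 2)[:, None]).T @ Z / n)

c1 = gmm(X, W)                            # unrestricted estimator
c0 = np.append(gmm(X[:, :2], W), 0.0)     # restricted: coefficient on w is 0
q1 = m_bar(c1) @ W @ m_bar(c1)
q0 = m_bar(c0) @ W @ m_bar(c0)
G = -Z.T @ X / n                          # derivatives of the moments
H = G.T @ W @ G

A = np.array([[0.0, 0.0, 1.0]])           # Jacobian of the restriction
V = np.linalg.inv(H) / n                  # Est.Asy.Var of c1
r = A @ c1
wald = float(r @ np.linalg.solve(A @ V @ A.T, r))
d_stat = n * (q0 - q1)                    # counterpart to the LR statistic
g = G.T @ W @ m_bar(c0)
lm = float(n * g @ np.linalg.solve(H, g))  # LM_GMM as in (18-24)

print(f"Wald = {wald:.6f}, D = {d_stat:.6f}, LM = {lm:.6f}")
```

With nonlinear moments or restrictions the three values would generally differ in finite samples.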

18.5 APPLICATION: GMM ESTIMATION OF A DYNAMIC PANEL DATA MODEL OF LOCAL GOVERNMENT EXPENDITURES

This example continues the analysis begun in Example 13.7. Dahlberg and Johansson (2000) estimated a model for the local government expenditure of several hundred municipalities in Sweden observed over the 9-year period $t$ = 1979 to 1987. The equation of interest is

$$S_{i,t} = \alpha_t + \sum_{j=1}^{m}\beta_j S_{i,t-j} + \sum_{j=1}^{m}\gamma_j R_{i,t-j} + \sum_{j=1}^{m}\delta_j G_{i,t-j} + f_i + \varepsilon_{it}$$

for $i = 1, \ldots, N = 265$ and $t = m + 1, \ldots, 9$. (We have changed their notation slightly to make it more convenient.) $S_{i,t}$, $R_{i,t}$, and $G_{i,t}$ are municipal spending, receipts (taxes and fees), and central government grants, respectively. Analogous equations are specified for the current values of $R_{i,t}$ and $G_{i,t}$. The appropriate lag length, $m$, is one of the features of interest to be determined by the empirical study. The model contains a municipality-specific effect, $f_i$, which is not specified as being either "fixed" or "random." In order to eliminate the individual effect, the model is converted to first differences. The resulting equation is

$$\Delta S_{i,t} = \lambda_t + \sum_{j=1}^{m}\beta_j\,\Delta S_{i,t-j} + \sum_{j=1}^{m}\gamma_j\,\Delta R_{i,t-j} + \sum_{j=1}^{m}\delta_j\,\Delta G_{i,t-j} + u_{it}$$

or

$$y_{i,t} = \mathbf{x}_{i,t}'\boldsymbol\theta + u_{i,t},$$

where $\Delta S_{i,t} = S_{i,t} - S_{i,t-1}$ and so on, and $u_{i,t} = \varepsilon_{i,t} - \varepsilon_{i,t-1}$. This removes the group effect and leaves the time effect. Since the time effect was unrestricted to begin with,


$\Delta\alpha_t = \lambda_t$ remains an unrestricted time effect, which is treated as "fixed" and modeled with a time-specific dummy variable. The maximum lag length is set at $m = 3$. With 9 years of data, this leaves usable observations from 1983 to 1987 for estimation, that is, $t = m + 2, \ldots, 9$. Similar equations were fit for $R_{i,t}$ and $G_{i,t}$. The orthogonality conditions claimed by the authors are

$$E[S_{i,s}u_{i,t}] = E[R_{i,s}u_{i,t}] = E[G_{i,s}u_{i,t}] = 0, \quad s = 1, \ldots, t - 2.$$

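The orthogonality conditions are stated in terms of the levels of the financial variables and the differences of the disturbances. The issue of this formulation as opposed to, for example, $E[\Delta S_{i,s}\,\Delta\varepsilon_{i,t}] = 0$ (which is implied) is discussed by Ahn and Schmidt (1995). As we shall see, this set of orthogonality conditions implies a total of 80 instrumental variables. The authors use only the first of the three sets listed above, which produces a total of 30. For the five observations, using the formulation developed in Section 13.6, we have the following matrix of instrumental variables for the orthogonality conditions:

$$\mathbf{Z}_i = \begin{bmatrix}
\mathbf{S}_{81\text{-}79}' & d_{83} & \mathbf{0}' & 0 & \mathbf{0}' & 0 & \mathbf{0}' & 0 & \mathbf{0}' & 0\\
\mathbf{0}' & 0 & \mathbf{S}_{82\text{-}79}' & d_{84} & \mathbf{0}' & 0 & \mathbf{0}' & 0 & \mathbf{0}' & 0\\
\mathbf{0}' & 0 & \mathbf{0}' & 0 & \mathbf{S}_{83\text{-}79}' & d_{85} & \mathbf{0}' & 0 & \mathbf{0}' & 0\\
\mathbf{0}' & 0 & \mathbf{0}' & 0 & \mathbf{0}' & 0 & \mathbf{S}_{84\text{-}79}' & d_{86} & \mathbf{0}' & 0\\
\mathbf{0}' & 0 & \mathbf{0}' & 0 & \mathbf{0}' & 0 & \mathbf{0}' & 0 & \mathbf{S}_{85\text{-}79}' & d_{87}
\end{bmatrix}
\begin{matrix}1983\\1984\\1985\\1986\\1987\end{matrix}$$

where the notation $\mathbf{E}_{t_1\text{-}t_0}$ indicates the range of years for that variable. For example, $\mathbf{S}_{83\text{-}79}$ denotes $[S_{i,1983}, S_{i,1982}, S_{i,1981}, S_{i,1980}, S_{i,1979}]'$ and $d_{year}$ denotes the year-specific dummy variable. Counting columns in $\mathbf{Z}_i$, we see that using only the lagged values of the dependent variable and the time dummy variables, we have $(3 + 1) + (4 + 1) + (5 + 1) + (6 + 1) + (7 + 1) = 30$ instrumental variables. Using the lagged values of the other two variables in each equation would add 50 more, for a total of 80 if all the orthogonality conditions suggested above were employed. Given the construction above, the orthogonality conditions are now

$$E[\mathbf{Z}_i'\mathbf{u}_i] = \mathbf{0},$$

where $\mathbf{u}_i = [u_{i,1987}, u_{i,1986}, u_{i,1985}, u_{i,1984}, u_{i,1983}]'$. The empirical moment equation is

$$\text{plim}\left[\frac{1}{n}\sum_{i=1}^{N}\mathbf{Z}_i'\mathbf{u}_i\right] = \text{plim}\;\bar{\mathbf{m}}(\boldsymbol\theta) = \mathbf{0}.$$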
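The column count is easy to verify by constructing $\mathbf{Z}_i$ for a single municipality. The following sketch uses made-up spending levels and a hypothetical helper, `instrument_matrix`; it reproduces only the block-diagonal layout described above, not the authors' data handling.

```python
import numpy as np

def instrument_matrix(S, first_year=1979, est_years=range(1983, 1988)):
    """Block-diagonal Z_i: the row for year t holds S_{t-2}, ..., S_{first_year} and a year dummy."""
    blocks = []
    for t in est_years:
        lags = [S[year - first_year] for year in range(t - 2, first_year - 1, -1)]
        blocks.append(lags + [1.0])          # trailing 1 is the year dummy d_t
    Z = np.zeros((len(blocks), sum(len(b) for b in blocks)))
    col = 0
    for row, b in enumerate(blocks):
        Z[row, col:col + len(b)] = b
        col += len(b)
    return Z

S_i = np.arange(1.0, 10.0)    # hypothetical spending levels for 1979, ..., 1987
Z_i = instrument_matrix(S_i)
print(Z_i.shape)              # 5 rows (1983 to 1987), 30 instrument columns
print(Z_i[0, :4])             # S_1981, S_1980, S_1979, d_83
```

Appending the corresponding lagged levels of $R$ and $G$ to each block would add $2 \times 25 = 50$ columns, giving the 80 instruments mentioned in the text.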

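The parameters are vastly overidentified. Using only the lagged values of the dependent variable in each of the three equations estimated, there are 30 moment conditions and 14 parameters being estimated when $m = 3$, 11 when $m = 2$, 8 when $m = 1$, and 5 when $m = 0$. (As we do our estimation of each of these, we will retain the same matrix of instrumental variables in each case.) GMM estimation proceeds in two steps. In the first step, basic, unweighted instrumental variables is computed using

$$\hat{\boldsymbol\theta}_{IV} = \left[\left(\sum_{i=1}^{N}\mathbf{X}_i'\mathbf{Z}_i\right)\left(\sum_{i=1}^{N}\mathbf{Z}_i'\mathbf{Z}_i\right)^{-1}\left(\sum_{i=1}^{N}\mathbf{Z}_i'\mathbf{X}_i\right)\right]^{-1}\left(\sum_{i=1}^{N}\mathbf{X}_i'\mathbf{Z}_i\right)\left(\sum_{i=1}^{N}\mathbf{Z}_i'\mathbf{Z}_i\right)^{-1}\left(\sum_{i=1}^{N}\mathbf{Z}_i'\mathbf{y}_i\right)$$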


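where

$$\mathbf{y}_i = \begin{bmatrix}\Delta S_{83}\\ \Delta S_{84}\\ \Delta S_{85}\\ \Delta S_{86}\\ \Delta S_{87}\end{bmatrix}
\quad\text{and}\quad
\mathbf{X}_i = \begin{bmatrix}
\Delta S_{82} & \Delta S_{81} & \Delta S_{80} & \Delta R_{82} & \Delta R_{81} & \Delta R_{80} & \Delta G_{82} & \Delta G_{81} & \Delta G_{80} & 1 & 0 & 0 & 0 & 0\\
\Delta S_{83} & \Delta S_{82} & \Delta S_{81} & \Delta R_{83} & \Delta R_{82} & \Delta R_{81} & \Delta G_{83} & \Delta G_{82} & \Delta G_{81} & 0 & 1 & 0 & 0 & 0\\
\Delta S_{84} & \Delta S_{83} & \Delta S_{82} & \Delta R_{84} & \Delta R_{83} & \Delta R_{82} & \Delta G_{84} & \Delta G_{83} & \Delta G_{82} & 0 & 0 & 1 & 0 & 0\\
\Delta S_{85} & \Delta S_{84} & \Delta S_{83} & \Delta R_{85} & \Delta R_{84} & \Delta R_{83} & \Delta G_{85} & \Delta G_{84} & \Delta G_{83} & 0 & 0 & 0 & 1 & 0\\
\Delta S_{86} & \Delta S_{85} & \Delta S_{84} & \Delta R_{86} & \Delta R_{85} & \Delta R_{84} & \Delta G_{86} & \Delta G_{85} & \Delta G_{84} & 0 & 0 & 0 & 0 & 1
\end{bmatrix}.$$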

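The second step begins with the computation of the new weighting matrix,

$$\hat{\boldsymbol\Phi} = \text{Est.Asy.Var}[\sqrt{N}\,\bar{\mathbf{m}}] = \frac{1}{N}\sum_{i=1}^{N}\mathbf{Z}_i'\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'\mathbf{Z}_i.$$

After multiplying and dividing by the implicit $(1/N)$ in the outside matrices, we obtain the estimator,

$$\begin{aligned}
\hat{\boldsymbol\theta}_{GMM} &= \left[\left(\sum_{i=1}^{N}\mathbf{X}_i'\mathbf{Z}_i\right)\left(\sum_{i=1}^{N}\mathbf{Z}_i'\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'\mathbf{Z}_i\right)^{-1}\left(\sum_{i=1}^{N}\mathbf{Z}_i'\mathbf{X}_i\right)\right]^{-1}\left(\sum_{i=1}^{N}\mathbf{X}_i'\mathbf{Z}_i\right)\left(\sum_{i=1}^{N}\mathbf{Z}_i'\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'\mathbf{Z}_i\right)^{-1}\left(\sum_{i=1}^{N}\mathbf{Z}_i'\mathbf{y}_i\right)\\
&= \left[\left(\sum_{i=1}^{N}\mathbf{X}_i'\mathbf{Z}_i\right)\mathbf{W}\left(\sum_{i=1}^{N}\mathbf{Z}_i'\mathbf{X}_i\right)\right]^{-1}\left(\sum_{i=1}^{N}\mathbf{X}_i'\mathbf{Z}_i\right)\mathbf{W}\left(\sum_{i=1}^{N}\mathbf{Z}_i'\mathbf{y}_i\right).
\end{aligned}$$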

The estimator of the asymptotic covariance matrix for the estimator is the inverse matrix in square brackets in the first line of the result. The primary focus of interest in the study was not the estimator itself, but the lag length and whether certain lagged values of the independent variables appeared in each equation. These restrictions would be tested by using the GMM criterion function, which in this formulation would be (based on recomputing the residuals after GMM estimation)

$$q = \left(\sum_{i=1}^{n}\hat{\mathbf{u}}_i'\mathbf{Z}_i\right)\mathbf{W}\left(\sum_{i=1}^{n}\mathbf{Z}_i'\hat{\mathbf{u}}_i\right).$$

Note that the weighting matrix is not (necessarily) recomputed. For purposes of testing hypotheses, the same weighting matrix should be used. At this point, we will consider the appropriate lag length, $m$. The specification can be reduced simply by redefining $\mathbf{X}$ to change the lag length. In order to test the specification, the weighting matrix must be kept constant for all restricted versions ($m = 2$ and $m = 1$) of the model. The Dahlberg and Johansson data may be downloaded from the Journal of Applied Econometrics website; see Appendix Table F18.1. The authors provide the summary statistics for the raw data that are given in Table 18.2. The data used in the study


TABLE 18.2 Descriptive Statistics for Local Expenditure Data

Variable     Mean        Std. Deviation    Minimum     Maximum
Spending     18478.51    3174.36           12225.68    33883.25
Revenues     13422.56    3004.16            6228.54    29141.62
Grants        5236.03    1260.97            1570.64    12589.14

TABLE 18.3 Estimated Spending Equation

Variable            Estimate       Standard Error    t Ratio
Year 1983           0.0036578      0.0002969         12.32
Year 1984           0.00049670     0.0004128          1.20
Year 1985           0.00038085     0.0003094          1.23
Year 1986           0.00031469     0.0003282          0.96
Year 1987           0.00086878     0.0001480          5.87
Spending (t - 1)    1.15493        0.34409            3.36
Revenues (t - 1)    1.23801        0.36171            3.42
Grants (t - 1)      0.016310       0.82419            0.02
Spending (t - 2)    0.0376625      0.22676            0.17
Revenues (t - 2)    0.0770075      0.27179            0.28
Grants (t - 2)      1.55379        0.75841            2.05
Spending (t - 3)    0.56441        0.21796            2.59
Revenues (t - 3)    0.64978        0.26930            2.41
Grants (t - 3)      1.78918        0.69297            2.58

and provided in the internet source are nominal values in Swedish Kroner, deflated by a municipality-specific price index, then converted to per capita values. Descriptive statistics for the raw and transformed data appear in Table 18.2.[12] Equations were estimated for all three variables, with maximum lag lengths of $m$ = 1, 2, and 3. (The authors did not provide the actual estimates.) Estimation is done using the methods developed by Ahn and Schmidt (1995), Arellano and Bover (1995), and Holtz-Eakin, Newey, and Rosen (1988), as described above. The estimates of the first specification given above are given in Table 18.3. Table 18.4 contains estimates of the model parameters for each of the three equations, and for the three lag lengths, as well as the value of the GMM criterion function for each model estimated. The base case for each model has $m = 3$. There are three restrictions implied by each reduction in the lag length. The critical chi-squared value for three degrees of freedom is 7.81 for 95 percent significance, so at this level, we find that the two-lag model is just barely accepted for the spending equation, but clearly appropriate for the other two; for the spending equation, the difference between the two criteria is 7.62. Conditioned on $m = 2$, only the revenue model rejects the restriction of $m = 1$. As a final test, we might ask whether the data suggest that perhaps no lag structure at all is necessary. The GMM criterion values for the three equations with only the time dummy variables are 45.840, 57.908, and 62.042, respectively. Therefore, all three zero-lag models are rejected.

[12] The data provided on the website and used in our computations were further transformed by dividing by 100,000.


TABLE 18.4 Estimated Lag Equations for Spending, Revenue, and Grants

            Expenditure Model             Revenue Model                 Grant Model
            m = 3     m = 2     m = 1     m = 3     m = 2     m = 1     m = 3     m = 2     m = 1
S(t-1)      1.155     0.8742    0.5562    0.1715    0.3117    0.1242    0.1675    0.1461    0.1958
S(t-2)      0.0377    0.2493              0.1621    0.0773              0.0303    0.0304
S(t-3)      0.5644                        0.1772                        0.0955
R(t-1)      1.2380    0.8745    0.5328    0.0176    0.1863    0.0245    0.1578    0.1453    0.2343
R(t-2)      0.0770    0.2776              0.0309    0.1368              0.0485    0.0175
R(t-3)      0.6497                        0.0034                        0.0319
G(t-1)      0.0163    0.4203    0.1275    0.3683    0.5425    0.0808    0.2381    0.2066    0.0559
G(t-2)      1.5538    0.1866              2.7152    2.4621              0.0492    0.0804
G(t-3)      1.7892                        0.0948                        0.0598
q           22.8287   30.4526   34.4986   30.5398   34.2590   53.2506   17.5810   20.5416   27.5927
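The lag-length comparisons discussed above use nothing more than differences between the criterion values in the last row of Table 18.4. The short sketch below repeats that arithmetic, taking the values as we read them from the table and the 95 percent critical value of 7.81 quoted in the text; the dictionary layout and function name are ours.

```python
# GMM criterion values reported in Table 18.4, keyed by lag length m.
criteria = {
    "expenditure": {3: 22.8287, 2: 30.4526, 1: 34.4986},
    "revenue":     {3: 30.5398, 2: 34.2590, 1: 53.2506},
    "grant":       {3: 17.5810, 2: 20.5416, 1: 27.5927},
}
CRITICAL_3_DF = 7.81   # chi-squared, 3 degrees of freedom, as quoted in the text

def lag_tests(q):
    """Differences in the criterion when the lag length is cut from 3 to 2 and from 2 to 1."""
    return {"3 vs 2": q[2] - q[3], "2 vs 1": q[1] - q[2]}

results = {name: lag_tests(q) for name, q in criteria.items()}
for name, tests in results.items():
    for label, stat in tests.items():
        verdict = "reject" if stat > CRITICAL_3_DF else "do not reject"
        print(f"{name:12s} {label}: {stat:7.4f} -> {verdict}")
```

Each reduction in lag length removes three coefficients, which is why every comparison uses three degrees of freedom.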

Among the interests in this study were the appropriate critical values to use for the specification test of the moment restriction. With 16 degrees of freedom, the critical chi-squared value for 95 percent significance is 26.3, which would suggest that the revenues equation is misspecified. Using a bootstrap technique, the authors find that a more appropriate critical value leaves the specification intact. Finally, note that the three-equation model in the $m = 3$ columns of Table 18.4 implies a vector autoregression of the form

$$\mathbf{y}_t = \boldsymbol\Gamma_1\mathbf{y}_{t-1} + \boldsymbol\Gamma_2\mathbf{y}_{t-2} + \boldsymbol\Gamma_3\mathbf{y}_{t-3} + \mathbf{v}_t$$

where $\mathbf{y}_t = (\Delta S_t, \Delta R_t, \Delta G_t)'$. We will explore the properties and characteristics of equation systems such as this in our discussion of time series models in Chapter 20.

18.6 SUMMARY AND CONCLUSIONS

The generalized method of moments provides an estimation framework that includes least squares, nonlinear least squares, instrumental variables, and maximum likelihood, and a general class of estimators that extends beyond these. But it is more than just a theoretical umbrella. The GMM provides a method of formulating models and implied estimators without making strong distributional assumptions. Hall's model of household consumption is a useful example that shows how the optimization conditions of an underlying economic theory produce a set of distribution-free estimating equations. In this chapter, we first examined the classical method of moments. GMM as an estimator is an extension of this strategy that allows the analyst to use additional information beyond that necessary to identify the model, in an optimal fashion. After defining and establishing the properties of the estimator, we then turned to inference procedures. It is convenient that the GMM procedure provides counterparts to the familiar trio of test statistics: Wald, LM, and LR. In the final section, we developed an example that appears at many points in the recent applied literature, the dynamic panel data model with individual-specific effects and lagged values of the dependent variable. This chapter concludes our survey of estimation techniques and methods in econometrics. In the remaining chapters of the book, we will examine a variety of applications


and modeling tools, first in time series and macroeconometrics in Chapters 19 and 20, then in discrete choice models and limited dependent variables, the staples of microeconometrics, in Chapters 21 and 22.

Key Terms and Concepts

• Analog estimation • Asymptotic properties • Central limit theorem • Central moments • Consistent estimator • Dynamic panel data model • Empirical moment equation • Ergodic theorem • Euler equation • Exactly identified • Exponential family • Generalized method of moments • Identification • Instrumental variables • LM statistic • LR statistic • Martingale difference sequence • Maximum likelihood estimator • Mean value theorem • Method of moment generating functions • Method of moments • Method of moments estimators • Minimum distance estimator • Moment equation • Newey-West estimator • Nonlinear instrumental variable estimator • Order condition • Orthogonality conditions • Overidentifying restrictions • Probability limit • Random sample • Rank condition • Robust estimation • Slutsky Theorem • Specification test statistic • Sufficient statistic • Taylor series • Uncentered moment • Wald statistic • Weighted least squares

Exercises

1. For the normal distribution, $\mu_{2k} = \sigma^{2k}(2k)!/(k!\,2^k)$ and $\mu_{2k+1} = 0$, $k = 0, 1, \ldots$. Use this result to analyze the two estimators

$$\sqrt{b_1} = \frac{m_3}{m_2^{3/2}} \quad\text{and}\quad b_2 = \frac{m_4}{m_2^2},$$

where $m_k = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^k$. The following result will be useful:

$$\text{Asy.Cov}[\sqrt{n}\,m_j, \sqrt{n}\,m_k] = \mu_{j+k} - \mu_j\mu_k + jk\,\mu_2\mu_{j-1}\mu_{k-1} - j\,\mu_{j-1}\mu_{k+1} - k\,\mu_{k-1}\mu_{j+1}.$$

Use the delta method to obtain the asymptotic variances and covariance of these two functions, assuming the data are drawn from a normal distribution with mean $\mu$ and variance $\sigma^2$. (Hint: Under the assumptions, the sample mean is a consistent estimator of $\mu$, so for purposes of deriving asymptotic results, the difference between $\bar{x}$ and $\mu$ may be ignored. As such, no generality is lost by assuming the mean is zero and proceeding from there. Obtain $\mathbf{V}$, the $3 \times 3$ covariance matrix for the three moments, then use the delta method to show that the covariance matrix for the two estimators is

$$\mathbf{J}\mathbf{V}\mathbf{J}' = \begin{bmatrix}6 & 0\\ 0 & 24\end{bmatrix},$$

where $\mathbf{J}$ is the $2 \times 3$ matrix of derivatives.)

2. Using the results in Example 18.7, estimate the asymptotic covariance matrix of the method of moments estimators of $P$ and $\lambda$ based on $m_1'$ and $m_2'$. (Note: You will need to use the data in Example C.1 to estimate $\mathbf{V}$.)


3. Exponential Families of Distributions. For each of the following distributions, determine whether it is an exponential family by examining the log-likelihood function. Then identify the sufficient statistics.
a. Normal distribution with mean $\mu$ and variance $\sigma^2$.
b. The Weibull distribution in Exercise 4 in Chapter 17.
c. The mixture distribution in Exercise 3 in Chapter 17.

4. In the classical regression model with heteroscedasticity, which is more efficient, ordinary least squares or GMM? Obtain the two estimators and their respective asymptotic covariance matrices, then prove your assertion.

5. Consider the probit model analyzed in Section 17.8. The model states that for a given vector of independent variables,

$$\text{Prob}[y_i = 1 \mid \mathbf{x}_i] = \Phi[\mathbf{x}_i'\boldsymbol\beta], \qquad \text{Prob}[y_i = 0 \mid \mathbf{x}_i] = 1 - \text{Prob}[y_i = 1 \mid \mathbf{x}_i].$$

We have considered maximum likelihood estimation of the parameters of this model at several points. Consider, instead, a GMM estimator based on the result that

$$E[y_i \mid \mathbf{x}_i] = \Phi(\mathbf{x}_i'\boldsymbol\beta).$$

This suggests that we might base estimation on the orthogonality conditions

$$E\left[\left(y_i - \Phi(\mathbf{x}_i'\boldsymbol\beta)\right)\mathbf{x}_i\right] = \mathbf{0}.$$

Construct a GMM estimator based on these results. Note that this is not the nonlinear least squares estimator. Explain: what would the orthogonality conditions be for nonlinear least squares estimation of this model?

6. Consider GMM estimation of a regression model as shown at the beginning of Example 18.8. Let $\mathbf{W}_1$ be the optimal weighting matrix based on the moment equations. Let $\mathbf{W}_2$ be some other positive definite matrix. Compare the asymptotic covariance matrices of the two proposed estimators. Show conclusively that the asymptotic covariance matrix of the estimator based on $\mathbf{W}_1$ is not larger than that based on $\mathbf{W}_2$.


19 MODELS WITH LAGGED VARIABLES
19.1 INTRODUCTION

This chapter begins our introduction to the analysis of economic time series. By most views, this field has become synonymous with empirical macroeconomics and the analysis of financial markets.[1] In this and the next chapter, we will consider a number of models and topics in which time and relationships through time play an explicit part in the formulation. Consider the dynamic regression model

$$y_t = \beta_1 + \beta_2 x_t + \beta_3 x_{t-1} + \gamma y_{t-1} + \varepsilon_t. \qquad (19\text{-}1)$$

Models of this form specifically include as right-hand-side variables earlier as well as contemporaneous values of the regressors. It is also in this context that lagged values of the dependent variable appear as a consequence of the theoretical basis of the model rather than as a computational means of removing autocorrelation. There are several reasons why lagged effects might appear in an empirical model.

• In modeling the response of economic variables to policy stimuli, it is expected that there will be possibly long lags between policy changes and their impacts. The length of lag between changes in monetary policy and its impact on important economic variables such as output and investment has been a subject of analysis for several decades.

• Either the dependent variable or one of the independent variables is based on expectations. Expectations about economic events are usually formed by aggregating new information and past experience. Thus, we might write the expectation of a future value of variable $x$, formed this period, as

$$x_t = E_t[x^*_{t+1} \mid z_t, x_{t-1}, x_{t-2}, \ldots] = g(z_t, x_{t-1}, x_{t-2}, \ldots).$$

[1] The literature in this area has grown at an impressive rate, and, more so than in any other area, it has become impossible to provide comprehensive surveys in general textbooks such as this one. Fortunately, specialized volumes have been produced that can fill this need at any level. Harvey (1990) has been in wide use for some time. Among the many other books written in the 1990s, three very useful works are Enders (1995), which presents the basics of time series analysis at an introductory level with several very detailed applications; Hamilton (1994), which gives a relatively technical but quite comprehensive survey of the field; and Lutkepohl (1993), which provides an extremely detailed treatment of the topics presented at the end of this chapter. Hamilton also surveys a number of the applications in the contemporary literature. Two references that are focused on financial econometrics are Mills (1993) and Tsay (2002). There are also a number of important references that are primarily limited to forecasting, including Diebold (1998a, 1998b) and Granger and Newbold (1996). A survey of recent research in many areas of time series analysis is Engle and McFadden (1994). An extensive, fairly advanced treatise that analyzes in great depth all the issues we touch on in this chapter is Hendry (1995). Finally, Patterson (2000) surveys most of the practical issues in time series and presents a large variety of useful and very detailed applications.




For example, forecasts of prices and income enter demand equations and consumption equations. (See Example 18.1 for an influential application.)

• Certain economic decisions are explicitly driven by a history of related activities. For example, energy demand by individuals is clearly a function not only of current prices and income, but also the accumulated stocks of energy-using capital. Even energy demand in the macroeconomy behaves in this fashion: the stock of automobiles and its attendant demand for gasoline is clearly driven by past prices of gasoline and automobiles. Other classic examples are the dynamic relationship between investment decisions and past appropriation decisions and the consumption of addictive goods such as cigarettes and theater performances.

We begin with a general discussion of models containing lagged variables. In Section 19.2, we consider some methodological issues in the specification of dynamic regressions. In Sections 19.3 and 19.4, we describe a general dynamic model that encompasses some of the extensions and more formal models for time-series data that are presented in Chapter 20. Section 19.5 takes a closer look at some of the issues in model specification. Finally, Section 19.6 considers systems of dynamic equations. These are largely extensions of the models that we examined at the end of Chapter 15. But the interpretation is rather different here. This chapter is generally not about methods of estimation. OLS and GMM estimation are usually routine in this context. Since we are examining time series data, conventional assumptions including ergodicity and stationarity will be made at the outset. In particular, in the general framework, we will assume that the multivariate stochastic process $(y_t, \mathbf{x}_t, \varepsilon_t)$ is a stationary and ergodic process. As such, without further analysis, we will invoke the theorems discussed in Chapters 5, 12, 16, and 18 that support least squares and GMM as appropriate estimation techniques in this context. In most of what follows, in fact, in practical terms, the dynamic regression model can be treated as a linear regression model and estimated by conventional methods (e.g., ordinary least squares, or instrumental variables if $\varepsilon_t$ is autocorrelated). As noted, we will generally not return to the issue of estimation and inference theory except where new results are needed, such as in the discussion of nonstationary processes.
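As a small illustration of the point that a dynamic regression can be handled with conventional tools, the sketch below simulates the model in (19-1) with serially uncorrelated disturbances and estimates it by ordinary least squares. The parameter values, sample size, and variable names are hypothetical choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5000
beta1, beta2, beta3, gamma = 1.0, 0.5, 0.3, 0.6   # hypothetical true values

x = rng.normal(size=T)
eps = rng.normal(size=T)                          # no autocorrelation
y = np.zeros(T)
for t in range(1, T):
    y[t] = beta1 + beta2 * x[t] + beta3 * x[t - 1] + gamma * y[t - 1] + eps[t]

# Regress y_t on a constant, x_t, x_{t-1}, and y_{t-1}.
# One observation is lost in forming the lags.
X = np.column_stack([np.ones(T - 1), x[1:], x[:-1], y[:-1]])
coef, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
print("OLS estimates of (beta1, beta2, beta3, gamma):", coef)
```

If the disturbances were autocorrelated, $y_{t-1}$ would be correlated with $\varepsilon_t$ and least squares would no longer be consistent, which is the case in which the text points to instrumental variables.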

19.2 DYNAMIC REGRESSION MODELS

In some settings, economic agents respond not only to current values of independent variables but to past values as well. When effects persist over time, an appropriate model will include lagged variables. Example 19.1 illustrates a familiar case.

Example 19.1 A Structural Model of the Demand for Gasoline

Drivers demand gasoline not for direct consumption but as fuel for cars to provide a source of energy for transportation. Per capita demand for gasoline in any period, $G/pop$, is determined partly by the current price, $P_g$, and per capita income, $Y/pop$, which influence how intensively the existing stock of gasoline-using "capital," $K$, is used, and partly by the size and composition of the stock of cars and other vehicles. The capital stock is determined, in turn, by income, $Y/pop$; prices of the equipment such as new and used cars, $Pnc$ and $Puc$; the price of alternative modes of transportation such as public transportation, $Ppt$; and past prices of gasoline as they influence forecasts of future gasoline prices. A structural model of


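these effects might appear as follows:

$$\begin{aligned}
\text{per capita demand:}\quad & G_t/pop_t = \alpha + \beta\,Pg_t + \delta\,Y_t/pop_t + \gamma K_t + u_t,\\
\text{stock of vehicles:}\quad & K_t = (1 - \Delta)K_{t-1} + I_t, \quad \Delta = \text{depreciation rate},\\
\text{investment in new vehicles:}\quad & I_t = \theta\,Y_t/pop_t + \phi\,E_t[Pg_{t+1}] + \lambda_1 Pnc_t + \lambda_2 Puc_t + \lambda_3 Ppt_t,\\
\text{expected price of gasoline:}\quad & E_t[Pg_{t+1}] = w_0 Pg_t + w_1 Pg_{t-1} + w_2 Pg_{t-2}.
\end{aligned}$$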

The capital stock is the sum of all past investments, so it is evident that not only current income and prices, but all past values, play a role in determining $K$. When income or the price of gasoline changes, the immediate effect will be to cause drivers to use their vehicles more or less intensively. But, over time, vehicles are added to the capital stock, and some cars are replaced with more or less efficient ones. These changes take some time, so the full impact of income and price changes will not be felt for several periods. Two episodes in recent history have shown this effect clearly. For well over a decade following the 1973 oil shock, drivers gradually replaced their large, fuel-inefficient cars with smaller, less fuel-intensive models. In the late 1990s in the United States, this process has visibly worked in reverse. As American drivers have become accustomed to steadily rising incomes and steadily falling real gasoline prices, the downsized, efficient coupes and sedans of the 1980s have yielded the highways to a tide of ever-larger six- and eight-cylinder sport utility vehicles, whose size and power can reasonably be characterized as astonishing.
19.2.1 LAGGED EFFECTS IN A DYNAMIC MODEL

The general form of a dynamic regression model is

$$ y_t = \alpha + \sum_{i=0}^{\infty} \beta_i x_{t-i} + \varepsilon_t. \tag{19-2} $$

In this model, a one-time change in x at any point in time will affect $E[y_s \mid x_t, x_{t-1}, \ldots]$ in every period thereafter. When it is believed that the duration of the lagged effects is extremely long (for example, in the analysis of monetary policy), infinite lag models that have effects that gradually fade over time are quite common. But models are often constructed in which changes in x cease to have any influence after a fairly small number of periods. We shall consider these finite lag models first.

Marginal effects in the static classical regression model are one-time events. The response of y to a change in x is assumed to be immediate and to be complete at the end of the period of measurement. In a dynamic model, the counterpart to a marginal effect is the effect of a one-time change in $x_t$ on the equilibrium of $y_t$. If the level of $x_t$ has been unchanged from, say, $\bar{x}$ for many periods prior to time t, then the equilibrium value of $E[y_t \mid x_t, x_{t-1}, \ldots]$ (assuming that it exists) will be

$$ \bar{y} = \alpha + \sum_{i=0}^{\infty} \beta_i \bar{x} = \alpha + \bar{x}\sum_{i=0}^{\infty} \beta_i, \tag{19-3} $$

where $\bar{x}$ is the permanent value of $x_t$. For this value to be finite, we require that

$$ \left|\sum_{i=0}^{\infty} \beta_i\right| < \infty. \tag{19-4} $$

Consider the effect of a unit change in x occurring in period s. To focus ideas, consider the earlier example of demand for gasoline and suppose that $x_t$ is the unit price. Prior to the oil shock, demand had reached an equilibrium consistent with accumulated habits, experience with stable real prices, and the accumulated stocks of vehicles. Now suppose that the price of gasoline, Pg, rises permanently from $\overline{Pg}$ to $\overline{Pg} + 1$ in period s. The path to the new equilibrium might appear as shown in Figure 19.1. The short-run effect is the one that occurs in the same period as the change in x. This effect is $\beta_0$ in the figure.

FIGURE 19.1 Lagged Adjustment. (Demand plotted against time, moving from the old equilibrium D0 to the new equilibrium D1 through the period-by-period effects $\beta_0, \beta_1, \beta_2, \beta_3, \ldots$)

DEFINITION 19.1 Impact Multiplier
$\beta_0$ = impact multiplier = short-run multiplier.

DEFINITION 19.2 Cumulated Effect
The accumulated effect $\tau$ periods later of an impulse at time t is $\sum_{i=0}^{\tau} \beta_i$.

In Figure 19.1, we see that the total effect of a price change in period t after three periods have elapsed will be $\beta_0 + \beta_1 + \beta_2 + \beta_3$. The difference between the old equilibrium D0 and the new one D1 is the sum of the individual period effects. The long-run multiplier is this total effect.


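DEFINITION 19.3 Equilibrium Multiplier
$\beta = \sum_{i=0}^{\infty} \beta_i$ = equilibrium multiplier = long-run multiplier.

Since the lag coefficients are regression coefficients, their scale is determined by the scales of the variables in the model. As such, it is often useful to define the lag weights,

$$ w_i = \frac{\beta_i}{\sum_{j=0}^{\infty} \beta_j}, \tag{19-5} $$

so that $\sum_{i=0}^{\infty} w_i = 1$, and to rewrite the model as

$$ y_t = \alpha + \beta \sum_{i=0}^{\infty} w_i x_{t-i} + \varepsilon_t. \tag{19-6} $$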

(Note the equation for the expected price in Example 19.1.) Two useful statistics, based on the lag weights, that characterize the period of adjustment to a new equilibrium are the median lag, defined as the smallest $q^*$ such that $\sum_{i=0}^{q^*} w_i \ge 0.5$, and the mean lag, $\sum_{i=0}^{\infty} i\,w_i$.²
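The multipliers and lag statistics defined above can be illustrated with a short sketch. The lag coefficients below are hypothetical values chosen only for illustration (they are not estimates from the text), and the helper name `lag_summary` is an assumption of this sketch.

```python
import numpy as np

def lag_summary(beta):
    """Summarize a finite set of lag coefficients beta_0, ..., beta_p."""
    beta = np.asarray(beta, dtype=float)
    long_run = beta.sum()                       # equilibrium (long-run) multiplier
    weights = beta / long_run                   # lag weights w_i, which sum to one
    cumulative = np.cumsum(weights)
    # Median lag: smallest q such that the cumulated weights reach 0.5.
    median_lag = int(np.argmax(cumulative >= 0.5))
    # Mean lag: sum of i * w_i.
    mean_lag = float(np.sum(np.arange(beta.size) * weights))
    return {
        "impact": float(beta[0]),               # short-run (impact) multiplier
        "long_run": float(long_run),
        "weights": weights,
        "median_lag": median_lag,
        "mean_lag": mean_lag,
    }

# Hypothetical lag coefficients, all with the same sign.
summary = lag_summary([0.4, 0.3, 0.2, 0.1])
print(summary["impact"], summary["long_run"], summary["median_lag"], summary["mean_lag"])
```

With coefficients of mixed sign the weights can fall outside the unit interval, in which case the median and mean lag lose their interpretation.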
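19.2.2 THE LAG AND DIFFERENCE OPERATORS

A convenient device for manipulating lagged variables is the lag operator,

$$ L x_t = x_{t-1}. $$

Some basic results are $La = a$ if a is a constant and $L(Lx_t) = L^2 x_t = x_{t-2}$. Thus, $L^p x_t = x_{t-p}$, $L^q(L^p x_t) = L^{p+q} x_t = x_{t-p-q}$, and $(L^p + L^q)x_t = x_{t-p} + x_{t-q}$. By convention, $L^0 x_t = 1x_t = x_t$. A related operation is the first difference,

$$ \Delta x_t = x_t - x_{t-1}. $$

Obviously, $\Delta x_t = (1 - L)x_t$ and $x_t = x_{t-1} + \Delta x_t$. These two operations can be usefully combined, for example, as in

$$ \Delta^2 x_t = (1 - L)^2 x_t = (1 - 2L + L^2)x_t = x_t - 2x_{t-1} + x_{t-2}. $$

Note that $(1 - L)^2 x_t = (1 - L)(1 - L)x_t = (1 - L)(x_t - x_{t-1}) = (x_t - x_{t-1}) - (x_{t-1} - x_{t-2})$. The dynamic regression model can be written

$$ y_t = \alpha + \sum_{i=0}^{\infty} \beta_i L^i x_t + \varepsilon_t = \alpha + B(L)x_t + \varepsilon_t, $$

where B(L) is a polynomial in L, $B(L) = \beta_0 + \beta_1 L + \beta_2 L^2 + \cdots$. A polynomial in the lag operator that reappears in many contexts is

$$ A(L) = 1 + aL + (aL)^2 + (aL)^3 + \cdots = \sum_{i=0}^{\infty} (aL)^i. $$

If $|a| < 1$, then

$$ A(L) = \frac{1}{1 - aL}. $$

A distributed lag model in the form

$$ y_t = \alpha + \beta \sum_{i=0}^{\infty} \gamma^i L^i x_t + \varepsilon_t $$

can be written

$$ y_t = \alpha + \beta(1 - \gamma L)^{-1} x_t + \varepsilon_t, $$

if $|\gamma| < 1$. This form is called the moving-average form or distributed lag form. If we multiply through by $(1 - \gamma L)$ and collect terms, then we obtain the autoregressive form,

$$ y_t = \alpha(1 - \gamma) + \beta x_t + \gamma y_{t-1} + (1 - \gamma L)\varepsilon_t. $$

In more general terms, consider the pth order autoregressive model,

$$ y_t = \alpha + \beta x_t + \gamma_1 y_{t-1} + \gamma_2 y_{t-2} + \cdots + \gamma_p y_{t-p} + \varepsilon_t, $$

which may be written

$$ C(L) y_t = \alpha + \beta x_t + \varepsilon_t, $$

where

$$ C(L) = 1 - \gamma_1 L - \gamma_2 L^2 - \cdots - \gamma_p L^p. $$

Can this equation be "inverted" so that $y_t$ is written as a function only of current and past values of $x_t$ and $\varepsilon_t$? By successively substituting the corresponding autoregressive equation for $y_{t-1}$ in that for $y_t$, then likewise for $y_{t-2}$ and so on, it would appear so. However, it is also clear that the resulting distributed lag form will have an infinite number of coefficients. Formally, the operation just described amounts to writing

$$ y_t = [C(L)]^{-1}(\alpha + \beta x_t + \varepsilon_t) = A(L)(\alpha + \beta x_t + \varepsilon_t). $$

It will be of interest to be able to solve for the elements of A(L) (see, for example, Section 19.6.6). By this arrangement, it follows that $C(L)A(L) = 1$, where

$$ A(L) = \alpha_0 L^0 + \alpha_1 L + \alpha_2 L^2 + \cdots. $$

By collecting like powers of L in

$$ (1 - \gamma_1 L - \gamma_2 L^2 - \cdots - \gamma_p L^p)(\alpha_0 L^0 + \alpha_1 L + \alpha_2 L^2 + \cdots) = 1, $$

we find that a recursive solution for the $\alpha$ coefficients is

$$
\begin{aligned}
L^0&: \quad \alpha_0 = 1,\\
L^1&: \quad \alpha_1 - \gamma_1\alpha_0 = 0,\\
L^2&: \quad \alpha_2 - \gamma_1\alpha_1 - \gamma_2\alpha_0 = 0,\\
L^3&: \quad \alpha_3 - \gamma_1\alpha_2 - \gamma_2\alpha_1 - \gamma_3\alpha_0 = 0,\\
L^4&: \quad \alpha_4 - \gamma_1\alpha_3 - \gamma_2\alpha_2 - \gamma_3\alpha_1 - \gamma_4\alpha_0 = 0,\\
&\ \ \vdots\\
L^p&: \quad \alpha_p - \gamma_1\alpha_{p-1} - \gamma_2\alpha_{p-2} - \cdots - \gamma_p\alpha_0 = 0,
\end{aligned}
\tag{19-7}
$$

and, thereafter,

$$ L^q: \quad \alpha_q - \gamma_1\alpha_{q-1} - \gamma_2\alpha_{q-2} - \cdots - \gamma_p\alpha_{q-p} = 0. $$

After a set of p − 1 starting values, the $\alpha$ coefficients obey the same difference equation as $y_t$ does in the dynamic equation. One problem remains. For the given set of values, the preceding gives no assurance that the solution for $\alpha_q$ does not ultimately explode. The equation system above is not necessarily stable for all values of $\gamma_j$ (though it certainly is for some). If the system is stable in this sense, then the polynomial C(L) is said to be invertible. The necessary conditions are precisely those discussed in Section 19.4.3, so we will defer completion of this discussion until then.

Finally, two useful results are

$$ B(1) = \beta_0 1^0 + \beta_1 1^1 + \beta_2 1^2 + \cdots = \beta = \text{long-run multiplier} $$

and

$$ B'(1) = \left[\frac{dB(L)}{dL}\right]_{L=1} = \sum_{i=0}^{\infty} i\,\beta_i. $$

It follows that $B'(1)/B(1)$ = mean lag.

² If the lag coefficients do not all have the same sign, then these results may not be meaningful. In some contexts, lag coefficients with different signs may be taken as an indication that there is a flaw in the specification of the model.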
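The operator algebra in this section maps directly onto polynomial arithmetic. The following is a minimal sketch, assuming lag polynomials are stored as coefficient arrays in increasing powers of L; the helper `apply_lag_polynomial` and the numerical values are illustrative assumptions, not part of the text.

```python
import numpy as np

# A lag polynomial c0 + c1*L + c2*L^2 + ... is stored as [c0, c1, c2, ...].
one_minus_L = np.array([1.0, -1.0])

# (1 - L)^2 = 1 - 2L + L^2, obtained by multiplying the two polynomials.
second_difference = np.convolve(one_minus_L, one_minus_L)

def apply_lag_polynomial(coefs, x):
    """Return sum_i coefs[i] * x[t-i] for every t that has a full set of lags."""
    x = np.asarray(x, dtype=float)
    return np.convolve(x, coefs, mode="valid")

# The second difference of the squares 1, 4, 9, 16, 25 is constant.
d2 = apply_lag_polynomial(second_difference, [1, 4, 9, 16, 25])

# B(1) and B'(1) for a hypothetical finite lag polynomial B(L).
b = np.array([0.5, 0.3, 0.2])
B_at_1 = b.sum()                                  # long-run multiplier
B_prime_at_1 = np.sum(np.arange(b.size) * b)      # sum of i * beta_i
mean_lag = B_prime_at_1 / B_at_1

print(second_difference, d2, mean_lag)
```

Storing the coefficients in increasing powers of L keeps the array index equal to the lag, which makes the correspondence with B(L) easy to read.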
19.2.3 SPECIFICATION SEARCH FOR THE LAG LENGTH

Various procedures have been suggested for determining the appropriate lag length in a dynamic model such as

$$ y_t = \alpha + \sum_{i=0}^{p} \beta_i x_{t-i} + \varepsilon_t. \tag{19-8} $$

One must be careful about a purely significance-based specification search. Let us suppose that there is an appropriate, "true" value of p > 0 that we seek. A simple-to-general approach to finding the right lag length would depart from a model with only the current value of the independent variable in the regression, and add deeper lags until a simple t test suggested that the last one added is statistically insignificant. The problem with such an approach is that at any level at which the number of included lagged variables is less than p, the estimator of the coefficient vector is biased and inconsistent. [See the omitted variable formula (8-4).] The asymptotic covariance matrix is biased as well, so statistical inference on this basis is unlikely to be successful. A general-to-simple approach would begin from a model that contains more than p lagged values (it is assumed that though the precise value of p is unknown, the analyst can posit a maintained value that should be larger than p). Least squares or instrumental variables regression of y on a constant and (p + d) lagged values of x consistently estimates $\theta = [\alpha, \beta_0, \beta_1, \ldots, \beta_p, 0, 0, \ldots]$.

Since models with lagged values are often used for forecasting, researchers have tended to look for measures that have produced better results for assessing "out of sample" prediction properties. The adjusted $R^2$ [see Section 3.5.1] is one possibility. Others include the Akaike (1973) information criterion, AIC(p),

$$ \mathrm{AIC}(p) = \ln\frac{e'e}{T} + \frac{2p}{T}, \tag{19-9} $$

and Schwartz's criterion, SC(p):

$$ \mathrm{SC}(p) = \mathrm{AIC}(p) + \left(\frac{p}{T}\right)(\ln T - 2). \tag{19-10} $$

(See Section 8.4.) If some maximum P is known, then p < P can be chosen to minimize AIC(p) or SC(p).³ An alternative approach, also based on a known P, is to do sequential F tests on the last P > p coefficients, stopping when the test rejects the hypothesis that the coefficients are jointly zero. Each of these approaches has its flaws and virtues. The Akaike information criterion retains a positive probability of leading to overfitting even as $T \to \infty$. In contrast, SC(p) has been seen to lead to underfitting in some finite sample cases. They do avoid, however, the inference problems of sequential estimators. The sequential F tests require successive revision of the significance level to be appropriate, but they do have a statistical underpinning.⁴
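As a rough illustration of choosing p by minimizing AIC(p) or SC(p), the sketch below fits (19-8) by least squares for p = 0, ..., P on a common sample and evaluates (19-9) and (19-10). The simulated data, the true lag length of 2, and the function name `information_criteria` are all assumptions made for this example.

```python
import numpy as np

def information_criteria(y, x, max_lag):
    """Return (p, AIC(p), SC(p)) for p = 0..max_lag, all fitted on the same sample."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    T = y.size - max_lag                      # observations left after reserving max_lag lags
    y_used = y[max_lag:]
    rows = []
    for p in range(max_lag + 1):
        # Constant plus x_t, x_{t-1}, ..., x_{t-p}.
        columns = [np.ones(T)] + [x[max_lag - i : x.size - i] for i in range(p + 1)]
        X = np.column_stack(columns)
        coef, *_ = np.linalg.lstsq(X, y_used, rcond=None)
        resid = y_used - X @ coef
        ssr = float(resid @ resid)
        aic = np.log(ssr / T) + 2.0 * p / T            # (19-9)
        sc = aic + (p / T) * (np.log(T) - 2.0)         # (19-10)
        rows.append((p, aic, sc))
    return rows

# Simulated data with a true lag length of 2 (illustrative values only).
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = 1.0 + 0.8 * x
y[1:] += 0.5 * x[:-1]
y[2:] += 0.3 * x[:-2]
y += 0.5 * rng.normal(size=300)

table = information_criteria(y, x, max_lag=6)
best_aic_lag = min(table, key=lambda row: row[1])[0]
best_sc_lag = min(table, key=lambda row: row[2])[0]
print(best_aic_lag, best_sc_lag)
```

Because SC(p) adds a larger penalty per lag than AIC(p) whenever ln T exceeds 2, the lag length it selects is never longer than the one AIC(p) selects on the same sample.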

19.3 SIMPLE DISTRIBUTED LAG MODELS

Before examining some very general specifications of the dynamic regression, we briefly consider two specific frameworks: finite lag models, which specify a particular value of the lag length p in (19-8), and an infinite lag model, which emerges from a simple model of expectations.

19.3.1 FINITE DISTRIBUTED LAG MODELS

An unrestricted finite distributed lag model would be specified as

$$ y_t = \alpha + \sum_{i=0}^{p} \beta_i x_{t-i} + \varepsilon_t. \tag{19-11} $$

We assume that $x_t$ satisfies the conditions discussed in Section 5.2. The assumption that there are no other regressors is just a convenience. We also assume that $\varepsilon_t$ is distributed with mean zero and variance $\sigma^2$. If the lag length p is known, then (19-11) is a classical regression model. Aside from questions about the properties of the independent variables, the usual estimation results apply.⁵ But the appropriate length of the lag is rarely, if ever, known, so one must undertake a specification search, with all its pitfalls. Worse yet, least squares may prove to be rather ineffective because (1) time series are sometimes fairly short, so (19-11) will consume an excessive number of degrees of freedom;⁶ (2) $\varepsilon_t$ will usually be serially correlated; and (3) multicollinearity is likely to be quite severe.

³ For further discussion and some alternative measures, see Geweke and Meese (1981), Amemiya (1985, pp. 146-147), Diebold (1998a, pp. 85-91), and Judge et al. (1985, pp. 353-355).

⁴ See Pagano and Hartley (1981) and Trivedi and Pagan (1979).

Restricted lag models, which parameterize the lag coefficients as functions of a few underlying parameters, are a practical approach to the problem of fitting a model with long lags in a relatively short time series. An example is the polynomial distributed lag (PDL) [or Almon (1965) lag, in reference to S. Almon, who first proposed the method in econometrics]. The polynomial model assumes that the true distribution of lag coefficients can be well approximated by a low-order polynomial,

$$ \beta_i = \alpha_0 + \alpha_1 i + \alpha_2 i^2 + \cdots + \alpha_q i^q, \quad i = 0, 1, \ldots, p > q. \tag{19-12} $$

After substituting (19-12) in (19-11) and collecting terms, we obtain

$$
\begin{aligned}
y_t &= \gamma + \alpha_0\left(\sum_{i=0}^{p} i^0 x_{t-i}\right) + \alpha_1\left(\sum_{i=0}^{p} i^1 x_{t-i}\right) + \cdots + \alpha_q\left(\sum_{i=0}^{p} i^q x_{t-i}\right) + \varepsilon_t\\
&= \gamma + \alpha_0 z_{0t} + \alpha_1 z_{1t} + \cdots + \alpha_q z_{qt} + \varepsilon_t.
\end{aligned}
\tag{19-13}
$$

Each $z_{jt}$ is a linear combination of the current and p lagged values of $x_t$. With the assumption of strict exogeneity of $x_t$, $\gamma$ and $(\alpha_0, \alpha_1, \ldots, \alpha_q)$ can be estimated by ordinary or generalized least squares. The parameters of the regression model, $\beta_i$, and asymptotic standard errors for the estimators can then be obtained using the delta method (see Section D.2.7).

The polynomial lag model and other tightly structured finite lag models are only infrequently used in contemporary applications. They have the virtue of simplicity, although modern software has made this quality a modest virtue. The major drawback is that they impose strong restrictions on the functional form of the model and thereby often induce autocorrelation that is essentially an artifact of the missing variables and restrictive functional form in the equation. They remain useful tools in some forecasting settings and analysis of markets, as in Example 19.3, but in recent work in macroeconomic and financial modeling, where most of this sort of analysis takes place, the availability of ample data has made restrictive specifications such as the PDL less attractive than other tools.
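The construction of the $z_{jt}$ variables in (19-13) can be sketched as follows. The polynomial coefficients, the lag length p = 4, the degree q = 2, and the function name `almon_regressors` are hypothetical choices for illustration; the check at the end confirms only the algebraic identity between the restricted and unrestricted forms.

```python
import numpy as np

def almon_regressors(x, p, q):
    """Build z_{jt} = sum_{i=0}^{p} i**j * x_{t-i}, j = 0..q, for t = p..len(x)-1."""
    x = np.asarray(x, dtype=float)
    # Column i of `lags` holds x_{t-i}.
    lags = np.column_stack([x[p - i : x.size - i] for i in range(p + 1)])
    # Entry (i, j) of `powers` is i**j, with 0**0 taken as 1.
    powers = np.vander(np.arange(p + 1), q + 1, increasing=True).astype(float)
    return lags @ powers, lags, powers

rng = np.random.default_rng(3)
x = rng.normal(size=60)
p, q = 4, 2
Z, lags, powers = almon_regressors(x, p, q)

# Hypothetical polynomial coefficients alpha_0, alpha_1, alpha_2.
alpha = np.array([0.5, 0.2, -0.05])
beta = powers @ alpha          # implied lag coefficients beta_i from (19-12)

# The distributed lag sum_i beta_i x_{t-i} equals sum_j alpha_j z_{jt}.
unrestricted_form = lags @ beta
restricted_form = Z @ alpha
print(beta)
```

Because $\beta$ is a linear function of $\alpha$ here, standard errors for the $\beta_i$ follow from the covariance matrix of the estimated $\alpha$ by the same linear map.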
⁵ The question of whether the regressors are well behaved or not becomes particularly pertinent in this setting, especially if one or more of them happen to be lagged values of the dependent variable. In what follows, we shall assume that the Grenander conditions discussed in Section 5.2.1 are met. We thus assume that the usual asymptotic results for the classical or generalized regression model will hold.

⁶ Even when the time series is long, the model may be problematic; in this instance, the assumption that the same model can be used, without structural change, through the entire time span becomes increasingly suspect the longer the time series is. See Sections 7.4 and 7.7 for analysis of this issue.

19.3.2 AN INFINITE LAG MODEL: THE GEOMETRIC LAG MODEL

There are cases in which the distributed lag models the accumulation of information. The formation of expectations is an example. In these instances, intuition suggests that the most recent past will receive the greatest weight and that the influence of past observations will fade uniformly with the passage of time. The geometric lag model is often used for these settings. The general form of the model is

$$
\begin{aligned}
y_t &= \alpha + \beta \sum_{i=0}^{\infty} (1 - \lambda)\lambda^i x_{t-i} + \varepsilon_t, \quad 0 < \lambda < 1,\\
&= \alpha + \beta B(L) x_t + \varepsilon_t,
\end{aligned}
\tag{19-14}
$$

where

$$ B(L) = (1 - \lambda)(1 + \lambda L + \lambda^2 L^2 + \lambda^3 L^3 + \cdots) = \frac{1 - \lambda}{1 - \lambda L}. $$

The lag coefficients are $\beta_i = \beta(1 - \lambda)\lambda^i$. The model incorporates infinite lags, but it assigns arbitrarily small weights to the distant past. The lag weights decline geometrically:

$$ w_i = (1 - \lambda)\lambda^i, \quad 0 \le w_i < 1. $$

The mean lag is

$$ \bar{w} = \frac{B'(1)}{B(1)} = \frac{\lambda}{1 - \lambda}. $$

The median lag is $p^*$ such that $\sum_{i=0}^{p^*} w_i = 0.5$. We can solve for $p^*$ by using the result

$$ \sum_{i=0}^{p} \lambda^i = \frac{1 - \lambda^{p+1}}{1 - \lambda}. $$

Thus, $1 - \lambda^{p^*+1} = 0.5$, so

$$ p^* = \frac{\ln 0.5}{\ln \lambda} - 1. $$

The impact multiplier is $\beta(1 - \lambda)$. The long-run multiplier is $\beta\sum_{i=0}^{\infty}(1 - \lambda)\lambda^i = \beta$. The equilibrium value of $y_t$ would be found by fixing $x_t$ at $\bar{x}$ and $\varepsilon_t$ at zero in (19-14), which produces $\bar{y} = \alpha + \beta\bar{x}$.

The geometric lag model can be motivated with an economic model of expectations. We begin with a regression in an expectations variable such as an expected future price based on information available at time t, $x^*_{t+1|t}$, and perhaps a second regressor, $w_t$,

$$ y_t = \alpha + \beta x^*_{t+1|t} + \delta w_t + \varepsilon_t, $$

and a mechanism for the formation of the expectation,

$$ x^*_{t+1|t} = \lambda x^*_{t|t-1} + (1 - \lambda)x_t = \lambda L x^*_{t+1|t} + (1 - \lambda)x_t. \tag{19-15} $$

The currently formed expectation is a weighted average of the expectation in the previous period and the most recent observation. The parameter $\lambda$ is the adjustment coefficient. If $\lambda$ equals 1, then the current datum is ignored and expectations are never revised. A value of zero characterizes a strict pragmatist who forgets the past immediately. The expectation variable can be written as

$$ x^*_{t+1|t} = \frac{1 - \lambda}{1 - \lambda L} x_t = (1 - \lambda)[x_t + \lambda x_{t-1} + \lambda^2 x_{t-2} + \cdots]. \tag{19-16} $$
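The equivalence between the recursion in (19-15) and the geometric sum in (19-16) can be checked numerically. This is a minimal sketch on simulated data; the starting expectation of zero, the value of $\lambda$, and the function name `adaptive_expectation` are assumptions made so that the finite-sample comparison is exact.

```python
import numpy as np

def adaptive_expectation(x, lam, initial):
    """x*_{t+1|t} = lam * x*_{t|t-1} + (1 - lam) * x_t, started from `initial`."""
    expectations = np.empty(len(x))
    previous = initial
    for t, value in enumerate(x):
        previous = lam * previous + (1.0 - lam) * value
        expectations[t] = previous
    return expectations

rng = np.random.default_rng(1)
x = rng.normal(size=50)
lam = 0.66

# Recursive form, (19-15), with the pre-sample expectation set to zero.
recursive = adaptive_expectation(x, lam, initial=0.0)

# Direct form, (19-16), truncated at the start of the sample:
# (1 - lam) * sum_i lam**i * x_{t-i} evaluated at the last observation.
weights = (1.0 - lam) * lam ** np.arange(x.size)
direct = float(weights @ x[::-1])

print(recursive[-1], direct)
```

With a nonzero starting expectation the two forms differ by $\lambda^{t+1}$ times that starting value, which fades geometrically as the sample lengthens.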


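Inserting (19-16) into (19-15) produces the geometric distributed lag model,

$$ y_t = \alpha + \beta(1 - \lambda)[x_t + \lambda x_{t-1} + \lambda^2 x_{t-2} + \cdots] + \delta w_t + \varepsilon_t. $$

The geometric lag model can be estimated by nonlinear least squares. Rewrite it as

$$ y_t = \alpha + \gamma z_t(\lambda) + \delta w_t + \varepsilon_t, \quad \gamma = \beta(1 - \lambda). \tag{19-17} $$

The constructed variable $z_t(\lambda)$ obeys the recursion $z_t(\lambda) = x_t + \lambda z_{t-1}(\lambda)$. For the first observation, we use $z_1(\lambda) = x^*_{1|0} = x_1/(1 - \lambda)$. If the sample is moderately long, then assuming that $x_t$ was in long-run equilibrium, although it is an approximation, will not unduly affect the results. One can then scan over the range of $\lambda$ from zero to one to locate the value that minimizes the sum of squares. Once the minimum is located, an estimate of the asymptotic covariance matrix of the estimators of $(\alpha, \gamma, \delta, \lambda)$ can be found using (9-9) and Theorem 9.2. For the regression function $h_t(\text{data} \mid \alpha, \gamma, \delta, \lambda)$, $x^0_{t1} = 1$, $x^0_{t2} = z_t(\lambda)$, and $x^0_{t3} = w_t$. The derivative with respect to $\lambda$ can be computed by using the recursion $d_t(\lambda) = \partial z_t(\lambda)/\partial\lambda = z_{t-1}(\lambda) + \lambda\,\partial z_{t-1}(\lambda)/\partial\lambda$. If $z_1 = x_1/(1 - \lambda)$, then $d_1(\lambda) = z_1/(1 - \lambda)$. Then, $x^0_{t4} = \gamma d_t(\lambda)$. Finally, we estimate $\beta$ from the relationship $\beta = \gamma/(1 - \lambda)$ and use the delta method to estimate the asymptotic standard error.

For purposes of estimating long- and short-run elasticities, researchers often use a different form of the geometric lag model. The partial adjustment model describes the desired level of $y_t$,

$$ y^*_t = \alpha + \beta x_t + \delta w_t + \varepsilon_t, $$

and an adjustment equation,

$$ y_t - y_{t-1} = (1 - \lambda)(y^*_t - y_{t-1}). $$

If we solve the second equation for $y_t$ and insert the first expression for $y^*_t$, then we obtain

$$
\begin{aligned}
y_t &= \alpha(1 - \lambda) + \beta(1 - \lambda)x_t + \delta(1 - \lambda)w_t + \lambda y_{t-1} + (1 - \lambda)\varepsilon_t\\
&= \alpha' + \beta' x_t + \delta' w_t + \lambda y_{t-1} + \varepsilon'_t.
\end{aligned}
$$

This formulation offers a number of significant practical advantages. It is intrinsically linear in the parameters (unrestricted), and its disturbance is nonautocorrelated if $\varepsilon_t$ was to begin with. As such, the parameters of this model can be estimated consistently and efficiently by ordinary least squares. In this revised formulation, the short-run multipliers for $x_t$ and $w_t$ are $\beta'$ and $\delta'$. The long-run effects are $\beta = \beta'/(1 - \lambda)$ and $\delta = \delta'/(1 - \lambda)$. With the variables in logs, these effects are the short- and long-run elasticities.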
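The scan over $\lambda$ described for (19-17) might be sketched as below. The data are simulated, and the true parameter values, the grid, and the function names `build_z` and `grid_search` are assumptions of this example; the sketch stops at the point estimates and does not compute the asymptotic covariance matrix.

```python
import numpy as np

def build_z(x, lam):
    """z_t(lam) = x_t + lam * z_{t-1}(lam), with z_1 = x_1 / (1 - lam)."""
    z = np.empty(len(x))
    z[0] = x[0] / (1.0 - lam)
    for t in range(1, len(x)):
        z[t] = x[t] + lam * z[t - 1]
    return z

def grid_search(y, x, w, grid):
    """For each lam, regress y on (1, z_t(lam), w_t) and keep the smallest sum of squares."""
    best = None
    for lam in grid:
        X = np.column_stack([np.ones(len(y)), build_z(x, lam), w])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        ssr = float(resid @ resid)
        if best is None or ssr < best["ssr"]:
            best = {"lam": float(lam), "alpha": float(coef[0]),
                    "gamma": float(coef[1]), "delta": float(coef[2]), "ssr": ssr}
    # Recover beta from gamma = beta * (1 - lam).
    best["beta"] = best["gamma"] / (1.0 - best["lam"])
    return best

# Simulated data: alpha = 1, gamma = 2, delta = 0.5, lam = 0.6, so beta = 5.
rng = np.random.default_rng(42)
T = 400
x = rng.normal(size=T)
w = rng.normal(size=T)
y = 1.0 + 2.0 * build_z(x, 0.6) + 0.5 * w + 0.1 * rng.normal(size=T)

best = grid_search(y, x, w, np.linspace(0.01, 0.99, 99))
print(best["lam"], best["beta"])
```

A finer grid around the located minimum, or a derivative-based optimizer started there, would refine the estimate of $\lambda$.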
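Example 19.2 Expectations Augmented Phillips Curve

In Example 12.3, we estimated an expectations augmented Phillips curve of the form

$$ \Delta p_t - E[\Delta p_t \mid \Psi_{t-1}] = \beta[u_t - u^*] + \varepsilon_t. $$

This model assumes a particularly simple model of expectations, $E[\Delta p_t \mid \Psi_{t-1}] = \Delta p_{t-1}$. The least squares results for this equation were

$$ \Delta p_t - \Delta p_{t-1} = \underset{(0.7405)}{0.49189} - \underset{(0.1257)}{0.090136}\,u_t + e_t, \quad R^2 = 0.002561,\ T = 201. $$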


FIGURE 19.2 Sums of Squares for Phillips Curve Estimates. (Residual sum of squares plotted against $\lambda$ from 0 to 1.0.)

The implied estimate of the natural rate of unemployment is $-(0.49189/{-0.090136})$, or about 5.46 percent. Suppose we allow expectations to be formulated less pragmatically with the expectations model in (19-15). For this setting, this would be

$$ E[\Delta p_t \mid \Psi_{t-1}] = \lambda E[\Delta p_{t-1} \mid \Psi_{t-2}] + (1 - \lambda)\Delta p_{t-1}. $$

The strict pragmatist has $\lambda = 0.0$. Using the method set out earlier, we would compute this for different values of $\lambda$, recompute the dependent variable in the regression, and locate the value of $\lambda$ which produces the lowest sum of squares. Figure 19.2 shows the sum of squares for the values of $\lambda$ ranging from 0.0 to 1.0. The minimum value of the sum of squares occurs at $\lambda = 0.66$. The least squares regression results are

$$ \Delta p_t - \widehat{\Delta p}_{t-1} = \underset{(0.6617)}{1.69453} - \underset{(0.11125)}{0.30427}\,u_t + e_t, \quad T = 201. $$

The estimated standard errors are computed using the method described earlier for the nonlinear regression. The extra variable described in the paragraph after (19-17) accounts for the estimated $\lambda$. The estimated asymptotic covariance matrix is then computed using $(e'e/201)[W'W]^{-1}$, where $w_1 = 1$, $w_2 = u_t$, and $w_3 = \partial\widehat{\Delta p}_{t-1}/\partial\lambda$. The estimated standard error for $\lambda$ is 0.04610. Since this is highly statistically significantly different from zero (t = 14.315), we would reject the simple model. Finally, the implied estimate of the natural rate of unemployment is $-(1.69453/{-0.30427})$, or about 5.57 percent. The estimated asymptotic covariance of the slope and constant term is $-0.0720293$, so, using this value and the estimated standard errors given above and the delta method, we obtain an estimated standard error for this estimate of 0.5467. Thus, a confidence interval for the natural rate of unemployment based on these results would be (4.49, 6.64), which is in line with our prior expectations.

There are two things to note about these results. First, since the dependent variables are different, we cannot compare the $R^2$s of the models with $\lambda = 0.00$ and $\lambda = 0.66$. But the sum of squares for the two models can be compared; they are 1592.32 and 1112.89, so the second model fits far better. One of the payoffs is the much narrower confidence interval for the natural rate. The counterpart to the one given above when $\lambda = 0.00$ is (1.13, 9.79). No doubt the model could be improved still further by expanding the equation. (This is considered in the exercises.)

TABLE 19.1 Estimated Distributed Lag Models

Column groups: Unrestricted (Estimated); Expectations (Estimated, Derived); Partial Adjustment (Estimated, Derived).

Coefficient rows: Constant; Ln Pnc; Ln Puc; Ln Ppt; Trend; Ln Pg; Ln Pg[-1]; Ln Pg[-2]; Ln Pg[-3]; Ln Pg[-4]; Ln Pg[-5]; Ln income; Ln Y[-1]; Ln Y[-2]; Ln Y[-3]; Ln Y[-4]; Ln Y[-5]; Zt(price G); Zt(income); Ln G/pop[-1]; e'e; T.

Unrestricted, Estimated (Constant through Ln Y[-5], then e'e and T): 18.165, 0.190, 0.0802, 0.0754, 0.0336, 0.209, 0.133, 0.0820, 0.0026, 0.0585, 0.0455, 0.785, 0.0138, 0.696, 0.0876, 0.257, 0.779, 0.001649509, 31.

Expectations (entries in printed order): 18.080, 0.0592, 0.370, 0.116, 0.0399, 0.171, 0.113, 0.074, 0.049, 0.032, 0.021, 0.877, 0.298, 0.101, 0.034, 0.012, 0.004, 0.171, 0.877, 0.502, 2.580, 0.66, 0.0098409286, 36.

Partial Adjustment, Estimated (Constant through Trend): 5.133, 0.139, 0.126, 0.051, 0.0106.

Partial Adjustment, Derived (Constant through Trend): 14.102, 0.382, 0.346, 0.140, 0.029.

Partial Adjustment (remaining entries in printed order): 0.118, 0.118, 0.075, 0.048, 0.030, 0.019, 0.012, 0.772, 0.772, 0.491, 0.312, 0.199, 0.126, 0.080, 0.051, 0.636, 0.636, 0.01250433, 35.
Example 19.3 Price and Income Elasticities of Demand for Gasoline

We have extended the gasoline demand equation estimated in Examples 2.3, 4.4, and 7.6 to allow for dynamic effects. Table 19.1 presents estimates of three distributed lag models for gasoline consumption. The unrestricted model allows 5 years of adjustment in the price and income effects. The expectations model includes the same distributed lag ($\lambda$) on price and income but different long-run multipliers ($\beta_{Pg}$ and $\beta_I$). [Note, for this formulation, that the extra regressor used in computing the asymptotic covariance matrix is $d_t(\lambda) = \beta_{Pg} d_{price}(\lambda) + \beta_I d_{income}(\lambda)$.] Finally, the partial adjustment model implies lagged effects for all the variables in the model. To facilitate comparison, the constant and the first four slope coefficients in the partial adjustment model have been divided by the estimate of $(1 - \lambda)$. The implied long- and short-run price and income elasticities are shown in Table 19.2. The ancillary elasticities for the prices of new and used cars and for public transportation vary surprisingly widely across the models, but the price and income elasticities are quite stable.

TABLE 19.2 Estimated Elasticities

                              Short Run             Long Run
                              Price     Income      Price     Income
Unrestricted model            0.209     0.785       0.270     2.593
Expectations model            0.170     0.901       0.502     2.580
Partial adjustment model      0.118     0.772       0.324     2.118

As might be expected, the best fit to the data is provided by the unrestricted lag model. The sum of squares is far lower for this form than for the other two. A direct comparison is difficult, because the models are not nested and because they are based on different numbers of observations. As an approximation, we can compute the sum of squared residuals for the estimated distributed lag model, using only the 31 observations used to compute the unrestricted model. This sum of squares is 0.009551995087. An F statistic based on this sum of squares would be

$$ F[17 - 8,\ 31 - 17] = \frac{(0.009551995 - 0.0016495090)/9}{0.0016495090/14} = 7.4522. $$

The 95 percent critical value for this distribution is 2.646, so the restrictions of the distributed lag model would be rejected. The same computation (same degrees of freedom) for the partial adjustment model produces a sum of squares of 0.01215449 and an F of 9.68. Once again, these are only rough indicators, but they do suggest that the restrictions of the distributed lag models are inappropriate in the context of the model with five lagged values for price and income.
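The F statistic for the expectations model can be reproduced from the sums of squares quoted in the example. This is a small arithmetic sketch; the function name `f_statistic` is an assumption, and the inputs are the values stated in the text.

```python
def f_statistic(ssr_restricted, ssr_unrestricted, num_restrictions, df_unrestricted):
    """Standard F ratio built from restricted and unrestricted sums of squared residuals."""
    numerator = (ssr_restricted - ssr_unrestricted) / num_restrictions
    denominator = ssr_unrestricted / df_unrestricted
    return numerator / denominator

# Expectations model against the unrestricted model: 17 - 8 = 9 restrictions,
# 31 - 17 = 14 degrees of freedom in the unrestricted model.
f_expectations = f_statistic(0.009551995, 0.0016495090, 17 - 8, 31 - 17)
print(round(f_expectations, 2))
```

The result agrees with the reported value of about 7.45, well above the 95 percent critical value of 2.646 cited in the text.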

19.4 AUTOREGRESSIVE DISTRIBUTED LAG MODELS

Both the finite lag models and the geometric lag model impose strong, possibly incorrect restrictions on the lagged response of the dependent variable to changes in an independent variable. A very general compromise that also provides a useful platform for studying a number of interesting methodological issues is the autoregressive distributed lag (ARDL) model,

$$ y_t = \mu + \sum_{i=1}^{p} \gamma_i y_{t-i} + \sum_{j=0}^{r} \beta_j x_{t-j} + \delta w_t + \varepsilon_t, \tag{19-18} $$

in which $\varepsilon_t$ is assumed to be serially uncorrelated and homoscedastic (we will relax both these assumptions in Chapter 20). We can write this more compactly as

$$ C(L) y_t = \mu + B(L) x_t + \delta w_t + \varepsilon_t $$

by defining polynomials in the lag operator,

$$ C(L) = 1 - \gamma_1 L - \gamma_2 L^2 - \cdots - \gamma_p L^p $$

and

$$ B(L) = \beta_0 + \beta_1 L + \beta_2 L^2 + \cdots + \beta_r L^r. $$

The model in this form is denoted ARDL(p, r) to indicate the orders of the two polynomials in L. The partial adjustment model estimated in the previous section is the special case in which p equals 1 and r equals 0. A number of other special cases are also interesting, including the familiar model of autocorrelation ($p = 1$, $r = 1$, $\beta_1 = -\gamma_1\beta_0$), the classical regression model ($p = 0$, $r = 0$), and so on.


19.4.1 ESTIMATION OF THE ARDL MODEL

Save for the presence of the stochastic right-hand-side variables, the ARDL is a linear model with a classical disturbance. As such, ordinary least squares is the efficient estimator. The lagged dependent variable does present a complication, but we considered this in Section 5.4. Absent any obvious violations of the assumptions there, least squares continues to be the estimator of choice. Conventional testing procedures are, as before, asymptotically valid as well. Thus, for testing linear restrictions, the Wald statistic can be used, although the F statistic is generally preferable in finite samples because of its more conservative critical values.

One subtle complication in the model has attracted a large amount of attention in the recent literature. If C(1) = 0, then the model is actually inestimable. This fact is evident in the distributed lag form, which includes a term $\mu/C(1)$. If the equivalent condition $\sum_i \gamma_i = 1$ holds, then the stochastic difference equation is unstable and a host of other problems arise as well. This implication suggests that one might be interested in testing this specification as a hypothesis in the context of the model. This restriction might seem to be a simple linear constraint on the alternative (unrestricted) model in (19-18). Under the null hypothesis, however, the conventional test statistics do not have the familiar distributions. The formal derivation is complicated [in the extreme, see Dickey and Fuller (1979) for example], but intuition should suggest the reason. Under the null hypothesis, the difference equation is explosive, so our assumptions about well-behaved data cannot be met. Consider a simple ARDL(1, 0) example and simplify it even further with B(L) = 0. Then,

$$ y_t = \mu + \gamma y_{t-1} + \varepsilon_t. $$

If $\gamma$ equals 1, then

$$ y_t - y_{t-1} = \mu + \varepsilon_t. $$

Assuming we start the time series at time t = 1,

$$ y_t = t\mu + \sum_{s} \varepsilon_s = t\mu + v_t. $$

The conditional mean in this random walk with drift model is increasing without limit, so the unconditional mean does not exist. The conditional mean of the disturbance, $v_t$, is zero, but its conditional variance is $t\sigma^2$, which shows a peculiar type of heteroscedasticity. Consider least squares estimation of $\mu$ with $m = (\mathbf{t}'\mathbf{y})/(\mathbf{t}'\mathbf{t})$, where $\mathbf{t} = [1, 2, 3, \ldots, T]$. Then $E[m] = \mu + E[(\mathbf{t}'\mathbf{t})^{-1}(\mathbf{t}'\mathbf{v})] = \mu$, but

$$ \mathrm{Var}[m] = \frac{\sigma^2 \sum_{t=1}^{T} t^3}{\left(\sum_{t=1}^{T} t^2\right)^2} = \frac{O(T^4)}{[O(T^3)]^2} = O\left(\frac{1}{T^2}\right). $$

So, the variance of this estimator is an order of magnitude smaller than we are used to seeing in regression models. Not only is m mean square consistent, it is "superconsistent." As such, without doing a formal derivation, we conclude that there is something "unusual" about this estimator and that the "usual" testing procedures whose distributions build on the distribution of $\sqrt{T}(m - \mu)$ will not be appropriate; the variance of this normalized statistic converges to zero.
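The order of magnitude of Var[m] can be checked by evaluating the expression above for increasing T. This sketch simply computes $\sigma^2\sum t^3/(\sum t^2)^2$ with $\sigma^2 = 1$ (an assumed normalization) and scales it by $T^2$; the function name `variance_of_m` is hypothetical.

```python
def variance_of_m(T, sigma2=1.0):
    """sigma^2 * sum(t^3) / (sum(t^2))^2 for t = 1..T, the expression in the text."""
    sum_t3 = sum(t ** 3 for t in range(1, T + 1))
    sum_t2 = sum(t ** 2 for t in range(1, T + 1))
    return sigma2 * sum_t3 / sum_t2 ** 2

# T^2 * Var[m] settles toward a constant, so Var[m] itself is O(1/T^2).
scaled_by_T2 = {T: T ** 2 * variance_of_m(T) for T in (10, 100, 1000)}

# T * Var[m], the variance of sqrt(T) * (m - mu), shrinks toward zero.
scaled_by_T = {T: T * variance_of_m(T) for T in (10, 100, 1000)}

print(scaled_by_T2)
print(scaled_by_T)
```

Using the closed forms for the sums of squares and cubes, the expression reduces to $9\sigma^2/(2T + 1)^2$, so $T^2\,\mathrm{Var}[m]$ approaches $9\sigma^2/4$.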


This result does not mean that the hypothesis that $\gamma = 1$ is not testable in this model. In fact, the appropriate test statistic is the conventional one that we have computed for comparable tests before. But the appropriate critical values against which to measure those statistics are quite different. We will return to this issue in our discussion of the Dickey-Fuller test in Section 20.3.4.

19.4.2 COMPUTATION OF THE LAG WEIGHTS IN THE ARDL MODEL

The distributed lag form of the ARDL model is

$$
\begin{aligned}
y_t &= \frac{\mu}{C(L)} + \frac{B(L)}{C(L)} x_t + \frac{1}{C(L)}\delta w_t + \frac{1}{C(L)}\varepsilon_t\\
&= \frac{\mu}{1 - \gamma_1 - \cdots - \gamma_p} + \sum_{j=0}^{\infty} \alpha_j x_{t-j} + \delta\sum_{l=0}^{\infty} \theta_l w_{t-l} + \sum_{l=0}^{\infty} \theta_l \varepsilon_{t-l}.
\end{aligned}
$$

This model provides a method of approximating a very general lag structure. In Jorgenson's (1966) study, in which he labeled this model a rational lag model, he demonstrated that essentially any desired shape for the lag distribution could be produced with relatively few parameters.⁷

The lag coefficients on $x_t, x_{t-1}, \ldots$ in the ARDL model are the individual terms in the ratio of polynomials that appear in the distributed lag form. We denote these as coefficients

$$ \alpha_0, \alpha_1, \alpha_2, \ldots = \text{the coefficients on } 1, L, L^2, \ldots \text{ in } \frac{B(L)}{C(L)}. \tag{19-19} $$

A convenient way to compute these coefficients is to write (19-19) as $A(L)C(L) = B(L)$. Then we can just equate coefficients on the powers of L. Example 19.4 demonstrates the procedure.

The long-run effect in a rational lag model is $\sum_{i=0}^{\infty} \alpha_i$. This result is easy to compute since it is simply

$$ \sum_{i=0}^{\infty} \alpha_i = \frac{B(1)}{C(1)}. $$

A standard error for the long-run effect can be computed using the delta method.
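Equating coefficients on powers of L in $A(L)C(L) = B(L)$ gives a simple recursion, $\alpha_j = \beta_j + \sum_i \gamma_i\alpha_{j-i}$, with $\beta_j = 0$ beyond the order of B(L). The sketch below applies it to hypothetical ARDL(2, 1) coefficients chosen for illustration (not estimates from the text) and compares the cumulated lag coefficients with B(1)/C(1); the function name is an assumption.

```python
def rational_lag_coefficients(beta, gamma, n_terms):
    """alpha_j from A(L) C(L) = B(L), with C(L) = 1 - gamma_1 L - ... - gamma_p L^p."""
    alpha = []
    for j in range(n_terms):
        # beta_j is zero once j exceeds the order of B(L).
        value = beta[j] if j < len(beta) else 0.0
        # Add gamma_i * alpha_{j-i} for every lag i that is available.
        for i, g in enumerate(gamma, start=1):
            if j - i >= 0:
                value += g * alpha[j - i]
        alpha.append(value)
    return alpha

# Hypothetical coefficients: B(L) = 0.5 + 0.2 L, C(L) = 1 - 0.6 L - 0.1 L^2.
beta = [0.5, 0.2]
gamma = [0.6, 0.1]

alpha = rational_lag_coefficients(beta, gamma, n_terms=200)
long_run_from_sum = sum(alpha)
long_run_from_ratio = sum(beta) / (1.0 - sum(gamma))   # B(1) / C(1)

print(alpha[:4], long_run_from_sum, long_run_from_ratio)
```

The truncated sum matches B(1)/C(1) only when the autoregressive part is stable, which is the subject of the next section.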
19.4.3 STABILITY OF A DYNAMIC EQUATION

In the geometric lag model, we found that a stability condition $|\lambda| < 1$ was necessary for the model to be well behaved. Similarly, in the AR(1) model, the autocorrelation parameter $\rho$ must be restricted to $|\rho| < 1$ for the same reason. The dynamic model in (19-18) must also be restricted, but in ways that are less obvious. Consider once again the question of whether there exists an equilibrium value of $y_t$. In (19-18), suppose that $x_t$ is fixed at some value $\bar{x}$, $w_t$ is fixed at zero, and the disturbances $\varepsilon_t$ are fixed at their expectation of zero. Would $y_t$ converge to an equilibrium?
⁷ A long literature, highlighted by Griliches (1967), Dhrymes (1971), Nerlove (1972), Maddala (1977a), and Harvey (1990), describes estimation of models of this sort.


    The relevant dynamic equation is yt 1 yt 1 2 yt 2 p yt p where B 1 x If yt converges to an equilibrium then that equilibrium is y B 1 x C 1 C 1

    Stability of a dynamic equation hinges on the characteristic equation for the autoregressive part of the model The roots of the characteristic equation C z 1 1 z 2 z2 p zp 0 19 20

    must be greater than one in absolute value for the model to be stable To take a simple example the characteristic equation for the rst order models we have examined thus far is C z 1 z 0 The single root of this equation is z 1 which is greater than one in absolute value if is less than one The roots of a more general characteristic equation are the reciprocals of the characteristic roots of the matrix 1 2 3 p 1 p 1 0 0 0 0 0 1 0 0 0 C 19 21 0 0 0 0 1 0 0 0 1 0 Since the matrix is asymmetric its roots may include complex pairs The reciprocal of the complex number a bi is a M b M i where M a 2 b2 and i 2 1 We thus require that M be less than 1 The case of z 1 the unit root case is often of special interest If one of the p roots of C z 0 is 1 then it follows that i 1 i 1 This assumption would appear to be a simple hypothesis to test in the framework of the ARDL model Instead we nd the explosive case that we examined in Section 19 4 1 so the hypothesis is more complicated than it rst appears To reiterate under the null hypothesis that C 1 0 it is not possible for the standard F statistic to have a central F distribution because of the behavior of the variables in the model We will return to this case shortly The univariate autoregression yt 1 yt 1 2 yt 2 p yt p t can be augmented with the p 1 equations yt 1 yt 1 yt 2 yt 2 and so on to give a vector autoregression VAR to be considered in the next section yt Cyt 1 t


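    where $\mathbf{y}_t$ has $p$ elements, $\boldsymbol{\varepsilon}_t = (\varepsilon_t, 0, \ldots)'$, and $\boldsymbol{\mu} = (\mu, 0, 0, \ldots)'$. Since it will ultimately not be relevant to the solution, we will let $\varepsilon_t$ equal its expected value of zero. Now, by successive substitution, we obtain

    $$ \mathbf{y}_t = \boldsymbol{\mu} + \mathbf{C}\boldsymbol{\mu} + \mathbf{C}^2\boldsymbol{\mu} + \cdots, $$

    which may or may not converge. Write $\mathbf{C}$ in the spectral form $\mathbf{C} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{Q}$, where $\mathbf{Q}\mathbf{P} = \mathbf{I}$ and $\boldsymbol{\Lambda}$ is a diagonal matrix of the characteristic roots. (Note that the characteristic roots in $\boldsymbol{\Lambda}$ and the vectors in $\mathbf{P}$ and $\mathbf{Q}$ may be complex.) We then obtain

    $$ \mathbf{y}_t = \left[ \sum_{i=0}^{\infty} \mathbf{P}\boldsymbol{\Lambda}^i\mathbf{Q} \right] \boldsymbol{\mu}. \qquad (19\text{-}22) $$

    If all the roots of $\mathbf{C}$ are less than one in absolute value, then this vector will converge to the equilibrium

    $$ \mathbf{y}_\infty = (\mathbf{I} - \mathbf{C})^{-1}\boldsymbol{\mu}. $$

    Nonexplosion of the powers of the roots of $\mathbf{C}$ is equivalent to $|\lambda_p| < 1$, or $|1/\lambda_p| > 1$, which was our original requirement. Note finally that since $\boldsymbol{\mu}$ is a multiple of the first column of $\mathbf{I}_p$, it must be the case that each element in the first column of $(\mathbf{I} - \mathbf{C})^{-1}$ is the same. At equilibrium, therefore, we must have $y_t = y_{t-1} = \cdots = y_\infty$.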
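The stability check and the equilibrium calculation can be illustrated numerically. The following is a minimal sketch, assuming NumPy is available; the function names and the coefficient values (0.5, 0.3, and a constant of 1.0) are hypothetical and chosen only for illustration, not taken from the text.

```python
import numpy as np

def companion(gammas):
    """Companion matrix C of (19-21) for autoregressive coefficients gamma_1..gamma_p."""
    p = len(gammas)
    C = np.zeros((p, p))
    C[0, :] = gammas               # first row holds the autoregressive coefficients
    C[1:, :-1] = np.eye(p - 1)     # identity block that shifts the lags down one place
    return C

def is_stable(gammas):
    """True if every characteristic root of C is less than one in modulus."""
    roots = np.linalg.eigvals(companion(gammas))
    return bool(np.all(np.abs(roots) < 1.0))

def equilibrium(mu, gammas):
    """Equilibrium vector (I - C)^{-1} mu_vec, with mu_vec = (mu, 0, ..., 0)'."""
    p = len(gammas)
    mu_vec = np.zeros(p)
    mu_vec[0] = mu
    return np.linalg.solve(np.eye(p) - companion(gammas), mu_vec)

gammas = [0.5, 0.3]                # hypothetical AR coefficients
print(is_stable(gammas))           # stability of the difference equation
print(equilibrium(1.0, gammas))    # every element equals mu / C(1) = 1 / (1 - 0.5 - 0.3)
```

Working with the roots of the companion matrix rather than the roots of $C(z)$ avoids taking reciprocals: the condition becomes "all moduli below one" instead of "all moduli above one".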
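    Example 19.4  A Rational Lag Model

    Appendix Table F5.1 lists quarterly data on a number of macroeconomic variables, including consumption and disposable income, for the U.S. economy for the years 1950 to 2000, a total of 204 quarters. The model

    $$ c_t = \mu + \beta_0 y_t + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \beta_3 y_{t-3} + \gamma_1 c_{t-1} + \gamma_2 c_{t-2} + \gamma_3 c_{t-3} + \varepsilon_t $$

    is estimated using the logarithms of consumption and disposable income, denoted $c_t$ and $y_t$. Ordinary least squares estimates of the parameters of the ARDL(3,3) model are

    $$ c_t = 0.7233c_{t-1} + 0.3914c_{t-2} - 0.2337c_{t-3} + 0.5651y_t - 0.3909y_{t-1} - 0.2379y_{t-2} + 0.1902y_{t-3} + e_t. $$

    (A full set of quarterly dummy variables is omitted.) The Durbin–Watson statistic is 1.78957, so remaining autocorrelation seems unlikely to be a consideration. The lag coefficients are given by the equality

    $$ (\alpha_0 + \alpha_1 L + \alpha_2 L^2 + \cdots)(1 - \gamma_1 L - \gamma_2 L^2 - \gamma_3 L^3) = (\beta_0 + \beta_1 L + \beta_2 L^2 + \beta_3 L^3). $$

    Note that $A(L)$ is an infinite polynomial. The lag coefficients are

    $$ \begin{aligned}
    1 &: \ \alpha_0 = \beta_0 \ \text{(which will always be the case)}, \\
    L^1 &: \ -\alpha_0\gamma_1 + \alpha_1 = \beta_1 \ \text{or} \ \alpha_1 = \beta_1 + \alpha_0\gamma_1, \\
    L^2 &: \ -\alpha_0\gamma_2 - \alpha_1\gamma_1 + \alpha_2 = \beta_2 \ \text{or} \ \alpha_2 = \beta_2 + \alpha_0\gamma_2 + \alpha_1\gamma_1, \\
    L^3 &: \ -\alpha_0\gamma_3 - \alpha_1\gamma_2 - \alpha_2\gamma_1 + \alpha_3 = \beta_3 \ \text{or} \ \alpha_3 = \beta_3 + \alpha_0\gamma_3 + \alpha_1\gamma_2 + \alpha_2\gamma_1, \\
    L^4 &: \ -\alpha_1\gamma_3 - \alpha_2\gamma_2 - \alpha_3\gamma_1 + \alpha_4 = 0 \ \text{or} \ \alpha_4 = \gamma_1\alpha_3 + \gamma_2\alpha_2 + \gamma_3\alpha_1, \\
    L^j &: \ -\alpha_{j-3}\gamma_3 - \alpha_{j-2}\gamma_2 - \alpha_{j-1}\gamma_1 + \alpha_j = 0 \ \text{or} \ \alpha_j = \gamma_1\alpha_{j-1} + \gamma_2\alpha_{j-2} + \gamma_3\alpha_{j-3}, \quad j = 5, 6, \ldots,
    \end{aligned} $$

    and so on. From the fifth term onward, the series of lag coefficients follows the recursion $\alpha_j = \gamma_1\alpha_{j-1} + \gamma_2\alpha_{j-2} + \gamma_3\alpha_{j-3}$, which is the same as the autoregressive part of the ARDL model. The series of lag weights follows the same difference equation as the current and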


    TABLE 19.3  Lag Coefficients in a Rational Lag Model

    Lag               0       1       2       3       4       5       6       7
    ARDL            0.565   0.018  -0.004   0.062   0.039   0.054   0.039   0.041
    Unrestricted    0.954  -0.090  -0.063   0.100  -0.024   0.057  -0.112   0.236

    lagged values of $y_t$ after $r$ initial values, where $r$ is the order of the DL part of the ARDL model. The three characteristic roots of the $\mathbf{C}$ matrix are 0.8631, −0.5949, and 0.4551. Since all are less than one in absolute value, we conclude that the stochastic difference equation is stable. The first seven lag coefficients of the estimated ARDL model are listed in Table 19.3 with the first seven coefficients in an unrestricted lag model. The coefficients from the ARDL model only vaguely resemble those from the unrestricted model, but the erratic swings of the latter are prevented by the smooth equation from the distributed lag model. The estimated long-term effects (with standard errors in parentheses) from the two models are 1.0634 (0.00791) from the ARDL model and 1.0570 (0.002135) from the unrestricted model. Surprisingly, in view of the large and highly significant estimated coefficients, the lagged effects fall off essentially to zero after the initial impact.
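The recursion for the lag weights in Example 19.4 is easy to check numerically. The sketch below assumes NumPy and uses the estimated ARDL(3,3) coefficients with the signs as read here; the value 0.1902 for the $y_{t-3}$ coefficient in particular is an assumption of this sketch, adopted because it is the value consistent with the ARDL row of Table 19.3 and with the reported long-run effect. The function name `lag_weights` is hypothetical.

```python
import numpy as np

def lag_weights(beta, gamma, n):
    """First n coefficients of A(L) = B(L)/C(L).

    Uses alpha_j = beta_j + sum_i gamma_i * alpha_{j-i}, with beta_j = 0 once j
    exceeds the order of the distributed lag part.
    """
    alpha = np.zeros(n)
    for j in range(n):
        direct = beta[j] if j < len(beta) else 0.0
        feedback = sum(gamma[i - 1] * alpha[j - i]
                       for i in range(1, len(gamma) + 1) if j - i >= 0)
        alpha[j] = direct + feedback
    return alpha

gamma = [0.7233, 0.3914, -0.2337]           # coefficients on c_{t-1}, c_{t-2}, c_{t-3}
beta = [0.5651, -0.3909, -0.2379, 0.1902]   # coefficients on y_t, ..., y_{t-3} (last value assumed)

alpha = lag_weights(beta, gamma, 8)          # lag weights for lags 0 through 7
long_run = sum(beta) / (1.0 - sum(gamma))    # long-run effect B(1)/C(1)

# Characteristic roots of the companion matrix of the autoregressive part.
C = np.zeros((3, 3))
C[0, :] = gamma
C[1:, :-1] = np.eye(2)
root_moduli = np.sort(np.abs(np.linalg.eigvals(C)))

print(np.round(alpha, 3))
print(round(long_run, 4))
print(np.round(root_moduli, 4))
```

Because the published coefficients are rounded to four digits, the computed weights agree with the table only to about the third decimal place.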
    19.4.4 FORECASTING

    Consider, first, a one-period-ahead forecast of $y_t$ in the ARDL(p, r) model. It will be convenient to collect the terms in $\mu$, $x_t$, $w_t$, and so on in a single term,

    $$ \mu_t = \mu + \sum_{j=0}^{r} \beta_j x_{t-j} + \delta w_t. $$

    Now, the ARDL model is just

    $$ y_t = \mu_t + \gamma_1 y_{t-1} + \cdots + \gamma_p y_{t-p} + \varepsilon_t. $$

    Conditioned on the full set of information available up to time $T$ and on forecasts of the exogenous variables, the one-period-ahead forecast of $y_t$ would be

    $$ \hat{y}_{T+1|T} = \hat{\mu}_{T+1|T} + \gamma_1 y_T + \cdots + \gamma_p y_{T-p+1} + \hat{\varepsilon}_{T+1|T}. $$

    To form a prediction interval, we will be interested in the variance of the forecast error,

    $$ e_{T+1|T} = \hat{y}_{T+1|T} - y_{T+1}. $$

    This error will arise from three sources. First, in forecasting $\mu_t$, there will be two sources of error. The parameters $\mu$, $\delta$, and $\beta_0, \ldots, \beta_r$ will have been estimated, so $\hat{\mu}_{T+1|T}$ will differ from $\mu_{T+1}$ because of the sampling variation in these estimators. Second, if the exogenous variables $x_{T+1}$ and $w_{T+1}$ have been forecasted, then to the extent that these forecasts are themselves imperfect, yet another source of error to the forecast will result. Finally, although we will forecast $\varepsilon_{T+1}$ with its expectation of zero, we would not assume that the actual realization will be zero, so this step will be a third source of error. In principle, an estimate of the forecast variance, $\text{Var}[e_{T+1|T}]$, would account for all three sources of error. In practice, handling the second of these errors is largely intractable, while the first is merely extremely difficult. [See Harvey (1990) and Hamilton (1994, especially Section 11.7) for useful discussion. McCullough (1996) presents results that suggest that "intractable" may be too pessimistic.] For the moment, we will concentrate on the third source and return to the other issues briefly at the end of the section.


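    Ignoring for the moment the variation in $\hat{\mu}_{T+1|T}$ (that is, assuming that the parameters are known and the exogenous variables are forecasted perfectly), the variance of the forecast error will be simply

    $$ \text{Var}[e_{T+1|T} \mid x_{T+1}, w_{T+1}, \mu, \boldsymbol{\beta}, \delta, y_T, \ldots] = \text{Var}[\varepsilon_{T+1}] = \sigma^2, $$

    so at least within these assumptions, forming the forecast and computing the forecast variance are straightforward. Also, at this first step, given the data used for the forecast, the first part of the variance is also tractable. Let $\mathbf{z}_{T+1} = [1, x_{T+1}, x_T, \ldots, x_{T-r+1}, w_{T+1}, y_T, y_{T-1}, \ldots, y_{T-p+1}]$, and let $\hat{\boldsymbol{\theta}}$ denote the full estimated parameter vector. Then we would use

    $$ \text{Est.Var}[e_{T+1|T} \mid \mathbf{z}_{T+1}] = s^2 + \mathbf{z}_{T+1}' \{\text{Est.Asy.Var}[\hat{\boldsymbol{\theta}}]\} \mathbf{z}_{T+1}. $$

    Now, consider forecasting further out beyond the sample period:

    $$ \hat{y}_{T+2|T} = \hat{\mu}_{T+2|T} + \gamma_1 \hat{y}_{T+1|T} + \cdots + \gamma_p y_{T-p+2} + \hat{\varepsilon}_{T+2|T}. $$

    Note that for period $T + 1$, the forecasted $y_{T+1}$ is used. Making the substitution for $\hat{y}_{T+1|T}$, we have

    $$ \hat{y}_{T+2|T} = \hat{\mu}_{T+2|T} + \gamma_1(\hat{\mu}_{T+1|T} + \gamma_1 y_T + \cdots + \gamma_p y_{T-p+1} + \hat{\varepsilon}_{T+1|T}) + \cdots + \gamma_p y_{T-p+2} + \hat{\varepsilon}_{T+2|T}, $$

    and likewise for subsequent periods. Our method will be simplified considerably if we use the device we constructed in the previous section. For the first forecast period, write the forecast with the previous $p$ lagged values as

    $$ \begin{bmatrix} \hat{y}_{T+1|T} \\ y_T \\ y_{T-1} \\ \vdots \end{bmatrix} = \begin{bmatrix} \hat{\mu}_{T+1|T} \\ 0 \\ 0 \\ \vdots \end{bmatrix} + \begin{bmatrix} \gamma_1 & \gamma_2 & \cdots & \gamma_p \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \end{bmatrix} \begin{bmatrix} y_T \\ y_{T-1} \\ y_{T-2} \\ \vdots \end{bmatrix} + \begin{bmatrix} \hat{\varepsilon}_{T+1|T} \\ 0 \\ 0 \\ \vdots \end{bmatrix}. $$

    The coefficient matrix on the right-hand side is $\mathbf{C}$, which we defined in (19-21). To maintain the thread of the discussion, we will continue to use the notation $\hat{\mu}_{T+1|T}$ for the forecast of the deterministic part of the model, although for the present, we are assuming that this value, as well as $\mathbf{C}$, is known with certainty. With this modification, then, our forecast is the top element of the vector of forecasts,

    $$ \hat{\mathbf{y}}_{T+1|T} = \hat{\boldsymbol{\mu}}_{T+1|T} + \mathbf{C}\mathbf{y}_T + \hat{\boldsymbol{\varepsilon}}_{T+1|T}. $$

    Since we are assuming that everything on the right-hand side is known except the period $T + 1$ disturbance, the covariance matrix for this $p \times 1$ vector is

    $$ E[(\hat{\mathbf{y}}_{T+1|T} - \mathbf{y}_{T+1})(\hat{\mathbf{y}}_{T+1|T} - \mathbf{y}_{T+1})'] = \begin{bmatrix} \sigma^2 & 0 & \cdots \\ 0 & 0 & \cdots \\ \vdots & & \ddots \end{bmatrix}, $$

    and the forecast variance for $\hat{y}_{T+1|T}$ is just the upper left element, $\sigma^2$.

    Now, extend this notation to forecasting out to periods $T + 2$, $T + 3$, and so on:

    $$ \begin{aligned} \hat{\mathbf{y}}_{T+2|T} &= \hat{\boldsymbol{\mu}}_{T+2|T} + \mathbf{C}\hat{\mathbf{y}}_{T+1|T} + \hat{\boldsymbol{\varepsilon}}_{T+2|T} \\ &= \hat{\boldsymbol{\mu}}_{T+2|T} + \mathbf{C}\hat{\boldsymbol{\mu}}_{T+1|T} + \mathbf{C}^2\mathbf{y}_T + \hat{\boldsymbol{\varepsilon}}_{T+2|T} + \mathbf{C}\hat{\boldsymbol{\varepsilon}}_{T+1|T}. \end{aligned} $$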


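    Once again, the only unknowns are the disturbances, so the forecast variance for this two-period-ahead forecasted vector is

    $$ \text{Var}[\hat{\boldsymbol{\varepsilon}}_{T+2|T} + \mathbf{C}\hat{\boldsymbol{\varepsilon}}_{T+1|T}] = \begin{bmatrix} \sigma^2 & 0 & \cdots \\ 0 & 0 & \cdots \\ \vdots & & \ddots \end{bmatrix} + \mathbf{C} \begin{bmatrix} \sigma^2 & 0 & \cdots \\ 0 & 0 & \cdots \\ \vdots & & \ddots \end{bmatrix} \mathbf{C}'. $$

    Thus, the forecast variance for the two-step-ahead forecast is $\sigma^2[1 + \Psi(1)_{11}]$, where $\Psi(1)_{11}$ is the 1,1 element of $\Psi(1) = \mathbf{C}\mathbf{j}\mathbf{j}'\mathbf{C}'$, where $\mathbf{j}' = [1, 0, \ldots, 0]$. By extending this device to a forecast $F$ periods beyond the sample period, we obtain

    $$ \hat{\mathbf{y}}_{T+F|T} = \sum_{f=1}^{F} \mathbf{C}^{f-1}\hat{\boldsymbol{\mu}}_{T+F-(f-1)|T} + \mathbf{C}^F\mathbf{y}_T + \sum_{f=1}^{F} \mathbf{C}^{f-1}\hat{\boldsymbol{\varepsilon}}_{T+F-(f-1)|T}. \qquad (19\text{-}23) $$

    This equation shows how to compute the forecasts, which is reasonably simple. We also obtain our expression for the conditional forecast variance,

    $$ \text{Conditional Var}[\hat{y}_{T+F|T}] = \sigma^2[1 + \Psi(1)_{11} + \Psi(2)_{11} + \cdots + \Psi(F-1)_{11}], \qquad (19\text{-}24) $$

    where $\Psi(i) = \mathbf{C}^i\mathbf{j}\mathbf{j}'\mathbf{C}^{i\prime}$.

    The general form of the $F$-period-ahead forecast shows how the forecasts will behave as the forecast period extends further out beyond the sample period. If the equation is stable (that is, if all roots of the matrix $\mathbf{C}$ are less than one in absolute value), then $\mathbf{C}^F$ will converge to zero, and since the forecasted disturbances are zero, the forecast will be dominated by the sum in the first term. If we suppose, in addition, that the forecasts of the exogenous variables are just the period $T + 1$ forecasted values and not revised, then, as we found at the end of the previous section, the forecast will ultimately converge to

    $$ \lim_{F \to \infty} \hat{\mathbf{y}}_{T+F|T} \mid \hat{\boldsymbol{\mu}}_{T+1|T} = [\mathbf{I} - \mathbf{C}]^{-1}\hat{\boldsymbol{\mu}}_{T+1|T}. $$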
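The forecast recursion and the conditional forecast variance in (19-23) and (19-24) can be sketched in a few lines. This is an illustration only, assuming NumPy, known parameters, and future disturbances set to zero; the function names and the numerical values are hypothetical. The variance function takes $\mathbf{j}$ to be the first unit vector, so that $\Psi(i)_{11}$ is the square of the 1,1 element of $\mathbf{C}^i$.

```python
import numpy as np

def companion(gammas):
    """Companion matrix C of (19-21)."""
    p = len(gammas)
    C = np.zeros((p, p))
    C[0, :] = gammas
    C[1:, :-1] = np.eye(p - 1)
    return C

def forecast_path(gammas, mu_hat, y_last, F):
    """Forecasts for T+1..T+F, iterating y = mu + C y with disturbances at zero.

    mu_hat : forecasts of the deterministic part for T+1, ..., T+F
    y_last : [y_T, y_{T-1}, ..., y_{T-p+1}]
    """
    C = companion(gammas)
    state = np.asarray(y_last, dtype=float)
    path = []
    for f in range(F):
        mu_vec = np.zeros(len(gammas))
        mu_vec[0] = mu_hat[f]
        state = mu_vec + C @ state   # top element is the new forecast
        path.append(state[0])
    return np.array(path)

def conditional_forecast_variance(gammas, sigma2, F):
    """sigma^2 * [1 + Psi(1)_11 + ... + Psi(F-1)_11], with Psi(i)_11 = ((C^i)_11)^2."""
    C = companion(gammas)
    power = np.eye(len(gammas))      # C^0
    total = 0.0
    for _ in range(F):
        total += power[0, 0] ** 2
        power = power @ C
    return sigma2 * total

# Hypothetical AR(1) illustration: gamma = 0.5, mu_t = 1, y_T = 4, sigma^2 = 1.
print(forecast_path([0.5], [1.0, 1.0, 1.0], [4.0], 3))
print(conditional_forecast_variance([0.5], 1.0, 3))
```

In the AR(1) case the variance reduces to $\sigma^2(1 + \gamma^2 + \gamma^4 + \cdots)$, which gives a quick check on the matrix version.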

    To account fully for all sources of variation in the forecasts, we would have to revise the forecast variance to include the variation in the forecasts of the exogenous variables and the variation in the parameter estimates. As noted, the first of these is likely to be intractable. For the second, this revision will be extremely difficult, the more so when we also account for the matrix $\mathbf{C}$, as well as the vector $\boldsymbol{\mu}$, being built up from the estimated parameters. One consolation is that in the presence of a lagged value of the dependent variable, as $\gamma$ approaches one, the parameter variances tend to order $1/T^2$ rather than the $1/T$ we are accustomed to. With this faster convergence, the variation due to parameter estimation becomes less important. (See Section 20.3.3 for related results.) The level of difficulty in this case falls from impossible to merely extremely difficult. In principle, what is required is

    $$ \text{Est.Conditional Var}[\hat{y}_{T+F|T}] = \sigma^2[1 + \Psi(1)_{11} + \Psi(2)_{11} + \cdots + \Psi(F-1)_{11}] + \mathbf{g}'\,\text{Est.Asy.Var}[\hat{\mu}, \hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\gamma}}]\,\mathbf{g}, $$

    where

    $$ \mathbf{g} = \frac{\partial \hat{y}_{T+F}}{\partial[\hat{\mu}, \hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\gamma}}]}. $$

    [See Hamilton (1994, Appendix to Chapter 11) for formal derivation.]


    One possibility is to use the bootstrap method. For this application, bootstrapping would involve sampling new sets of disturbances from the estimated distribution of $\varepsilon_t$, and then repeatedly rebuilding the within-sample time series of observations on $y_t$ by using

    $$ \hat{y}_t = \hat{\mu}_t + \hat{\gamma}_1 y_{t-1} + \cdots + \hat{\gamma}_p y_{t-p} + e_{bt}(m), $$

    where $e_{bt}(m)$ is the estimated "bootstrapped" disturbance in period $t$ during replication $m$. The process is repeated $M$ times, with new parameter estimates and a new forecast generated in each replication. The variance of these forecasts produces the estimated forecast variance.⁸
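A bare-bones version of this bootstrap for a pure autoregression might look as follows. It is a sketch under several assumptions that the text does not pin down: NumPy is used, there are no exogenous variables, the residuals are resampled with replacement, each rebuilt series is generated recursively from the first $p$ observed values, and one resampled residual is added to each forecast so that the disturbance variance is reflected along with the parameter variation. All function names are hypothetical.

```python
import numpy as np

def fit_ar(y, p):
    """OLS coefficients (constant first) and residuals for an AR(p)."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    lags = [y[p - k:T - k] for k in range(1, p + 1)]   # column k holds y_{t-k}
    X = np.column_stack([np.ones(T - p)] + lags)
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef, y[p:] - X @ coef

def one_step_forecast(coef, y):
    """Forecast of y_{T+1} from the last p observations."""
    y = np.asarray(y, dtype=float)
    p = len(coef) - 1
    return coef[0] + coef[1:] @ y[::-1][:p]            # y_T, y_{T-1}, ...

def bootstrap_forecast_variance(y, p, M=200, seed=0):
    """Variance across M bootstrap replications of the one-step-ahead forecast."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    coef, resid = fit_ar(y, p)
    resid = resid - resid.mean()
    forecasts = np.empty(M)
    for m in range(M):
        e = rng.choice(resid, size=len(y) - p, replace=True)
        yb = y.copy()                                   # first p values kept as initial conditions
        for t in range(p, len(y)):
            yb[t] = coef[0] + coef[1:] @ yb[t - p:t][::-1] + e[t - p]
        coef_b, _ = fit_ar(yb, p)                       # new parameter estimates
        forecasts[m] = one_step_forecast(coef_b, y) + rng.choice(resid)
    return forecasts.var(ddof=1)

# Hypothetical data: an AR(1) with unit disturbance variance.
rng = np.random.default_rng(1)
y_sim = np.zeros(200)
for t in range(1, 200):
    y_sim[t] = 0.5 + 0.6 * y_sim[t - 1] + rng.standard_normal()
boot_var = bootstrap_forecast_variance(y_sim, p=1, M=300, seed=0)
print(boot_var)
```

With known parameters the one-step forecast variance would be $\sigma^2$; the bootstrap figure should be of the same order, somewhat inflated by the sampling variation in the re-estimated coefficients.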

    19.5 METHODOLOGICAL ISSUES IN THE ANALYSIS OF DYNAMIC MODELS

    19.5.1 AN ERROR CORRECTION MODEL

    Consider the ARDL(1,1) model, which has become a workhorse of the modern literature on time-series analysis. By defining the first differences $\Delta y_t = y_t - y_{t-1}$ and $\Delta x_t = x_t - x_{t-1}$, we can rearrange

    $$ y_t = \mu + \gamma_1 y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \varepsilon_t $$

    to obtain

    $$ \Delta y_t = \mu + \beta_0 \Delta x_t + (\gamma_1 - 1)(y_{t-1} - \theta x_{t-1}) + \varepsilon_t, \qquad (19\text{-}25) $$

    where $\theta = -(\beta_0 + \beta_1)/(\gamma_1 - 1)$. This form of the model is in the error correction form. In this form, we have an equilibrium relationship, $\Delta y_t = \mu + \beta_0 \Delta x_t + \varepsilon_t$, and the equilibrium error, $(\gamma_1 - 1)(y_{t-1} - \theta x_{t-1})$, which accounts for the deviation of the pair of variables from that equilibrium. The model states that the change in $y_t$ from the previous period consists of the change associated with movement with $x_t$ along the long-run equilibrium path plus a part $(\gamma_1 - 1)$ of the deviation $(y_{t-1} - \theta x_{t-1})$ from the equilibrium. With a model in logs, this relationship would be in proportional terms.

    It is useful at this juncture to jump ahead a bit (we will return to this topic in some detail in Chapter 20) and explore why the error correction form might be such a useful formulation of this simple model. Consider the logged consumption and income data plotted in Figure 19.3. It is obvious on inspection of the figure that a simple regression of the log of consumption on the log of income would suggest a highly significant relationship; in fact, the simple linear regression produces a slope of 1.0567 with a t ratio of 440.5 and an $R^2$ of 0.99896. The disturbing result of a line of literature in econometrics that begins with Granger and Newbold (1974) and continues to the present is that this seemingly obvious and powerful relationship might be entirely spurious. Equally obvious from the figure is that both $c_t$ and $y_t$ are trending variables. If, in fact, both variables unconditionally were random walks with drift of the sort that we met at the end of Section 19.4.1 (that is, $c_t = t\mu_c + v_t$, and likewise for $y_t$), then we would almost certainly observe a figure such as 19.3 and compelling regression results such as those, even if there were no relationship at all. In addition, there is ample evidence
    ⁸ Bernard and Veall (1987) give an application of this technique. See also McCullough (1996).


    FIGURE 19.3  Consumption and Income Data. [Plot of CT and YT against Quarter, 1949 to 2001; vertical axis labeled Variable, 6.6 to 9.6.]

    in the recent literature that low frequency (infrequently observed, aggregated over long periods) flow variables such as consumption and output are, indeed, often well described as random walks. In such data, the ARDL(1,1) model might appear to be entirely appropriate even if it is not. So, how is one to distinguish between the spurious regression and a genuine relationship as shown in the ARDL(1,1)? The first difference of consumption produces $\Delta c_t = \mu_c + v_t - v_{t-1}$. If the random walk proposition is indeed correct, then the spurious appearance of regression will not survive the first differencing, whereas if there is a relationship between $c_t$ and $y_t$, then it will be preserved in the error correction model. We will return to this issue in Chapter 20, when we examine the issue of integration and cointegration of economic variables.
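The spurious regression phenomenon is easy to reproduce by simulation. The sketch below assumes NumPy; the drift (0.01), innovation standard deviation (0.005), sample length (204), and seed are hypothetical choices, and the two series are generated independently, so any apparent relationship in levels is spurious by construction.

```python
import numpy as np

def r_squared(y, x):
    """R-squared from the simple regression of y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
T = 204
# Two independent random walks with drift.
c = np.cumsum(0.01 + 0.005 * rng.standard_normal(T))
y = np.cumsum(0.01 + 0.005 * rng.standard_normal(T))

r2_levels = r_squared(c, y)                    # regression in levels
r2_diffs = r_squared(np.diff(c), np.diff(y))   # regression in first differences

print(r2_levels, r2_diffs)
```

The levels regression shows a very high fit because both series trend upward, while the regression in first differences shows essentially none, which is the pattern the text describes.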
    Example 19.5  An Error Correction Model for Consumption

    The error correction model is a nonlinear regression model, although in fact it is intrinsically linear and can be deduced simply from the unrestricted form directly above it. Since the parameter $\theta$ is actually of some interest, it might be more convenient to use nonlinear least squares and fit the second form directly. (Since the model is intrinsically linear, the nonlinear least squares estimates will be identical to the derived linear least squares estimates.) The logs of consumption and income data in Appendix Table F5.1 are plotted in Figure 19.3. Not surprisingly, the two variables are drifting upward together. The estimated error correction model, with estimated standard errors in parentheses, is

    $$ c_t - c_{t-1} = \underset{(0.02899)}{-0.08533} + (\underset{(0.03029)}{0.90458} - 1)[c_{t-1} - \underset{(0.01052)}{1.06034}\,y_{t-1}] + \underset{(0.05090)}{0.58421}(y_t - y_{t-1}). $$

    The estimated equilibrium errors are shown in Figure 19.4. Note that they are all positive, but that in each period, the adjustment is in the opposite direction. Thus (according to this model), when consumption is below its equilibrium value, the adjustment is upward, as might be expected.
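That the error correction form is only a reparameterization of the ARDL(1,1) model can be confirmed with a short calculation. The following sketch uses plain Python; the function names are hypothetical, and the numbers passed to `ecm_to_ardl` are the point estimates from Example 19.5 as read here (adjustment coefficient 0.90458 − 1, $\theta$ = 1.06034, coefficient 0.58421 on the income change, constant −0.08533).

```python
def ardl11_to_ecm(mu, gamma1, beta0, beta1):
    """Map ARDL(1,1) parameters to (mu, beta0, adjustment, theta) of (19-25)."""
    adjustment = gamma1 - 1.0
    theta = -(beta0 + beta1) / (gamma1 - 1.0)
    return mu, beta0, adjustment, theta

def ecm_to_ardl(mu, beta0, adjustment, theta):
    """Inverse mapping: recover (mu, gamma1, beta0, beta1)."""
    gamma1 = 1.0 + adjustment
    beta1 = -theta * adjustment - beta0
    return mu, gamma1, beta0, beta1

def ardl11_step(mu, gamma1, beta0, beta1, y_lag, x, x_lag):
    """y_t implied by the ARDL(1,1) form with the disturbance set to zero."""
    return mu + gamma1 * y_lag + beta0 * x + beta1 * x_lag

def ecm_step(mu, beta0, adjustment, theta, y_lag, x, x_lag):
    """y_t implied by the error correction form with the disturbance set to zero."""
    return y_lag + mu + beta0 * (x - x_lag) + adjustment * (y_lag - theta * x_lag)

# ARDL(1,1) coefficients implied by the estimated error correction model.
mu_hat, gamma1_hat, beta0_hat, beta1_hat = ecm_to_ardl(-0.08533, 0.58421, 0.90458 - 1.0, 1.06034)
print(gamma1_hat, beta0_hat, beta1_hat)
```

Both step functions return the same value for any inputs, which is the sense in which the model is intrinsically linear.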


    FIGURE 19.4  Consumption–Income Equilibrium Errors. [Plot of EQERROR against Quarter, 1950 to 2002; vertical axis from .080 to .095.]

    19.5.2 AUTOCORRELATION

    The disturbance in the error correction model is assumed to be nonautocorrelated. As we saw in Chapter 12, autocorrelation in a model can be induced by misspecification. An orthodox view of the modeling process might state, in fact, that this misspecification is the only source of autocorrelation. Although admittedly a bit optimistic in its implication, this misspecification does raise an interesting methodological question. Consider once again the simplest model of autocorrelation from Chapter 12 (with a small change in notation to make it consistent with the present discussion),

    $$ y_t = \beta x_t + v_t, \qquad v_t = \rho v_{t-1} + \varepsilon_t, \qquad (19\text{-}26) $$

    where $\varepsilon_t$ is nonautocorrelated. As we found earlier, this model can be written as

    $$ y_t - \rho y_{t-1} = \beta(x_t - \rho x_{t-1}) + \varepsilon_t \qquad (19\text{-}27) $$

    or

    $$ y_t = \rho y_{t-1} + \beta x_t - \beta\rho x_{t-1} + \varepsilon_t. \qquad (19\text{-}28) $$

    This model is an ARDL(1,1) model in which $\beta_1 = -\gamma_1\beta_0$. Thus, we can view (19-28) as a restricted version of

    $$ y_t = \gamma_1 y_{t-1} + \beta_0 x_t + \beta_1 x_{t-1} + \varepsilon_t. \qquad (19\text{-}29) $$

    The crucial point here is that the (nonlinear) restriction on (19-29) is testable, so there is no compelling reason to proceed to (19-26) first without establishing that the restriction is in fact consistent with the data. The upshot is that the AR(1) disturbance model, as a general proposition, is a testable restriction on a simpler, linear model, not necessarily a structure unto itself.


    Now, let us take this argument to its logical conclusion. The AR(p) disturbance model,

    $$ v_t = \rho_1 v_{t-1} + \cdots + \rho_p v_{t-p} + \varepsilon_t, $$

    or $R(L)v_t = \varepsilon_t$, can be written in its moving average form as

    $$ v_t = \frac{\varepsilon_t}{R(L)}. $$

    [Recall, in the AR(1) model of Chapter 12 and in that chapter's notation, that $\varepsilon_t = u_t + \rho u_{t-1} + \rho^2 u_{t-2} + \cdots$.] The regression model with this AR(p) disturbance is, therefore,

    $$ y_t = \beta x_t + \frac{\varepsilon_t}{R(L)}. $$

    But consider instead the ARDL(p, p) model

    $$ C(L)y_t = \beta B(L)x_t + \varepsilon_t. $$

    These are the same model if $B(L) = C(L)$. The implication is that any model with an AR(p) disturbance can be interpreted as a nonlinearly restricted version of an ARDL(p, p) model.

    The preceding discussion is a rather orthodox view of autocorrelation. It is predicated on the AR(p) model. Researchers have found that a more involved model for the process generating $\varepsilon_t$ is sometimes called for. If the time-series structure of $\varepsilon_t$ is not autoregressive, much of the preceding analysis will become intractable. As such, there remains room for disagreement with the strong conclusions. We will turn to models whose disturbances are mixtures of autoregressive and moving-average terms, which would be beyond the reach of this apparatus, in Chapter 20.

    19.5.3 SPECIFICATION ANALYSIS

    The usual explanation of autocorrelation is serial correlation in omitted variables. The preceding discussion and our results in Chapter 12 suggest another candidate: misspecification of what would otherwise be an unrestricted ARDL model. Thus, upon finding evidence of autocorrelation on the basis of a Durbin–Watson statistic or an LM statistic, we might find that relaxing the nonlinear restrictions on the ARDL model is a preferable next step to "correcting" for the autocorrelation by imposing the restrictions and refitting the model by FGLS. Since an ARDL(p, r) model with AR disturbances, even with p = 0, is implicitly an ARDL(p + d, r + d) model, where d is usually one, the approach suggested is just to add additional lags of the dependent variable to the model. Thus, one might even ask why we would ever use the familiar FGLS procedures. [See, e.g., Mizon (1995).] The payoff is that the restrictions imposed by the FGLS procedure produce a more efficient estimator than other methods. If the restrictions are in fact appropriate, then not imposing them amounts to not using information.

    A related question now arises, apart from the issue of autocorrelation. In the context of the ARDL model, how should one do the specification search? (This question is not specific to the ARDL or even to the time-series setting.) Is it better to start with a small model and expand it until conventional fit measures indicate that additional variables are no longer improving the model, or is it better to start with a large model and pare away variables that conventional statistics suggest are superfluous? The first strategy,


    going from a simple model to a general model, is likely to be problematic, because the statistics computed for the narrower model are biased and inconsistent if the hypothesis is incorrect. Consider, for example, an LM test for autocorrelation in a model from which important variables have been omitted. The results are biased in favor of a finding of autocorrelation. The alternative approach is to proceed from a general model to a simple one. Thus, one might overfit the model and then subject it to whatever battery of tests are appropriate to produce the correct specification at the end of the procedure. In this instance, the estimates and test statistics computed from the overfit model, although inefficient, are not generally systematically biased. (We have encountered this issue at several points.)

    The latter approach is common in modern analysis, but some words of caution are needed. The procedure routinely leads to overfitting the model. A typical time-series analysis might involve specifying a model with deep lags on all the variables and then paring away the model as conventional statistics indicate. The danger is that the resulting model might have an autoregressive structure with peculiar holes in it that would be hard to justify with any theory. Thus, a model for quarterly data that includes lags of 2, 3, 6, and 9 on the dependent variable would look suspiciously like the end result of a computer-driven fishing trip and, moreover, might not survive even moderate changes in the estimation sample. [As Hendry (1995) notes, a model in which the largest and most significant lag coefficient occurs at the last lag is surely misspecified.]

    19.5.4 COMMON FACTOR RESTRICTIONS

    The preceding discussion suggests that evidence of autocorrelation in a time-series regression model might signal more than merely a need to use generalized least squares to make efficient use of the data. [See Hendry (1993).] If we find evidence of autocorrelation based, say, on the Durbin–Watson statistic or on Durbin's h statistic, then it would make sense to test the hypothesis of the AR(1) model that might normally be the next step against the alternative possibility that the model is merely misspecified. The test is suggested by (19-27) and (19-28). In general, we can formulate it as a test of

    $$ H_0: \ y_t = \mathbf{x}_t'\boldsymbol{\beta} + \rho y_{t-1} - \rho(\mathbf{x}_{t-1}'\boldsymbol{\beta}) + \varepsilon_t $$

    versus

    $$ H_1: \ y_t = \mathbf{x}_t'\boldsymbol{\beta} + \rho y_{t-1} + \mathbf{x}_{t-1}'\boldsymbol{\gamma} + \varepsilon_t. $$

    The null model is obtained from the alternative by the nonlinear restriction $\boldsymbol{\gamma} = -\rho\boldsymbol{\beta}$. Since the models are both classical regression models, the test can be carried out by referring the F statistic,

    $$ F[J, T - K_1] = \frac{(\mathbf{e}_0'\mathbf{e}_0 - \mathbf{e}_1'\mathbf{e}_1)/J}{\mathbf{e}_1'\mathbf{e}_1/(T - K_1)}, $$

    to the appropriate critical value from the F distribution. The test is only asymptotically valid because of the nonlinearity of the restricted regression and because of the lagged dependent variables in the models. There are two additional complications in this procedure. First, the unrestricted model may be unidentified because of redundant variables. For example, it will usually have two constant terms. If both $z_t$ and $z_{t-1}$ appear in the restricted equation, then $z_{t-1}$ will appear twice in the unrestricted model, and so on.


    The solution is simple: just drop the redundant variables. (The sum of squares without the redundant variables will be identical to that with them.) Second, at first blush, the restrictions in the nonlinear model appear complicated. The restricted model, however, is actually quite straightforward. Rewrite it in a familiar form:

    $$ H_0: \ y_t - \rho y_{t-1} = (\mathbf{x}_t - \rho\mathbf{x}_{t-1})'\boldsymbol{\beta} + \varepsilon_t. $$

    Given $\rho$, the regression is linear. In this form, a grid search over the values of $\rho$ can be used to obtain the full set of estimates. (Cochrane–Orcutt and the other two-step estimators are likely not to be the best solution.) Also, it is important to search the full [0, 1] range to allow for the possibility of local minima of the sum of squares. Depending on the available software, it may be equally simple just to fit the nonlinear regression model directly.

    Higher-order models can be handled analogously. In an AR(1) model, this "common factor" restriction (the reason for the name will be clear shortly) takes the form

    $$ (1 - \gamma L)y_t = (\beta_0 + \beta_1 L)x_t + \varepsilon_t, \qquad \beta_1 = -\gamma\beta_0. $$
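The grid search and the F test for the single common factor can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: NumPy is used, there is one regressor plus a constant, the data are simulated with hypothetical values ($\rho = 0.6$, slope 1, T = 500), the grid covers [0, 0.99] in steps of 0.01, and the unrestricted regression includes only one constant (the redundant one having been dropped). Function names are hypothetical.

```python
import numpy as np

def ols_sse(y, X):
    """Sum of squared residuals and coefficients from least squares."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid), coef

def restricted_fit(y, x, grid):
    """For each rho, regress y_t - rho*y_{t-1} on a constant and x_t - rho*x_{t-1}; keep the best."""
    best = None
    for rho in grid:
        y_star = y[1:] - rho * y[:-1]
        X_star = np.column_stack([np.ones(len(y_star)), x[1:] - rho * x[:-1]])
        s, coef = ols_sse(y_star, X_star)
        if best is None or s < best[0]:
            best = (s, rho, coef)
    return best

def unrestricted_fit(y, x):
    """Regress y_t on a constant, x_t, y_{t-1}, and x_{t-1}."""
    X = np.column_stack([np.ones(len(y) - 1), x[1:], y[:-1], x[:-1]])
    return ols_sse(y[1:], X)

def common_factor_F(y, x, grid):
    """F statistic for the one common factor restriction, and the grid estimate of rho."""
    e0, rho_hat, _ = restricted_fit(y, x, grid)
    e1, _ = unrestricted_fit(y, x)
    J, n, K = 1, len(y) - 1, 4
    return ((e0 - e1) / J) / (e1 / (n - K)), rho_hat

# Simulated data that satisfy the restriction (AR(1) disturbance).
rng = np.random.default_rng(0)
T, rho_true = 500, 0.6
x = rng.standard_normal(T)
v = np.zeros(T)
for t in range(1, T):
    v[t] = rho_true * v[t - 1] + rng.standard_normal()
y = 1.0 + 1.0 * x + v

F_stat, rho_hat = common_factor_F(y, x, np.linspace(0.0, 0.99, 100))
print(F_stat, rho_hat)
```

Because the restricted model is nested in the unrestricted one and both use the same observations, the restricted sum of squares can never fall below the unrestricted one, so the statistic is nonnegative.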

    Consider, instead, an AR(2) model. The "restricted" and unrestricted models would appear as

    $$ H_0: \ (1 - \rho_1 L - \rho_2 L^2)y_t = (1 - \rho_1 L - \rho_2 L^2)\mathbf{x}_t'\boldsymbol{\beta} + \varepsilon_t, $$

    $$ H_1: \ y_t = \gamma_1 y_{t-1} + \gamma_2 y_{t-2} + \mathbf{x}_t'\boldsymbol{\beta}_0 + \mathbf{x}_{t-1}'\boldsymbol{\beta}_1 + \mathbf{x}_{t-2}'\boldsymbol{\beta}_2 + \varepsilon_t, $$

    so the full set of restrictions is $\boldsymbol{\beta}_1 = -\gamma_1\boldsymbol{\beta}_0$ and $\boldsymbol{\beta}_2 = -\gamma_2\boldsymbol{\beta}_0$. This expanded model can be handled analogously to the AR(1) model. Once again, an F test of the nonlinear restrictions can be used.

    This approach neglects another possibility. The restricted model above goes the full distance from the unrestricted model to the AR(2) autocorrelation model. There is an intermediate possibility. The polynomials in the lag operator, $C(L)$ and $B(L)$, can be factored into products of linear, primitive terms. A quadratic equation in $L$, for example, may always be written as

    $$ C(L) = (1 - \gamma_1 L - \gamma_2 L^2) = (1 - \lambda_1 L)(1 - \lambda_2 L), $$

    where the $\lambda$'s are the roots of the characteristic polynomial $C(z) = 0$. Here, $B(L)$ may be factored likewise, say into $(1 - \tau_1 L)(1 - \tau_2 L)$. These "roots" may include pairs of imaginary values. With these results in hand, rewrite the basic model $C(L)y_t = B(L)x_t + \varepsilon_t$ in the form

    $$ (1 - \lambda_1 L)(1 - \lambda_2 L)y_t = (1 - \tau_1 L)(1 - \tau_2 L)\mathbf{x}_t'\boldsymbol{\beta} + \varepsilon_t. $$

    Now suppose that $\lambda_1 = \tau_1 = \rho$. Dividing through both sides of the equation by $(1 - \rho L)$ produces the restricted model

    $$ (1 - \lambda_2 L)y_t = (1 - \tau_2 L)\mathbf{x}_t'\boldsymbol{\beta} + \frac{\varepsilon_t}{1 - \rho L}. $$

    The restricted model is a lower-order autoregression, which has some virtue, but now, by construction, its disturbance is an AR(1) process in $\rho$. (This conclusion was expected, of course, since we reached it in reverse at the beginning of this section.) The restricted model is appropriate only if the two polynomials have a common factor, $(1 - \lambda_1 L) = (1 - \tau_1 L)$, hence the name for the procedure.


    It is useful to develop this procedure in more detail for an ARDL(2,2) model. Write the distributed lag part, $B(L)$, as $\beta_0(1 - \beta_1 L - \beta_2 L^2)$. Multiplying out the factors, we see that the unrestricted model,

    $$ y_t = \mu + \gamma_1 y_{t-1} + \gamma_2 y_{t-2} + \beta_0(1 - \beta_1 L - \beta_2 L^2)x_t + \varepsilon_t, $$

    can be written as

    $$ y_t = \mu + (\lambda_1 + \lambda_2)y_{t-1} - (\lambda_1\lambda_2)y_{t-2} + \beta_0 x_t - \beta_0(\tau_1 + \tau_2)x_{t-1} + \beta_0(\tau_1\tau_2)x_{t-2} + \varepsilon_t. $$

    Despite what appears to be extreme nonlinearity, this equation is intrinsically linear. In fact, it cannot be estimated in this form by nonlinear least squares, since any pair of values $\lambda_1, \lambda_2$ that one might find can just be reversed and the function and sum of squares will not change. The same is true for pairs of $\tau_1, \tau_2$. Of course, this information is irrelevant to the solution, since the model can be fit by ordinary linear least squares in the ARDL form just above it, and for the test, we only need the sum of squares. But now impose the common factor restriction $(1 - \lambda_1 L) = (1 - \tau_1 L)$, or $\lambda_1 = \tau_1$. The now very nonlinear regression model

    $$ y_t = \mu + (\tau_1 + \lambda_2)y_{t-1} - (\tau_1\lambda_2)y_{t-2} + \beta_0 x_t - \beta_0(\tau_1 + \tau_2)x_{t-1} + \beta_0(\tau_1\tau_2)x_{t-2} + \varepsilon_t $$

    has six terms on the right-hand side but only five parameters and is overidentified. This model can be fit as is by nonlinear least squares. The F test of one restriction suggested earlier can now be carried out. Note that this test of one common factor restriction is a test of the hypothesis of the ARDL(1,1) model with an AR(1) disturbance against the unrestricted ARDL(2,2) model. Turned around, we note, once again, that a finding of autocorrelation in the ARDL(1,1) model does not necessarily suggest that one should just use GLS. The appropriate next step might be to expand the model. Finally, testing both common factor restrictions in this model is equivalent to testing the two restrictions $\gamma_1 = \rho_1$ and $\gamma_2 = \rho_2$ in the model

    $$ y_t = \gamma_1 y_{t-1} + \gamma_2 y_{t-2} + \beta(x_t - \rho_1 x_{t-1} - \rho_2 x_{t-2}) + \varepsilon_t. $$

    The unrestricted model is the linear ARDL(2,2) we used earlier. The restricted model is nonlinear, but it can be estimated easily by nonlinear least squares.

    The analysis of common factors in models more complicated than ARDL(2,2) is extremely involved. [See Hendry (1993) and Hendry and Doornik (1996).]

    Example 19.6  Testing Common Factor Restrictions

    The consumption and income data used in Example 19.5 (quarters 1950.3 to 2000.4) are used to fit an unrestricted ARDL(2,2) model,

    $$ c_t = \mu + \gamma_1 c_{t-1} + \gamma_2 c_{t-2} + \beta_0 y_t + \beta_1 y_{t-1} + \beta_2 y_{t-2} + \varepsilon_t. $$

    Ordinary least squares estimates of the parameters appear in Table 19.4. For the one common factor model, the parameters are formulated as

    $$ c_t = \mu + (\tau_1 + \lambda_2)c_{t-1} - (\tau_1\lambda_2)c_{t-2} + \beta_0 y_t - \beta_0(\tau_1 + \tau_2)y_{t-1} + \beta_0(\tau_1\tau_2)y_{t-2} + \varepsilon_t. $$

    The structural parameters are computed using nonlinear least squares, and then the ARDL coefficients are computed from these. A two common factors model is obtained by imposing the additional restriction $\lambda_2 = \tau_2$. The resulting model is the familiar one,

    $$ c_t = \mu + \rho_1 c_{t-1} + \rho_2 c_{t-2} + \beta_0(y_t - \rho_1 y_{t-1} - \rho_2 y_{t-2}) + \varepsilon_t. $$


    TABLE 19.4  Estimated Autoregressive Distributed Lag Models

                                                  Parameter
    Restrictions   $\mu$        $\gamma_1$   $\gamma_2$   $\beta_0$    $\beta_1$    $\beta_2$    e'e
    2               0.04020      0.6959       0.3044       0.5710      -0.3974      -0.1739      0.0091238
                   (0.006397)   (0.06741)    (0.06747)    (0.04229)    (0.04563)    (0.04206)
                   [Estimated: $\rho_1$ = 0.6959, $\rho_2$ = 0.3044]
    1               0.006499     0.6456       0.2724       0.5972       0.6104       0.2596      0.0088736
                   (0.02959)    (0.06866)    (0.06784)    (0.04342)    (0.07225)    (0.06685)
                   [Estimated: $\tau_1$ = 0.2887, $\tau_2$ = 0.8992, $\lambda_2$ = 0.9433]
    0              -0.06628      0.6487       0.2766       0.6126      -0.4004      -0.1329      0.0088626
                   (0.03014)    (0.07066)    (0.06935)    (0.05408)    (0.08759)    (0.06218)

    Standard errors are given in parentheses. As expected, they decline generally as the restrictions are added. The sum of squares increases at the same time. The F statistic for one restriction is

    $$ F = \frac{(0.0088736 - 0.0088626)/1}{0.0088626/(202 - 6)} = 0.243. $$

    The 95 percent critical value from the F[1, 119] table is 3.921, so the hypothesis of the single common factor cannot be rejected. The F statistic for two restrictions is 5.777 against a critical value of 3.072, so the hypothesis of the AR(2) disturbance model is rejected.
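The arithmetic behind the one-restriction F statistic can be reproduced directly from the sums of squares in Table 19.4. The short sketch below is plain Python; the function name is hypothetical, and the degrees of freedom (202 observations, 6 parameters in the unrestricted model) are those used in the displayed calculation.

```python
def f_statistic(sse_restricted, sse_unrestricted, J, T, K):
    """F statistic for J restrictions: [(e0'e0 - e1'e1)/J] / [e1'e1/(T - K)]."""
    return ((sse_restricted - sse_unrestricted) / J) / (sse_unrestricted / (T - K))

# One common factor restriction against the unrestricted ARDL(2,2) model.
F_one = f_statistic(0.0088736, 0.0088626, J=1, T=202, K=6)
print(round(F_one, 3))
```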

    19.6 VECTOR AUTOREGRESSIONS

    The preceding discussions can be extended to sets of variables. The resulting autoregressive model is

    $$ \mathbf{y}_t = \boldsymbol{\mu} + \boldsymbol{\Gamma}_1\mathbf{y}_{t-1} + \cdots + \boldsymbol{\Gamma}_p\mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t, \qquad (19\text{-}30) $$

    where $\boldsymbol{\varepsilon}_t$ is a vector of nonautocorrelated disturbances (innovations) with zero means and contemporaneous covariance matrix $E[\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_t'] = \boldsymbol{\Omega}$. This equation system is a vector autoregression, or VAR. Equation (19-30) may also be written as

    $$ \boldsymbol{\Gamma}(L)\mathbf{y}_t = \boldsymbol{\mu} + \boldsymbol{\varepsilon}_t, $$

    where $\boldsymbol{\Gamma}(L)$ is a matrix of polynomials in the lag operator. The individual equations are

    $$ y_{mt} = \mu_m + \sum_{j=1}^{p} (\boldsymbol{\Gamma}_j)_{m1}\,y_{1,t-j} + \sum_{j=1}^{p} (\boldsymbol{\Gamma}_j)_{m2}\,y_{2,t-j} + \cdots + \sum_{j=1}^{p} (\boldsymbol{\Gamma}_j)_{mM}\,y_{M,t-j} + \varepsilon_{mt}, $$

    where $(\boldsymbol{\Gamma}_j)_{lm}$ indicates the $(l, m)$ element of $\boldsymbol{\Gamma}_j$.

    VARs have been used primarily in macroeconomics. Early in their development, it was argued by some authors [e.g., Sims (1980), Litterman (1979, 1986)] that VARs would forecast better than the sort of structural equation models discussed in Chapter 15. One could argue that as long as $\boldsymbol{\mu}$ includes the current observations on the (truly) relevant exogenous variables, the VAR is simply an overfit reduced form of some simultaneous equations model. [See Hamilton (1994, pp. 326–327).] The overfitting results from the possible inclusion of more lags than would be appropriate in the original model. (See Example 19.8 for a detailed discussion of one such model.) On the other hand, one of the virtues of the VAR is that it obviates a decision as to what contemporaneous variables


    are exogenous; it has only lagged (predetermined) variables on the right-hand side, and all variables are endogenous.

    The motivation behind VARs in macroeconomics runs deeper than the statistical issues.⁹ The large structural equations models of the 1950s and 1960s were built on a theoretical foundation that has not proved satisfactory. That the forecasting performance of VARs surpassed that of large structural models (some of the later counterparts to Klein's Model I ran to hundreds of equations) signaled to researchers a more fundamental problem with the underlying methodology. The Keynesian style systems of equations describe a structural model of decisions (consumption, investment) that seem loosely to mimic individual behavior; see Keynes's formulation of the consumption function in Example 1.1 that is, perhaps, the canonical example. In the end, however, these decision rules are fundamentally ad hoc, and there is little basis on which to assume that they would aggregate to the macroeconomic level anyway. On a more practical level, the high inflation and high unemployment experienced in the 1970s were very badly predicted by the Keynesian paradigm. From the point of view of the underlying paradigm, the most troubling criticism of the structural modeling approach comes in the form of "the Lucas critique" (1976), in which the author argued that the parameters of the "decision rules" embodied in the systems of structural equations would not remain stable when economic policies changed, even if the rules themselves were appropriate. Thus, the paradigm underlying the systems of equations approach to macroeconomic modeling is arguably fundamentally flawed. More recent research has reformulated the basic equations of macroeconomic models in terms of a microeconomic optimization foundation and has, at the same time, been much less ambitious in specifying the interrelationships among economic variables.

    The preceding arguments have drawn researchers to less structured equation systems for forecasting. Thus, it is not just the form of the equations that has changed. The variables in the equations have changed as well; the VAR is not just the reduced form of some structural model. For purposes of analyzing and forecasting macroeconomic activity and tracing the effects of policy changes and external stimuli on the economy, researchers have found that simple, small-scale VARs without a possibly flawed theoretical foundation have proved as good as or better than large-scale structural equation systems. In addition to forecasting, VARs have been used for two primary functions: testing Granger causality and studying the effects of policy through impulse response characteristics.
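19.6.1 MODEL FORMS

To simplify things for the present, we note that the pth order VAR can be written as a first-order VAR as follows:

\[
\begin{bmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-p+1} \end{bmatrix}
=
\begin{bmatrix} \mu \\ 0 \\ \vdots \\ 0 \end{bmatrix}
+
\begin{bmatrix}
\Gamma_1 & \Gamma_2 & \cdots & \Gamma_p \\
I & 0 & \cdots & 0 \\
\vdots & \ddots & & \vdots \\
0 & \cdots & I & 0
\end{bmatrix}
\begin{bmatrix} y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p} \end{bmatrix}
+
\begin{bmatrix} \varepsilon_t \\ 0 \\ \vdots \\ 0 \end{bmatrix}.
\]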
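The stacking above can be sketched numerically. The following is a minimal illustration, not part of the original treatment: the function name `companion_form` and the two-variable VAR(2) coefficients are hypothetical, chosen only to show how the lag matrices fill the top block row while identity blocks carry the lagged values forward.

```python
import numpy as np

def companion_form(mu, gammas):
    """Stack a VAR(p) into first-order (companion) form.

    mu     : (M,) intercept vector
    gammas : list of p arrays, each (M, M), the lag matrices Gamma_1..Gamma_p
    Returns (mu_big, gamma_big) with shapes (M*p,) and (M*p, M*p).
    """
    p = len(gammas)
    M = gammas[0].shape[0]
    gamma_big = np.zeros((M * p, M * p))
    gamma_big[:M, :] = np.hstack(gammas)        # top block row: Gamma_1 ... Gamma_p
    gamma_big[M:, :-M] = np.eye(M * (p - 1))    # identity blocks shift the lags down
    mu_big = np.concatenate([mu, np.zeros(M * (p - 1))])
    return mu_big, gamma_big

# Hypothetical two-variable VAR(2); the numbers are illustrative only.
mu = np.array([0.1, 0.2])
G1 = np.array([[0.5, 0.1], [0.0, 0.3]])
G2 = np.array([[0.2, 0.0], [0.1, 0.1]])
mu_big, gamma_big = companion_form(mu, [G1, G2])
```

The first M rows of the stacked system reproduce the original VAR(p) equation; the remaining rows are identities.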



9 An extremely readable, nontechnical discussion of the paradigm shift in macroeconomic forecasting is given in Diebold (1998b). See also Stock and Watson (2001).

    Greene 50240

    book

    June 26 2002

    21 55

    588

    CHAPTER 19 Models with Lagged Variables

This means that we do not lose any generality in casting the treatment in terms of a first-order model

\[ y_t = \mu + \Gamma y_{t-1} + \varepsilon_t. \]

In Section 18.5, we examined Dahlberg and Johansson's model for municipal finances in Sweden, in which $y_t = [S_t, R_t, G_t]'$, where $S_t$ is spending, $R_t$ is receipts, and $G_t$ is grants from the central government, and $p = 3$. We will continue that application in Example 19.8 below.

In principle, the VAR model is a seemingly unrelated regressions model; indeed, a particularly simple one, since each equation has the same set of regressors. This is the traditional form of the model as originally proposed, for example, by Sims (1980). The VAR may also be viewed as the reduced form of a simultaneous equations model; the corresponding structure would then be

\[ \Theta y_t = \alpha + \Psi y_{t-1} + \omega_t, \]

where $\Theta$ is a nonsingular matrix and $\mathrm{Var}[\omega_t] = \Sigma$. In one of Cecchetti and Rich's (2001) formulations, for example, $y_t = [\Delta y_t, \Delta\pi_t]'$, where $y_t$ is the log of aggregate real output, $\pi_t$ is the inflation rate from time $t-1$ to time $t$,

\[ \Theta = \begin{bmatrix} 1 & -\theta_{12} \\ -\theta_{21} & 1 \end{bmatrix}, \]

and $p = 8$. We will examine their model in Section 19.6.8. In this form, we have a conventional simultaneous equations model, which we analyzed in detail in Chapter 15. As we saw, in order for such a model to be identified (that is, estimable), certain restrictions must be placed on the structural coefficients. The reason for this is that ultimately, only the original VAR form, now the reduced form, is estimated from the data; the structural parameters must be deduced from these coefficients. In this model, in order to deduce these structural parameters, they must be extracted from the reduced-form parameters $\Gamma = \Theta^{-1}\Psi$, $\mu = \Theta^{-1}\alpha$, and $\Omega = \Theta^{-1}\Sigma(\Theta^{-1})'$. We analyzed this issue in detail in Section 15.3. The results would be the same here. In Cecchetti and Rich's application, certain restrictions were placed on the lag coefficients in order to secure identification.

19.6.2 ESTIMATION

In the form of (19-30), that is, without autocorrelation of the disturbances, VARs are particularly simple to estimate. Although the equation system can be exceedingly large, it is, in fact, a seemingly unrelated regressions model with identical regressors. As such, the equations should be estimated separately by ordinary least squares. (See Section 14.4.2 for discussion of SUR systems with identical regressors.) The disturbance covariance matrix can then be estimated with average sums of squares or cross products of the least squares residuals. If the disturbances are normally distributed, then these least squares estimators are also maximum likelihood. If not, then OLS remains an efficient GMM estimator. The extension to instrumental variables and GMM is a bit more complicated, as the model now contains multiple equations (see Section 14.4), but since the equations are all linear, the necessary extensions are at least relatively straightforward. GMM estimation of the VAR system is a special case of the model discussed in Section 14.4. (We will examine an application below in Example 20.8.)

The proliferation of parameters in VARs has been cited as a major disadvantage of their use. Consider, for example, a VAR involving five variables and three lags. Each $\Gamma$


has 25 unconstrained elements, and there are three of them, for a total of 75 free parameters, plus any others in $\mu$, plus $5(6)/2 = 15$ free parameters in $\Omega$. On the other hand, each single equation has only 15 slope parameters (five variables times three lags), and at least given sufficient degrees of freedom (there's the rub), a linear regression with 15 parameters is simple work. Moreover, applications rarely involve even as many as four variables, so the model size issue may well be exaggerated.
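Equation-by-equation least squares with a common regressor matrix can be sketched as follows. This is only an illustration under stated assumptions: the helper names `fit_var_ols` and `count_free_parameters` are hypothetical, a constant term is included in every equation, and the data are simulated from an arbitrary stable two-variable VAR(1) rather than taken from any application in the text.

```python
import numpy as np

def fit_var_ols(Y, p):
    """OLS for a VAR(p) with a constant; Y is (T_total, M), oldest row first.

    Returns (B, Omega, resid). B is (1 + M*p, M): row 0 holds the intercepts,
    rows 1..M hold Gamma_1 transposed, the next M rows Gamma_2 transposed, etc.
    """
    T_total, M = Y.shape
    # The same regressors (constant plus lags 1..p) appear in every equation.
    X = np.hstack([np.ones((T_total - p, 1))] +
                  [Y[p - j:T_total - j] for j in range(1, p + 1)])
    Y_dep = Y[p:]
    B, *_ = np.linalg.lstsq(X, Y_dep, rcond=None)  # solves all M equations at once
    resid = Y_dep - X @ B
    Omega = resid.T @ resid / Y_dep.shape[0]       # average cross products of residuals
    return B, Omega, resid

def count_free_parameters(M, p):
    """(slopes in Gamma_1..Gamma_p, intercepts, distinct elements of Omega)."""
    return M * M * p, M, M * (M + 1) // 2

# Simulated data from a hypothetical stable VAR(1) with unit-variance shocks.
rng = np.random.default_rng(0)
Gamma = np.array([[0.5, 0.2], [-0.1, 0.4]])
mu = np.array([1.0, 0.5])
Y = np.zeros((5000, 2))
for t in range(1, 5000):
    Y[t] = mu + Gamma @ Y[t - 1] + rng.standard_normal(2)

B, Omega, _ = fit_var_ols(Y, p=1)
Gamma_hat = B[1:].T
```

Because the regressors are identical across equations, a single least squares solve with a matrix of dependent variables gives the same coefficients as M separate regressions.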
19.6.3 TESTING PROCEDURES

Formal testing in the VAR setting usually centers either on determining the appropriate lag length (a specification search) or on whether certain blocks of the coefficient matrices are zero (a simple linear restriction on the collection of slope parameters). Both types of hypotheses may be treated as sets of linear restrictions on the elements in $\gamma = \mathrm{vec}[\mu, \Gamma_1, \Gamma_2, \ldots, \Gamma_p]$.

We begin by assuming that the disturbances have a joint normal distribution. Let $W$ be the $M \times M$ residual covariance matrix based on a restricted model, and let $W^*$ be its counterpart when the model is unrestricted. Then the likelihood ratio statistic,

\[ \lambda = T(\ln|W| - \ln|W^*|), \]

can be used to test the hypothesis. The statistic would have a limiting chi-squared distribution with degrees of freedom equal to the number of restrictions. In principle, one might base a specification search for the right lag length on this calculation. The procedure would be to test down from, say, lag $q$ to lag $p$. The general-to-simple principle discussed in Section 19.5.3 would be to set the maximum lag length and test down from it until deletion of the last set of lags leads to a significant loss of fit. At each step at which the alternative lag model has excess terms, the estimators of the superfluous coefficient matrices would have probability limits of zero and the likelihood function would (again, asymptotically) resemble that of the model with the correct number of lags. Formally, suppose the appropriate lag length is $p$ but the model is fit with $q \ge p + 1$ lagged terms. Then, under the null hypothesis,

\[ \lambda_q = T\left[\ln|W(\mu, \Gamma_1, \ldots, \Gamma_{q-1})| - \ln|W^*(\mu, \Gamma_1, \ldots, \Gamma_q)|\right] \xrightarrow{d} \chi^2[M^2]. \]

The same approach would be used to test other restrictions. Thus, the Granger causality test noted below would fit the model with and without certain blocks of zeros in the coefficient matrices, then refer the value of $\lambda$ once again to the chi-squared distribution.

For specification searches for the right lag, the suggested procedure may be less effective than one based on the information criteria suggested for other linear models (see Section 8.4). Lutkepohl (1993, pp. 128-135) suggests an alternative approach based on minimizing functions of the information criteria we have considered earlier;

\[ \lambda^* = \ln|W| + (pM^2 + M)\,\mathrm{IC}(T)/T, \]

where $T$ is the sample size, $p$ is the number of lags, $M$ is the number of equations, and $\mathrm{IC}(T) = 2$ for the Akaike information criterion and $\ln T$ for the Schwarz (Bayesian) information criterion. We should note, this is not a test statistic; it is a diagnostic tool that we are using to conduct a specification search. Also, as in all such cases, the testing procedure should be from a larger one to a smaller one to avoid the misspecification problems induced by a lag length that is smaller than the appropriate one.
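Both calculations are simple functions of residual covariance matrices. The sketch below is illustrative only: the function names are hypothetical, and the two covariance matrices are made-up inputs standing in for the restricted and unrestricted fits.

```python
import numpy as np

def lr_statistic(W_restricted, W_unrestricted, T):
    """lambda = T (ln|W| - ln|W*|) for restricted W and unrestricted W*."""
    _, logdet_r = np.linalg.slogdet(W_restricted)
    _, logdet_u = np.linalg.slogdet(W_unrestricted)
    return T * (logdet_r - logdet_u)

def lag_criterion(W, p, T, kind="aic"):
    """ln|W| + (p M^2 + M) IC(T) / T, with IC(T) = 2 (Akaike) or ln T (Schwarz)."""
    M = W.shape[0]
    _, logdet = np.linalg.slogdet(W)
    ic_T = 2.0 if kind == "aic" else np.log(T)
    return logdet + (p * M**2 + M) * ic_T / T

# Made-up residual covariance matrices for a two-equation system.
W_r = 2.0 * np.eye(2)
W_u = np.eye(2)
lam = lr_statistic(W_r, W_u, T=100)          # equals 100 * ln(4)
aic = lag_criterion(W_u, p=1, T=100, kind="aic")
sic = lag_criterion(W_u, p=1, T=100, kind="schwarz")
```

In a lag search, one would compute `lag_criterion` for each candidate p on a common estimation sample and choose the minimizer; using a common sample is an added practical assumption, not something stated in the text.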


The preceding has relied heavily on the normality assumption. Since most recent applications of these techniques have either treated the least squares estimators as robust (distribution-free) estimators, or used GMM (as we did in Chapter 18), it is necessary to consider a different approach that does not depend on normality. An alternative approach which should be robust to variations in the underlying distributions is the Wald statistic. [See Lutkepohl (1993, pp. 93-95).] The full set of coefficients in the model may be arrayed in a single coefficient vector, $\gamma$. Let $c$ be the sample estimator of $\gamma$ and let $V$ denote the estimated asymptotic covariance matrix. Then, the hypothesis in question (lag length, or other linear restriction) can be cast in the form $R\gamma - q = 0$. The Wald statistic for testing the null hypothesis is

\[ W = (Rc - q)'[RVR']^{-1}(Rc - q). \]

Under the null hypothesis, this statistic has a limiting chi-squared distribution with degrees of freedom equal to $J$, the number of restrictions (rows in $R$). For the specification search for the appropriate lag length (or the Granger causality test discussed in the next section), the null hypothesis will be that a certain subvector of $\gamma$, say $\gamma_0$, equals zero. In this case, the statistic will be

\[ W_0 = c_0' V_{00}^{-1} c_0, \]

where $V_{00}$ denotes the corresponding submatrix of $V$.

Since time series data sets are often only moderately long, use of the limiting distribution for the test statistic may be a bit optimistic. Also, the Wald statistic does not account for the fact that the asymptotic covariance matrix is estimated using a finite sample. In our analysis of the classical linear regression model, we accommodated these considerations by using the F distribution instead of the limiting chi-squared. (See Section 6.4.) The adjustment made was to refer $W/J$ to the $F[J, T - K]$ distribution. This produces a more conservative test; the corresponding critical values of $JF$ converge to those of the chi-squared from above. A remaining complication is to decide what degrees of freedom to use for the denominator. It might seem natural to use $MT$ minus the number of parameters, which would be correct if the restrictions are imposed on all equations simultaneously, since there are that many observations. In testing for causality, as in Section 19.6.5 below, Lutkepohl (1993, p. 95) argues that $MT$ is excessive, since the restrictions are not imposed on all equations. When the causality test involves testing for zero restrictions within a single equation, the appropriate degrees of freedom would be $T - Mp - 1$ for that one equation.
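The Wald computation and its F-type rescaling can be sketched in a few lines. The function name `wald_statistic` is hypothetical and the coefficient vector and covariance matrix below are made-up numbers, used only to show the mechanics of a zero restriction on a subvector.

```python
import numpy as np

def wald_statistic(c, V, R, q):
    """W = (Rc - q)' [R V R']^{-1} (Rc - q); returns (W, J) with J = rows of R."""
    d = R @ c - q
    W = float(d @ np.linalg.solve(R @ V @ R.T, d))
    return W, R.shape[0]

# Made-up estimates: test that the last two of three coefficients are zero.
c = np.array([0.5, 0.2, -0.1])
V = np.diag([0.01, 0.04, 0.01])
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
q = np.zeros(2)

W, J = wald_statistic(c, V, R, q)
F_version = W / J   # to be referred to F[J, denominator df] rather than chi-squared[J]
```

With a selection matrix R, the statistic reduces to the subvector form $c_0' V_{00}^{-1} c_0$; here that is 0.2²/0.04 + 0.1²/0.01.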
19.6.4 EXOGENEITY

In the classical regression model with nonstochastic regressors, there is no ambiguity about which is the independent or conditioning or "exogenous" variable in the model

\[ y_t = \beta_1 + \beta_2 x_t + \varepsilon_t. \tag{19-31} \]

This is the kind of characterization that might apply in an experimental situation in which the analyst is choosing the values of $x_t$. But the case of nonstochastic regressors has little to do with the sort of modeling that will be of interest in this and the next chapter. There is no basis for the narrow assumption of nonstochastic regressors, and, in fact, in most of the analysis that we have done to this point, we have left this assumption


far behind. With stochastic regressor(s), the regression relationship such as the one above becomes a conditional mean in a bivariate distribution. In this more realistic setting, what constitutes an "exogenous" variable becomes ambiguous. Assuming that the regression relationship is linear, (19-31) can be written (trivially) as

\[ y_t = E[y_t \mid x_t] + (y_t - E[y_t \mid x_t]), \]

where the familiar moment condition $E[x_t\varepsilon_t] = 0$ follows by construction. But this form of the model is no more the "correct" equation than would be

\[ x_t = \delta_1 + \delta_2 y_t + \omega_t, \]

which is (we assume)

\[ x_t = E[x_t \mid y_t] + (x_t - E[x_t \mid y_t]), \]

and now, $E[y_t\omega_t] = 0$. Since both equations are correctly specified in the context of the bivariate distribution, there is nothing to define one variable or the other as "exogenous." This might seem puzzling, but it is, in fact, at the heart of the matter when one considers modeling in a world in which variables are jointly determined. The definition of exogeneity depends on the analyst's understanding of the world they are modeling, and, in the final analysis, on the purpose to which the model is to be put.

The methodological platform on which this discussion rests is the classic paper by Engle, Hendry, and Richard (1983), where they point out that exogeneity is not an absolute concept at all; it is defined in the context of the model. The central idea, which will be very useful to us here, is that we define a variable (set of variables) as exogenous in the context of our model if the joint density may be written

\[ f(y_t, x_t) = f(y_t \mid \beta, x_t)\, f(\theta, x_t), \]

where the parameters in the conditional distribution do not appear in and are functionally unrelated to those in the marginal distribution of $x_t$. By this arrangement, we can think of "autonomous variation" of the parameters of interest. The parameters in the conditional model for $y_t \mid x_t$ can be analyzed as if they could vary independently of those in the marginal distribution of $x_t$. If this condition does not hold, then we cannot think of variation of those parameters without linking that variation to some effect in the marginal distribution of $x_t$. In this case, it makes little sense to think of $x_t$ as somehow being determined "outside" the (conditional) model. (We considered this issue in Section 15.8 in the context of a simultaneous equations model.)

A second form of exogeneity we will consider is strong exogeneity, which is sometimes called Granger noncausality. Granger noncausality can be superficially defined by the assumption

\[ E[y_t \mid y_{t-1}, x_{t-1}, x_{t-2}, \ldots] = E[y_t \mid y_{t-1}]. \]

That is, lagged values of $x_t$ do not provide information about the conditional mean of $y_t$ once lagged values of $y_t$, itself, are accounted for. We will consider this issue at the end of this chapter. For the present, we note that most of the models we will examine will explicitly fail this assumption. To put this back in the context of our model, we will be assuming that in the model

\[ y_t = \beta_1 + \beta_2 x_t + \beta_3 x_{t-1} + \gamma y_{t-1} + \varepsilon_t \]


and the extensions that we will consider, $x_t$ is weakly exogenous (we can meaningfully estimate the parameters of the regression equation independently of the marginal distribution of $x_t$), but we will allow for Granger causality between $x_t$ and $y_t$, thus generally not assuming strong exogeneity.

19.6.5 TESTING FOR GRANGER CAUSALITY

Causality in the sense defined by Granger (1969) and Sims (1972) is inferred when lagged values of a variable, say $x_t$, have explanatory power in a regression of a variable $y_t$ on lagged values of $y_t$ and $x_t$. (See Section 15.2.2.) The VAR can be used to test the hypothesis.10 Tests of the restrictions can be based on simple F tests in the single equations of the VAR model. That the unrestricted equations have identical regressors means that these tests can be based on the results of simple OLS estimates. The notion can be extended in a system of equations to attempt to ascertain if a given variable is weakly exogenous to the system. If lagged values of a variable $x_t$ have no explanatory power for any of the variables in a system, then we would view $x$ as weakly exogenous to the system. Once again, this specification can be tested with a likelihood ratio test as described below (the restriction will be to put "holes" in one or more $\Gamma$ matrices) or with a form of F test constructed by stacking the equations.

Example 19.7 Granger Causality11

All but one of the major recessions in the U.S. economy since World War II have been preceded by large increases in the price of crude oil. Does movement of the price of oil cause movements in U.S. GDP in the Granger sense? Let $y_t = [\text{GDP}, \text{crude oil price}]_t'$. Then, a simple VAR would be

\[ y_t = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} + \begin{bmatrix} \alpha_1 & \alpha_2 \\ \beta_1 & \beta_2 \end{bmatrix} y_{t-1} + \begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix}. \]

To assert a causal relationship between oil prices and GDP, we must find that $\alpha_2$ is not zero; previous movements in oil prices do help explain movements in GDP even in the presence of the lagged value of GDP. Consistent with our earlier discussion, this fact, in itself, is not sufficient to assert a causal relationship. We would also have to demonstrate that there were no other intervening explanations that would explain movements in oil prices and GDP. (We will examine a more extensive application in Example 19.9.)
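The single-equation F test in the example can be sketched as follows. This is an illustration under assumptions: the function name `granger_f` is hypothetical, a constant is included in the regression, the denominator degrees of freedom follow the $T - Mp - 1$ rule of Section 19.6.3 with $M = 2$, and the series named `gdp` and `oil` are simulated, not actual U.S. data.

```python
import numpy as np

def granger_f(y, x, p):
    """F statistic for H0: the p lags of x have zero coefficients in the y equation."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(y)
    dep = y[p:]
    own = np.column_stack([y[p - j:n - j] for j in range(1, p + 1)])
    other = np.column_stack([x[p - j:n - j] for j in range(1, p + 1)])
    const = np.ones((n - p, 1))

    def rss(X):
        b, *_ = np.linalg.lstsq(X, dep, rcond=None)
        e = dep - X @ b
        return float(e @ e)

    rss_r = rss(np.hstack([const, own]))          # restricted: own lags only
    rss_u = rss(np.hstack([const, own, other]))   # unrestricted: add lags of x
    T = n - p
    df2 = T - 2 * p - 1                           # T - Mp - 1 with M = 2 variables
    F = ((rss_r - rss_u) / p) / (rss_u / df2)
    return F, p, df2

# Simulated stand-ins in which lagged "oil" lowers "gdp" by construction.
rng = np.random.default_rng(42)
n = 400
oil = np.zeros(n)
gdp = np.zeros(n)
for t in range(1, n):
    oil[t] = 0.6 * oil[t - 1] + rng.standard_normal()
    gdp[t] = 0.4 * gdp[t - 1] - 0.5 * oil[t - 1] + rng.standard_normal()

F_stat, df1, df2 = granger_f(gdp, oil, p=1)
```

A large F relative to the $F[1, 396]$ critical value indicates that $\alpha_2$ differs from zero in the simulated system; as the example stresses, that alone does not establish a causal relationship.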

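To establish the general result, it will prove useful to write the VAR in the multivariate regression format we used in Section 14.4.2. Partition the two data vectors $y_t$ and $x_t$ into $[y_{1t}, y_{2t}]$ and $[x_{1t}, x_{2t}]$. Consistent with our earlier discussion, $x_1$ is lagged values of $y_1$ and $x_2$ is lagged values of $y_2$. The VAR with this partitioning would be

\[ \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22} \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}, \qquad \mathrm{Var}\begin{bmatrix} \varepsilon_{1t} \\ \varepsilon_{2t} \end{bmatrix} = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}. \]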



We would still obtain the unrestricted maximum likelihood estimates by least squares regressions. For testing Granger causality, the hypothesis $\Gamma_{12} = 0$ is of interest. (See Example 19.7.) This model is the block of zeros case examined in Section 14.2.6. The full set of results we need are derived there. For testing the hypothesis of interest, $\Gamma_{12} = 0$, the second set of equations is irrelevant. For testing for Granger causality in
10 See Geweke, Meese, and Dent (1983), Sims (1980), and Stock and Watson (2001).

11 This example is adapted from Hamilton (1994, pp. 307-308).


the VAR model, only the restricted equations are relevant. The hypothesis can be tested using the likelihood ratio statistic. For the present application, testing means computing

$S_{11}$ = residual covariance matrix when current values of $y_1$ are regressed on values of both $x_1$ and $x_2$,
$S_{11}(0)$ = residual covariance matrix when current values of $y_1$ are regressed only on values of $x_1$.

The likelihood ratio statistic is then

\[ \lambda = T(\ln|S_{11}(0)| - \ln|S_{11}|). \]

The number of degrees of freedom is the number of zero restrictions. As discussed earlier, the fact that this test is wedded to the normal distribution limits its generality. The Wald test or its transformation to an approximate F statistic as described in Section 19.6.3 is an alternative that should be more generally applicable. When the equation system is fit by GMM, as in Example 19.8, the simplicity of the likelihood ratio test is lost. The Wald statistic remains usable, however. Another possibility is to use the GMM counterpart to the likelihood ratio statistic (see Section 18.4.2) based on the GMM criterion functions. This is just the difference in the GMM criteria. Fitting both restricted and unrestricted models in this framework may be burdensome, but having set up the GMM estimator for the (larger) unrestricted model, imposing the zero restrictions of the smaller model should require only a minor modification.

There is a complication in these causality tests. The VAR can be motivated by the Wold representation theorem (see Section 20.2.5, Theorem 20.1), although with assumed nonautocorrelated disturbances, the motivation is incomplete. On the other hand, there is no formal theory behind the formulation. As such, the causality tests are predicated on a model that may, in fact, be missing either intervening variables or additional lagged effects that should be present but are not. For the first of these, the problem is that a finding of causal effects might equally well result from the omission of a variable that is correlated with both of (or all) the left-hand-side variables.
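The likelihood ratio version of the test, which compares $S_{11}(0)$ with $S_{11}$, can be sketched for blocks of any size. The function name `granger_lr` is hypothetical, a constant is added to both regressions as an assumption, and the data are simulated so that one direction of causality holds by construction and the other does not.

```python
import numpy as np

def granger_lr(y1, y2, p):
    """LR statistic for H0: lags of y2 do not help predict y1 (Gamma_12 = 0).

    y1 : (T_total,) or (T_total, M1); y2 : (T_total,) or (T_total, M2).
    Returns (statistic, number of zero restrictions).
    """
    y1 = np.asarray(y1, dtype=float).reshape(len(y1), -1)
    y2 = np.asarray(y2, dtype=float).reshape(len(y2), -1)
    T_total = y1.shape[0]

    def lags(Z):
        return np.hstack([Z[p - j:T_total - j] for j in range(1, p + 1)])

    const = np.ones((T_total - p, 1))
    X_r = np.hstack([const, lags(y1)])     # x1 only: gives S11(0)
    X_u = np.hstack([X_r, lags(y2)])       # x1 and x2: gives S11
    dep = y1[p:]

    def resid_cov(X):
        B, *_ = np.linalg.lstsq(X, dep, rcond=None)
        e = dep - X @ B
        return e.T @ e / dep.shape[0]

    _, logdet_r = np.linalg.slogdet(resid_cov(X_r))
    _, logdet_u = np.linalg.slogdet(resid_cov(X_u))
    T = dep.shape[0]
    return T * (logdet_r - logdet_u), y1.shape[1] * y2.shape[1] * p

# Simulated pair: x helps predict y, but y does not help predict x.
rng = np.random.default_rng(1)
n = 500
x = np.zeros(n)
y = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
    y[t] = 0.3 * y[t - 1] + 0.8 * x[t - 1] + rng.standard_normal()

stat_xy, df = granger_lr(y, x, p=1)   # H0: x does not Granger-cause y
stat_yx, _ = granger_lr(x, y, p=1)    # H0: y does not Granger-cause x
```

Each statistic would be referred to the chi-squared distribution with `df` degrees of freedom.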
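19.6.6 IMPULSE RESPONSE FUNCTIONS

Any VAR can be written as a first-order model by augmenting it, if necessary, with additional identity equations. For example, the model

\[ y_t = \mu + \Gamma_1 y_{t-1} + \Gamma_2 y_{t-2} + v_t \]

can be written

\[ \begin{bmatrix} y_t \\ y_{t-1} \end{bmatrix} = \begin{bmatrix} \mu \\ 0 \end{bmatrix} + \begin{bmatrix} \Gamma_1 & \Gamma_2 \\ I & 0 \end{bmatrix}\begin{bmatrix} y_{t-1} \\ y_{t-2} \end{bmatrix} + \begin{bmatrix} v_t \\ 0 \end{bmatrix}, \]

which is a first-order model. We can study the dynamic characteristics of the model in either form, but the second is more convenient, as will soon be apparent. As we analyzed earlier, in the model

\[ y_t = \mu + \Gamma y_{t-1} + v_t, \]

dynamic stability is achieved if the characteristic roots of $\Gamma$ have modulus less than one. The roots may be complex, because $\Gamma$ need not be symmetric. See Section 19.4.3 for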


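the case of a single equation and Section 15.9 for analysis of essentially this model in a simultaneous equations context.

Assuming that the equation system is stable, the equilibrium is found by obtaining the final form of the system. We can do this step by repeated substitution or, more simply, by using the lag operator to write

\[ y_t = \mu + \Gamma(L) y_t + v_t \]

or

\[ [I - \Gamma(L)]\, y_t = \mu + v_t. \]

With the stability condition, we have

\[
\begin{aligned}
y_t &= [I - \Gamma(L)]^{-1}(\mu + v_t) \\
&= (I - \Gamma)^{-1}\mu + \sum_{i=0}^{\infty} \Gamma^i v_{t-i} \\
&= \bar{y} + \sum_{i=0}^{\infty} \Gamma^i v_{t-i} \\
&= \bar{y} + v_t + \Gamma v_{t-1} + \Gamma^2 v_{t-2} + \cdots.
\end{aligned}
\tag{19-32}
\]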
The coefficients in the powers of $\Gamma$ are the multipliers in the system. In fact, by renaming things slightly, this set of results is precisely the one we examined in Section 15.9 in our discussion of dynamic simultaneous equations models. We will change the interpretation slightly here, however. As we did in Section 15.9, we consider the conceptual experiment of disturbing a system in equilibrium. Suppose that $v$ has equaled 0 for long enough that $y$ has reached equilibrium, $\bar{y}$. Now we consider injecting a shock to the system by changing one of the $v$'s, for one period, and then returning it to zero thereafter. As we saw earlier, $y_{mt}$ will move away from, then return to, its equilibrium. The path whereby the variables return to the equilibrium is called the impulse response of the VAR.12

In the autoregressive form of the model, we can identify each innovation, $v_{mt}$, with a particular variable in $y_t$, say $y_{mt}$. Consider, then, the effect of a one-time shock to the system, $dv_{mt}$. As compared with the equilibrium, we will have, in the current period,

\[ y_{mt} - \bar{y}_m = dv_{mt} = \phi_{mm}(0)\,dv_{mt}. \]

One period later, we will have

\[ y_{m,t+1} - \bar{y}_m = (\Gamma)_{mm}\,dv_{mt} = \phi_{mm}(1)\,dv_{mt}. \]

Two periods later,

\[ y_{m,t+2} - \bar{y}_m = (\Gamma^2)_{mm}\,dv_{mt} = \phi_{mm}(2)\,dv_{mt}, \]

and so on. The function $\phi_{mm}(i)$ gives the impulse response characteristics of variable $y_m$ to innovations in $v_m$. A useful way to characterize the system is to plot the impulse response functions. The preceding traces through the effect on variable $m$ of a
12 See Hamilton (1994, pp. 318-323 and 336-350) for discussion and a number of related results.


one-time innovation in $v_m$. We could also examine the effect of a one-time innovation of $v_l$ on variable $m$. The impulse response function would be

\[ \phi_{ml}(i) = \text{element } (m, l) \text{ in } \Gamma^i. \]

Point estimation of $\phi_{ml}(i)$ using the estimated model parameters is straightforward. Confidence intervals present a more difficult problem because the estimated functions $\phi_{ml}(i)$ are so highly nonlinear in the original parameter estimates. The delta method has thus proved unsatisfactory. Killian (1998) presents results that suggest that bootstrapping may be the more productive approach to statistical inference regarding impulse response functions.
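Point estimates of the impulse responses are just elements of successive powers of the (first-order form) coefficient matrix. The sketch below also checks the stability condition and computes the equilibrium $\bar{y} = (I - \Gamma)^{-1}\mu$. The function names and the numerical values of `Gamma` and `mu` are hypothetical; no inference (delta method or bootstrap) is attempted.

```python
import numpy as np

def is_stable(Gamma):
    """True if every characteristic root of Gamma has modulus below one."""
    return bool(np.max(np.abs(np.linalg.eigvals(Gamma))) < 1.0)

def equilibrium(mu, Gamma):
    """y_bar = (I - Gamma)^{-1} mu, meaningful only for a stable system."""
    return np.linalg.solve(np.eye(Gamma.shape[0]) - Gamma, mu)

def impulse_responses(Gamma, horizon):
    """Array phi with phi[i] = Gamma**i (matrix power), i = 0..horizon.

    phi[i][m, l] is the response of variable m, i periods after a unit
    one-time innovation in v_l.
    """
    M = Gamma.shape[0]
    phi = np.empty((horizon + 1, M, M))
    phi[0] = np.eye(M)
    for i in range(1, horizon + 1):
        phi[i] = Gamma @ phi[i - 1]
    return phi

# Hypothetical first-order system.
Gamma = np.array([[0.5, 0.1], [0.2, 0.3]])
mu = np.array([1.0, 1.0])

stable = is_stable(Gamma)
y_bar = equilibrium(mu, Gamma)
phi = impulse_responses(Gamma, horizon=10)
own_response = phi[:, 0, 0]     # phi_11(i): variable 1 to its own innovation
cross_response = phi[:, 0, 1]   # phi_12(i): variable 1 to an innovation in v_2
```

Plotting `own_response` and `cross_response` against the horizon gives the impulse response functions described above; for a VAR(p), the same functions would be applied to the stacked first-order matrix and the upper-left block of each power retained.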
19.6.7 STRUCTURAL VARs

The VAR approach to modeling dynamic behavior of economic variables has provided some interesting insights and appears [see Litterman (1986)] to bring some real benefits for forecasting. The method has received some strident criticism for its atheoretical approach, however. The "unrestricted" nature of the lag structure in (19-30) could be synonymous with "unstructured." With no theoretical input to the model, it is difficult to claim that its output provides much of a theoretically justified result. For example, how are we to interpret the impulse response functions derived in the previous section? What lies behind much of this discussion is the idea that there is, in fact, a structure underlying the model, and the VAR that we have specified is a mere hodgepodge of all its components. Of course, that is exactly what reduced forms are. As such, to respond to this sort of criticism, analysts have begun to cast VARs formally as reduced forms and thereby attempt to deduce the structure that they had in mind all along.

A VAR model $y_t = \mu + \Gamma y_{t-1} + v_t$ could, in principle, be viewed as the reduced form of the dynamic structural model

\[ \Theta y_t = \alpha + \Phi y_{t-1} + \varepsilon_t, \]

where we have embedded any exogenous variables $x_t$ in the vector of constants $\alpha$. Thus, $\Gamma = \Theta^{-1}\Phi$, $\mu = \Theta^{-1}\alpha$, $v = \Theta^{-1}\varepsilon$, and $\Omega = \Theta^{-1}\Sigma(\Theta^{-1})'$. Perhaps it is the structure, specified by an underlying theory, that is of interest. For example, we can discuss the impulse response characteristics of this system. For particular configurations of $\Theta$, such as a triangular matrix, we can meaningfully interpret innovations, $\varepsilon$. As we explored at great length in the previous chapter, however, as this model stands, there is not sufficient information contained in the reduced form as just stated to deduce the structural parameters. A possibly large number of restrictions must be imposed on $\Theta$ and $\Phi$ to enable us to deduce structural forms from reduced-form estimates, which are always obtainable. The recent work on "structural VARs" centers on the types of restrictions and forms of the theory that can be brought to bear to allow this analysis to proceed. See, for example, the survey in Hamilton (1994, Chapter 11). At this point, the literature on this subject has come full circle because the contemporary development of "unstructured VARs" becomes very much the analysis of quite conventional dynamic structural simultaneous equations models. Indeed, current research [e.g., Diebold (1998a)] brings the literature back into line with the structural modeling tradition by demonstrating how VARs can be derived formally as the reduced forms of dynamic structural models. That is, the most recent applications have begun with structures and derived the reduced


forms as VARs, rather than departing from the VAR as a reduced form and attempting to deduce a structure from it by layering on restrictions.

19.6.8 APPLICATION: POLICY ANALYSIS WITH A VAR

Cecchetti and Rich (2001) used a structural VAR to analyze the effect of recent disinflationary policies of the Fed on aggregate output in the U.S. economy. The Fed's policy of the last two decades has leaned more toward controlling inflation and less toward stimulation of the economy. The authors argue that the long-run benefits of this policy include economic stability and increased long-term trend output growth. But there is a short-term cost in lost output. Their study seeks to estimate the "sacrifice ratio," which is a measure of the cumulative cost of this policy. The specific indicator they study measures the cumulative output loss after $\tau$ periods of a policy shock at time $t$, where the (persistent) shock is measured as the change in the level of inflation.

19.6.8a A VAR Model for the Macroeconomic Variables

The model proposed for estimating the ratio is a structural VAR,

\[
\begin{aligned}
\Delta y_t &= \sum_{i=1}^{p} b_{11}^i \Delta y_{t-i} + b_{12}^0 \Delta\pi_t + \sum_{i=1}^{p} b_{12}^i \Delta\pi_{t-i} + \varepsilon_t^y, \\
\Delta\pi_t &= b_{21}^0 \Delta y_t + \sum_{i=1}^{p} b_{21}^i \Delta y_{t-i} + \sum_{i=1}^{p} b_{22}^i \Delta\pi_{t-i} + \varepsilon_t^\pi,
\end{aligned}
\]

where $y_t$ is aggregate real output in period $t$ and $\pi_t$ is the rate of inflation from period $t-1$ to $t$, and the model is cast in terms of rates of changes of these two variables. (Note, therefore, that sums of $\Delta\pi_t$ measure accumulated changes in the rate of inflation, not changes in the CPI.) The innovations, $\varepsilon_t = (\varepsilon_t^y, \varepsilon_t^\pi)'$, are assumed to have mean 0 and contemporaneous covariance matrix $E[\varepsilon_t\varepsilon_t'] = \Omega$, and to be strictly nonautocorrelated. (We have retained Cecchetti and Rich's notation for most of this discussion, save for the number of lags, which is denoted $n$ in their paper and $p$ here, and some other minor changes which will be noted in passing where necessary.)13 The equation system may also be written

\[ B(L)\begin{bmatrix} \Delta y_t \\ \Delta\pi_t \end{bmatrix} = \begin{bmatrix} \varepsilon_t^y \\ \varepsilon_t^\pi \end{bmatrix}, \]

where $B(L)$ is a $2 \times 2$ matrix of polynomials in the lag operator. The components of the disturbance (innovation) vector $\varepsilon_t$ are identified as shocks to aggregate supply and aggregate demand, respectively.
19.6.8b The Sacrifice Ratio

Interest in the study centers on the impact over time of structural shocks to output and the rate of inflation. In order to calculate these, the authors use the vector moving
13 The authors examine two other VAR models, a three-equation model of Shapiro and Watson (1988), which adds an equation in real interest rates, $(i_t - \pi_t)$, and a four-equation model by Gali (1992), which models $\Delta y_t$, $\Delta i_t$, $(i_t - \pi_t)$, and the real money stock, $(\Delta m_t - \pi_t)$. Among the foci of Cecchetti and Rich's paper was the surprisingly large variation in estimates of the sacrifice ratio produced by the three models. In the interest of brevity, we will restrict our analysis to Cecchetti's (1994) two-equation model.


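average (VMA) form of the model, which would be

\[
\begin{bmatrix} \Delta y_t \\ \Delta\pi_t \end{bmatrix}
= B(L)^{-1}\begin{bmatrix} \varepsilon_t^y \\ \varepsilon_t^\pi \end{bmatrix}
= A(L)\begin{bmatrix} \varepsilon_t^y \\ \varepsilon_t^\pi \end{bmatrix}
= \begin{bmatrix} A_{11}(L) & A_{12}(L) \\ A_{21}(L) & A_{22}(L) \end{bmatrix}\begin{bmatrix} \varepsilon_t^y \\ \varepsilon_t^\pi \end{bmatrix}
= \begin{bmatrix} \sum_{i=0}^{\infty} a_{11}^i \varepsilon_{t-i}^y + \sum_{i=0}^{\infty} a_{12}^i \varepsilon_{t-i}^\pi \\ \sum_{i=0}^{\infty} a_{21}^i \varepsilon_{t-i}^y + \sum_{i=0}^{\infty} a_{22}^i \varepsilon_{t-i}^\pi \end{bmatrix}.
\]

(Note that the superscript "i" in the last form of the model above is not an exponent; it is the index of the sequence of coefficients.) The impulse response functions for the model corresponding to (19-30) are precisely the coefficients in $A(L)$. In particular, the effect on the change in inflation $\tau$ periods later of a change in $\varepsilon_t^\pi$ in period $t$ is $a_{22}^\tau$. The total effect from time $t+0$ to time $t+\tau$ would be the sum of these, $\sum_{i=0}^{\tau} a_{22}^i$. The counterparts for the rate of output would be $\sum_{i=0}^{\tau} a_{12}^i$. However, what is needed is not the effect only on period $\tau$'s output, but the cumulative effect on output from the time of the shock up to period $\tau$. That would be obtained by summing these period-specific effects, to obtain $\sum_{i=0}^{\tau}\sum_{j=0}^{i} a_{12}^j$. Combining terms, the sacrifice ratio is

\[
S_{\varepsilon^\pi}(\tau)
= \frac{\sum_{j=0}^{\tau} \partial y_{t+j}/\partial\varepsilon_t^\pi}{\partial\pi_{t+\tau}/\partial\varepsilon_t^\pi}
= \frac{\sum_{i=0}^{0} a_{12}^i + \sum_{i=0}^{1} a_{12}^i + \cdots + \sum_{i=0}^{\tau} a_{12}^i}{\sum_{i=0}^{\tau} a_{22}^i}
= \frac{\sum_{i=0}^{\tau}\sum_{j=0}^{i} a_{12}^j}{\sum_{i=0}^{\tau} a_{22}^i}.
\]

The function $S_{\varepsilon^\pi}(\tau)$ is then examined over long periods to study the long-term effects of monetary policy.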
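Given the two sequences of VMA coefficients, the ratio is a double cumulative sum divided by a single one. The sketch below is illustrative: `sacrifice_ratio` is a hypothetical helper, and the coefficient sequences are made-up numbers, not estimates from Cecchetti and Rich's data.

```python
import numpy as np

def sacrifice_ratio(a12, a22, tau):
    """S(tau) from VMA coefficient sequences a12[i], a22[i], i = 0, 1, ...

    Numerator  : sum_{i=0}^{tau} sum_{j=0}^{i} a12[j]  (cumulative output effect)
    Denominator: sum_{i=0}^{tau} a22[i]                (effect on the inflation level)
    """
    a12 = np.asarray(a12, dtype=float)
    a22 = np.asarray(a22, dtype=float)
    if tau + 1 > len(a12) or tau + 1 > len(a22):
        raise ValueError("need coefficients for lags 0..tau")
    numerator = np.cumsum(a12[: tau + 1]).sum()   # cumsum gives the inner sums
    denominator = a22[: tau + 1].sum()
    return numerator / denominator

# Made-up coefficient sequences, for illustration only.
a12 = [-0.5, -0.3, -0.1]
a22 = [1.0, 0.5, 0.25]

S0 = sacrifice_ratio(a12, a22, tau=0)   # -0.5 / 1.0
S2 = sacrifice_ratio(a12, a22, tau=2)   # (-0.5 - 0.8 - 0.9) / 1.75
```

In an application, the sequences would be taken from the (1,2) and (2,2) elements of the estimated $A_i$ matrices.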
19.6.8c Identification and Estimation of a Structural VAR Model

Estimation of this model requires some manipulation. The structural model is a conventional linear simultaneous equations model of the form

\[ B_0 y_t = B x_t + \varepsilon_t, \]

where $y_t$ is $[\Delta y_t, \Delta\pi_t]'$ and $x_t$ is the lagged values on the right-hand side. As we saw in Section 15.3.1, without further restrictions, a model such as this is not identified (estimable). A total of $M^2$ restrictions ($M$ is the number of equations, here two) are needed to identify the model. In the familiar cases of simultaneous equations models that we examined in Chapter 15, identification is usually secured through exclusion restrictions, that is, zero restrictions, either in $B_0$ or $B$. This type of exclusion restriction would be unnatural in a model such as this one; there would be no basis for poking specific holes in the coefficient matrices. The authors take a different approach, which requires us to look more closely at the different forms the time-series model can take. Write the structural form as

\[ B_0 y_t = B_1 y_{t-1} + B_2 y_{t-2} + \cdots + B_p y_{t-p} + \varepsilon_t, \]

where

\[ B_0 = \begin{bmatrix} 1 & -b_{12}^0 \\ -b_{21}^0 & 1 \end{bmatrix}. \]

As noted, this is in the form of a conventional simultaneous equations model. Assuming that $B_0$ is nonsingular, which for this two-equation system requires only that $1 - b_{12}^0 b_{21}^0$


    not equal zero we can obtain the reduced form of the model as yt B 1 B1 yt 1 B 1 B2 yt 2 B 1 B p yt p B 1 t 0 0 0 0 D1 yt 1 D2 yt 2 D p yt p t 19 33

    where t is the vector of reduced form innovations Now collect the terms in the equivalent form I D1L D2L2 yt t The moving average form that we obtained earlier is yt I D1L D2L2 1 t Assuming stability of the system we can also write this as yt I D1L D2L2 1 t I D1L D2L2 1 B 1 t 0 I C1L C2L2 t t C1 t 1 C2 t 2 B 1 t C1 t 1 C2 t 2 0 So the C j matrices correspond to our A j matrices in the original formulation But this manipulation has added something We can see that A0 B 1 Looking ahead the 0 reduced form equations can be estimated by least squares Whether the structural parameters and thereafter the VMA parameters can as well depends entirely on whether B0 can be estimated From 19 33 we can see that if B0 can be estimated then B1 B p can also just by premultiplying the reduced form coef cient matrices by this estimated B0 So we must now consider this issue This is precisely the conclusion we drew at the beginning of Section 15 3 Recall the initial assumption that E t t In the reduced form we assume E t t As we know reduced forms are always estimable indeed by least squares if the assumptions of the model are correct That means that is estimable by the least squares residual variances and covariance From the earlier derivation we have that B 1 B 1 A0 A0 Again see the beginning of Section 15 3 The authors 0 0 have secured identi cation of the model through this relationship In particular they assume rst that I Assuming that I we now have that A0 A0 where is an estimable matrix with three free parameters Since A0 is 2 2 one more restriction is needed to secure identi cation At this point the authors invoking Blanchard and Quah 1989 assume that demand shocks have no permanent effect on the level of i output This is equivalent to A12 1 i 0 a12 0 This might seem like a cumbersome restriction to impose But the matrix A 1 is I D1 D2 D p 1 A0 FA0 and the components D j have been estimated as the reduced form coef cient matrices so A12 1 0 assumes only that the upper right 
element of this matrix is zero. We now obtain the equations needed to solve for $A_0$. First,

    \[ A_0 A_0' = \begin{bmatrix} (a_{11}^0)^2 + (a_{12}^0)^2 & a_{11}^0 a_{21}^0 + a_{12}^0 a_{22}^0 \\ a_{11}^0 a_{21}^0 + a_{12}^0 a_{22}^0 & (a_{21}^0)^2 + (a_{22}^0)^2 \end{bmatrix} = \begin{bmatrix} \omega_{11} & \omega_{12} \\ \omega_{12} & \omega_{22} \end{bmatrix}, \tag{19-34} \]

    which provides three equations. Second, the theoretical restriction is

    \[ [F A_0]_{12} = f_{11} a_{12}^0 + f_{12} a_{22}^0 = 0 . \]

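    This provides the four equations needed to identify the four elements in $A_0$.¹⁴ Collecting results, the estimation strategy is first to estimate $D_1, \ldots, D_p$ and $\Omega$ in the reduced form, by least squares. (They set $p = 8$.) Then use the restrictions and (19-34) to obtain the elements of $A_0 = B_0^{-1}$ and, finally, $B_j = A_0^{-1} D_j$. The last step is estimation of the matrices of impulse responses, which can be done as follows. We return to the reduced form which, using our augmentation trick, we write as

    \[ \begin{bmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-p+1} \end{bmatrix} = \begin{bmatrix} D_1 & D_2 & \cdots & D_p \\ I & 0 & \cdots & 0 \\ \vdots & \ddots & & \vdots \\ 0 & \cdots & I & 0 \end{bmatrix} \begin{bmatrix} y_{t-1} \\ y_{t-2} \\ \vdots \\ y_{t-p} \end{bmatrix} + \begin{bmatrix} A_0 \varepsilon_t \\ 0 \\ \vdots \\ 0 \end{bmatrix} . \tag{19-35} \]

    For convenience, arrange this result as $Y_t = (DL) Y_t + w_t$. Now, solve this for $Y_t$ to obtain the final form $Y_t = [I - DL]^{-1} w_t$. Write this in the spectral form and expand as we did earlier, to obtain

    \[ Y_t = \sum_{i=0}^{\infty} P \Lambda^i Q\, w_{t-i} . \tag{19-36} \]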
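    The identification step can be sketched numerically. Because $A_0 A_0'$ equals the reduced form covariance matrix and the upper right element of $F A_0$ is zero, $F A_0$ is a lower triangular factor of $F \Omega F'$, so one way to solve the four equations is a Cholesky factorization. This is only a sketch under that reading, not the authors' own procedure. The input matrices and every function name below are hypothetical, and the solution is unique only up to the signs of the columns of $A_0$.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def inv2(a):
    """Inverse of a 2 x 2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

def chol2(s):
    """Lower triangular Cholesky factor of a 2 x 2 positive definite matrix."""
    l11 = math.sqrt(s[0][0])
    l21 = s[1][0] / l11
    l22 = math.sqrt(s[1][1] - l21 ** 2)
    return [[l11, 0.0], [l21, l22]]

def identify_a0(omega, d_sum):
    """Solve A0 A0' = omega subject to [F A0]_12 = 0, with F = (I - d_sum)^(-1)."""
    i_minus_d = [[(1.0 if r == c else 0.0) - d_sum[r][c] for c in range(2)]
                 for r in range(2)]
    f = inv2(i_minus_d)
    # G = F A0 is lower triangular and satisfies G G' = F omega F'.
    g = chol2(matmul(matmul(f, omega), transpose(f)))
    a0 = matmul(i_minus_d, g)          # A0 = F^(-1) G
    return a0, f

# Hypothetical inputs, chosen only for illustration.
omega = [[1.0, 0.3], [0.3, 0.5]]       # reduced form covariance matrix
d_sum = [[0.5, 0.1], [0.2, 0.3]]       # D_1 + D_2 + ... + D_p
a0, f = identify_a0(omega, d_sum)
```

    With the positive diagonal that the Cholesky factor imposes, the signs of the long-run responses are fixed by convention; a different sign normalization would flip the corresponding column of $A_0$.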

    14 At this point, an intriguing loose end arises. We have carried this discussion in the form of the original papers by Blanchard and Quah (1989) and Cecchetti and Rich (2001). Returning to the original structure, however, we see that since $A_0 = B_0^{-1}$, it actually does not have four unrestricted and unknown elements; it has two. The model is overidentified. We could have predicted this at the outset. As in our conventional simultaneous equations model, the normalizations in $B_0$ (ones on the diagonal) provide two restrictions of the $M^2 = 4$ required. Assuming that $\Sigma = I$ provides three more, and the theoretical restriction provides a sixth. Therefore, the four unknown elements in an unrestricted $B_0$ are overidentified. The assumption that $\Sigma = I$, in itself, may be a substantive and strong restriction. In the original data that Cecchetti and Rich used, over the period of their estimation, the unconditional variances of $\Delta y_t$ and $\Delta \pi_t$ are 0.923 and 0.676. The latter is far enough below one that one might expect this assumption actually to be substantive. It might seem convenient at this point to forego the theoretical restriction on long-term impacts, but it seems more natural to omit the restrictions on the scaling of $\Sigma$. With the two normalizations already in place, assuming that the innovations are uncorrelated ($\Sigma$ is diagonal) and that "demand shocks have no permanent effect on the level of output" together suffice to identify the model. Blanchard and Quah appear to reach the same conclusion (page 656), but then they also assume the unit variances [page 657, equation (1)]. They argue that the assumption of unit variances is just a convenient normalization, but this is not the case. Since the model is already identified without the assumption, the scaling restriction is substantive. Once again, this is clear from a look at the structure. The assumption that $B_0$ has ones on its diagonal has already scaled the equation. In fact, this is logically identical to assuming that the disturbance in a conventional regression model has variance one, which one normally would not do.


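    We will be interested in the uppermost subvector of $Y_t$, so we expand (19-36) to yield

    \[ \begin{bmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-p+1} \end{bmatrix} = \sum_{i=0}^{\infty} P \Lambda^i Q \begin{bmatrix} A_0 \varepsilon_{t-i} \\ 0 \\ \vdots \\ 0 \end{bmatrix} . \]

    The matrix in the summation is $Mp \times Mp$. The impact matrices we seek are the $M \times M$ matrices in the upper left corner of the spectral form, multiplied by $A_0$.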
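    As a rough illustration, the impact matrices can also be built without forming the spectral decomposition. The upper left $M \times M$ block of the $i$th power of the augmented coefficient matrix equals $\Psi_i$, where $\Psi_0 = I$ and $\Psi_i = \sum_{j=1}^{\min(i,p)} D_j \Psi_{i-j}$, so the sketch below uses that recursion in place of the spectral form and then multiplies by $A_0$. The function names and the numerical inputs are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def impulse_responses(d_mats, a0, horizon):
    """Return [Psi_0 A0, ..., Psi_horizon A0] for a VAR with lag matrices d_mats."""
    m = len(a0)
    psi = [[[1.0 if r == c else 0.0 for c in range(m)] for r in range(m)]]
    for i in range(1, horizon + 1):
        total = [[0.0] * m for _ in range(m)]
        # Psi_i = D_1 Psi_{i-1} + ... + D_p Psi_{i-p}, dropping terms with i - j < 0.
        for j in range(1, min(i, len(d_mats)) + 1):
            term = matmul(d_mats[j - 1], psi[i - j])
            total = [[total[r][c] + term[r][c] for c in range(m)] for r in range(m)]
        psi.append(total)
    return [matmul(p, a0) for p in psi]

# Hypothetical two-variable VAR(2), with A0 set to the identity for readability.
d1 = [[0.5, 0.1], [0.0, 0.2]]
d2 = [[0.1, 0.0], [0.0, 0.1]]
a0 = [[1.0, 0.0], [0.0, 1.0]]
responses = impulse_responses([d1, d2], a0, horizon=3)
```

    Element `responses[i][r][c]` is the response of variable `r`, `i` periods ahead, to a unit structural shock in variable `c`.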
    19.6.8.d Inference

    As noted at the end of Section 19.6.6, obtaining usable standard errors for estimates of impulse responses is a difficult (as yet unresolved) problem. Killian (1998) has suggested that bootstrapping is a preferable approach to using the delta method. Cecchetti and Rich reach the same conclusion, and likewise resort to a bootstrapping procedure. Their bootstrap procedure is carried out as follows. Let $\hat{\delta}$ and $\hat{\Omega}$ denote the full set of estimated coefficients and the estimated reduced form covariance matrix based on direct estimation. As suggested by Doan (1996), they construct a sequence of $N$ draws for the reduced form parameters, then recompute the entire set of impulse responses. The narrowest interval which contains 90 percent of these draws is taken to be a confidence interval for an estimated impulse function.
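    The last step, finding the narrowest interval that contains 90 percent of the draws, is easy to sketch once the draws for a single impulse response are in hand. The function below is a minimal illustration with a hypothetical name and made-up draws. It does not reproduce how the authors generated their draws.

```python
import math

def narrowest_interval(draws, coverage=0.90):
    """Shortest interval [lo, hi] containing at least `coverage` of the draws."""
    xs = sorted(draws)
    n = len(xs)
    # Number of draws the interval must contain (small tolerance guards
    # against floating point error in coverage * n).
    k = max(1, math.ceil(coverage * n - 1e-9))
    # Slide a window of k consecutive order statistics and keep the shortest.
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]

# Made-up bootstrap draws for one impulse response coefficient.
draws = [5.0, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8]
interval = narrowest_interval(draws)
```

    Unlike an equal-tailed percentile interval, this construction is free to drop all of the excluded draws from one tail, which matters when the bootstrap distribution is skewed.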
    19.6.8.e Empirical Results

    Cecchetti and Rich used quarterly observations on real aggregate output and the consumer price index. Their data set spanned 1959.1 to 1997.4. This is a subset of the data described in the Appendix Table F5.1. Before beginning their analysis, they subjected the data to the standard tests for stationarity. Figures 19.5 through 19.7 show
    FIGURE 19.5 Log GDP. [Figure omitted: log real GDP (LOGGDP) by quarter, 1959.1 to 1997.4.]

    FIGURE 19.6 The Quarterly Rate of Inflation. [Figure omitted: inflation rate (INFL) by quarter, 1959.1 to 1997.4.]

    FIGURE 19.7 Rates of Change, logGDP and the Rate of Inflation. [Figure omitted: first differences of logGDP (DLOGY) and of inflation (DPI) by quarter.]

    the log of real output, the rate of inflation, and the changes in these two variables. The first two figures do suggest that neither variable is stationary. On the basis of the Dickey-Fuller (1981) test (see Section 20.3), they found (as might be expected) that the $y_t$ and $\pi_t$ series both contain unit roots. They conclude that since output has a unit root, the identifying restriction on the long run effect of aggregate demand shocks on output is well defined and meaningful. The unit root in inflation allows for permanent shifts in its level. The lag length for the model is set at $p = 8$. Long run impulse response functions are truncated at 20 years (80 quarters). Analysis is based on the rate of change data shown in Figure 19.7. As a final check on the model, the authors examined the data for the possibility of a structural shift using the tests described in Section 7.5. None of the Andrews/Quandt supremum LM test, the Andrews/Ploberger exponential LM test, or the Andrews/Ploberger average LM test suggested that the underlying structure had changed (in spite of what seems likely to have been a major shift in Fed policy in the 1970s). On this basis, they concluded that the VAR is stable over the sample period. Figure 19.8 (Figures 3A and 3B taken from the article) shows their two separate estimated impulse response functions. The dotted lines in the figures show the bootstrap generated confidence bounds. Estimates of the sacrifice ratio for Cecchetti's model are 1.3219 for $\tau = 4$, 1.3204 for $\tau = 8$, 1.5700 for $\tau = 12$, 1.5219 for $\tau = 16$, and 1.3763 for $\tau = 20$. The authors also examined the forecasting performance of their model compared to Shapiro and Watson's and Gali's. The device used was to produce one step ahead, period $T + 1 \mid T$ forecasts for the model estimated using periods $1, \ldots, T$. The first reduced form of the model is fit using 1959.1 to 1975.1 and used to forecast 1975.2. Then, it is reestimated using 1959.1 to 1975.2 and used to forecast 1975.3, and so on. Finally, the root mean squared error of these out of sample forecasts is compared for three models. In each case, the level, rather than the rate of change, of the inflation rate is forecasted. Overall, the results suggest that the smaller model does a better job of estimating the impulse responses (has smaller confidence bounds and conforms more nearly with theoretical predictions) but performs worst of the three (slightly) in terms of the mean squared error of the out of sample forecasts. Since the unrestricted reduced form model is being used for the latter, this comes as no surprise. The end result follows essentially from the result that adding variables to a regression model improves its fit.
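    The expanding-window forecasting device described above can be sketched in a few lines. To keep the example self-contained, a univariate AR(1) fit by least squares stands in for the reduced form VAR, which is an assumption made purely for illustration. The function names and the data are hypothetical.

```python
import math

def fit_ar1(y):
    """Least squares regression of y_t on a constant and y_{t-1}: (intercept, slope)."""
    x, z = y[:-1], y[1:]
    n = len(x)
    mean_x, mean_z = sum(x) / n, sum(z) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxz = sum((a - mean_x) * (b - mean_z) for a, b in zip(x, z))
    slope = sxz / sxx
    return mean_z - slope * mean_x, slope

def recursive_rmse(y, first_forecast):
    """RMSE of one step ahead forecasts of y[first_forecast:], refitting each period."""
    errors = []
    for t in range(first_forecast, len(y)):
        intercept, slope = fit_ar1(y[:t])      # uses observations before t only
        forecast = intercept + slope * y[t - 1]
        errors.append(y[t] - forecast)
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical noiseless series y_t = 1 + 0.5 y_{t-1}; the forecasts are then exact.
series = [0.0]
for _ in range(12):
    series.append(1.0 + 0.5 * series[-1])
rmse = recursive_rmse(series, first_forecast=5)
```

    Each forecast uses only data available before the period being forecast, which is what makes the resulting root mean squared error an out of sample measure.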
    19.6.9 VARs IN MICROECONOMICS

    VARs have appeared in the microeconometrics literature as well. Chamberlain (1980) suggested that a useful approach to the analysis of panel data would be to treat each period's observation as a separate equation. For the case of $T = 2$, we would have

    \[ y_{i1} = \alpha_i + \beta' x_{i1} + \varepsilon_{i1}, \qquad y_{i2} = \alpha_i + \beta' x_{i2} + \varepsilon_{i2}, \]

    where $i$ indexes individuals and $\alpha_i$ are unobserved individual effects. This specification produces a multivariate regression, to which Chamberlain added restrictions related to the individual effects. Holtz-Eakin, Newey, and Rosen's (1988) approach is to specify

    FIGURE 19.8 Estimated Impulse Response Functions. [Figure omitted: Panel A, dynamic response to a monetary policy shock, real GDP (Cecchetti), in logs; Panel B, dynamic response to a monetary policy shock, inflation (Cecchetti), in percent; horizontal axes run from 0 to 20.]

    the equation as

    \[ y_{it} = \alpha_{0t} + \sum_{l=1}^{m} \alpha_{lt}\, y_{i,t-l} + \sum_{l=1}^{m} \delta_{lt}\, x_{i,t-l} + \Psi_t f_i + \mu_{it} . \]

    In their study, $y_{it}$ is hours worked by individual $i$ in period $t$ and $x_{it}$ is the individual's wage in that period. A second equation for earnings is specified with lagged values of hours and earnings on the right-hand side. The individual, unobserved effects are $f_i$. This model is similar to the VAR in (19-30), but it differs in several ways as well. The number of periods is quite small (14 yearly observations for each individual), but there are nearly 1000 individuals. The dynamic equation is specified for a specific period, however, so the relevant sample size in each case is $n$, not $T$. Also, the number of lags in the model used is relatively small; the authors fixed it at three. They thus have a two-equation VAR containing 12 unknown parameters, six in each equation. The authors used the model to analyze causality, measurement error, and parameter stability, that is, constancy of $\alpha_{lt}$ and $\delta_{lt}$ across time.
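    Example 19.8 VAR for Municipal Expenditures

    In Section 18.5, we examined a model of municipal expenditures proposed by Dahlberg and Johansson (2000). Their equation of interest is

    \[ \Delta S_{i,t} = \mu_t + \sum_{j=1}^{m} \alpha_j \Delta S_{i,t-j} + \sum_{j=1}^{m} \beta_j \Delta R_{i,t-j} + \sum_{j=1}^{m} \gamma_j \Delta G_{i,t-j} + u_{i,t}^{S} \]

    for $i = 1, \ldots, N = 265$ and $t = m + 1, \ldots, 9$. $S_{i,t}$, $R_{i,t}$, and $G_{i,t}$ are municipal spending, receipts (taxes and fees), and central government grants, respectively. Analogous equations are specified for the current values of $\Delta R_{i,t}$ and $\Delta G_{i,t}$. This produces a vector autoregression for each municipality,

    \[ \begin{bmatrix} \Delta S_{i,t} \\ \Delta R_{i,t} \\ \Delta G_{i,t} \end{bmatrix} = \begin{bmatrix} \mu_{S,t} \\ \mu_{R,t} \\ \mu_{G,t} \end{bmatrix} + \begin{bmatrix} \alpha_{S,1} & \beta_{S,1} & \gamma_{S,1} \\ \alpha_{R,1} & \beta_{R,1} & \gamma_{R,1} \\ \alpha_{G,1} & \beta_{G,1} & \gamma_{G,1} \end{bmatrix} \begin{bmatrix} \Delta S_{i,t-1} \\ \Delta R_{i,t-1} \\ \Delta G_{i,t-1} \end{bmatrix} + \cdots + \begin{bmatrix} \alpha_{S,m} & \beta_{S,m} & \gamma_{S,m} \\ \alpha_{R,m} & \beta_{R,m} & \gamma_{R,m} \\ \alpha_{G,m} & \beta_{G,m} & \gamma_{G,m} \end{bmatrix} \begin{bmatrix} \Delta S_{i,t-m} \\ \Delta R_{i,t-m} \\ \Delta G_{i,t-m} \end{bmatrix} + \begin{bmatrix} u_{i,t}^{S} \\ u_{i,t}^{R} \\ u_{i,t}^{G} \end{bmatrix} . \]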

    The model was estimated by GMM, so the discussion at the end of the preceding section applies here. We will be interested in testing whether changes in municipal spending, $\Delta S_{i,t}$, are Granger caused by changes in revenues, $\Delta R_{i,t}$, and grants, $\Delta G_{i,t}$. The hypothesis to be tested is $\beta_{S,j} = \gamma_{S,j} = 0$ for all $j$. This hypothesis can be tested in the context of only the first equation. Parameter estimates and diagnostic statistics are given in Section 17.5. We can carry out the test in two ways. In the unrestricted equation with all three lagged values of all three variables, the minimized GMM criterion is $q = 22.8287$. If the lagged values of $\Delta R$ and $\Delta G$ are omitted from the $\Delta S$ equation, the criterion rises to 42.9182.¹⁵ There are 6 restrictions. The difference is 20.090, so the $F$ statistic is $20.09/6 = 3.348$. We have over 1,000 degrees of freedom for the denominator, with 265 municipalities and 5 years, so we can use the limiting value for the critical value. This is 2.10, so we may reject the hypothesis of noncausality and conclude that changes in revenues and grants do Granger cause changes in spending.
    15 Once again, these results differ from those given by Dahlberg and Johansson. As before, the difference results from our use of the same weighting matrix for all GMM computations, in contrast to their recomputation of the matrix for each new coefficient vector estimated.


    This seems hardly surprising. The alternative approach is to use a Wald statistic to test the six restrictions. Using the full GMM results for the $S$ equation with 14 coefficients, we obtain a Wald statistic of 15.3030. The critical chi-squared would be $6 \times 2.1 = 12.6$, so once again, the hypothesis is rejected. Dahlberg and Johansson approach the causality test somewhat differently by using a sequential testing procedure. (See their page 413 for discussion.) They suggest that the intervening variables be dropped in turn. By dropping first $G$, then $R$ and $G$, and then first $R$, then $G$ and $R$, they conclude that grants do not Granger cause changes in spending ($q$ = only .07) but, in the absence of grants, revenues do ($q$ | grants excluded = 24.6). The reverse order produces test statistics of 12.2 and 12.4, respectively. Our own calculations of the four values of $q$ yield 22.829 for the full model, 23.1302 with only grants excluded, 23.0894 with only $R$ excluded, and 42.9182 with both excluded, which disagrees with their results but is consistent with our earlier ones.
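    The arithmetic behind the two tests in this example can be retraced directly from the figures quoted in the text. The snippet below only reproduces that arithmetic; the critical value 2.10 is the one the text reports, not something computed here, and the variable names are ours.

```python
# Minimized GMM criteria for the S equation, as reported in the text.
q_unrestricted = 22.8287     # three lags of S, R, and G included
q_restricted = 42.9182       # lagged R and G omitted
n_restrictions = 6

difference = q_restricted - q_unrestricted
f_stat = difference / n_restrictions

f_critical = 2.10                            # limiting critical value quoted in the text
chi2_critical = n_restrictions * f_critical  # the text's 6 x 2.1 = 12.6
wald = 15.3030                               # Wald statistic reported in the text

reject_by_f = f_stat > f_critical
reject_by_wald = wald > chi2_critical
```

    Both comparisons lead to rejection of noncausality, which matches the conclusion drawn in the example.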
    Instability of a VAR Model

    The coefficients for the three-variable VAR model in Example 19.8 appear in Table 18.4. The characteristic roots of the $9 \times 9$ coefficient matrix are 0.6025, 0.2529, 0.0840, $(1.4586 \pm 0.6584i)$, $(0.6992 \pm 0.2019i)$, and $(0.0611 \pm 0.6291i)$. The first pair of complex roots has modulus greater than one, so the estimated VAR is unstable. The data do not appear to be consistent with this result, though with only five useable years of data, that conclusion is a bit fragile. One might suspect that the model is overfit. Since the disturbances are assumed to be uncorrelated across equations, the three equations have been estimated separately. The GMM criterion for the system is then the sum of those for the three equations. For $m = 3$, 2, and 1, respectively, these are $(22.8287 + 30.5398 + 17.5810) = 70.9495$, $(30.4526 + 34.2590 + 20.5416) = 85.2532$, and $(34.4986 + 53.2506 + 27.5927) = 115.6119$. The difference statistic for testing down from three lags to two is 14.3037. The critical chi-squared for nine degrees of freedom is 19.62, so it would appear that $m = 3$ may be too large. The results clearly reject the hypothesis that $m = 1$, however. The coefficients for a model with two lags instead of one appear in Table 17.4. If we construct the coefficient matrix from these results instead, we obtain a $6 \times 6$ matrix whose characteristic roots are 1.5817, 0.2196, $0.3509 \pm 0.4362i$, and $0.0968 \pm 0.2791i$. The system remains unstable.
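    The instability claim is easy to check from the reported roots, since only the modulus of each root matters. The sketch below takes the three complex pairs as printed (the sign of a real part does not affect the modulus) and also retraces the testing-down statistic from the reported criterion values. The variable names are ours.

```python
# One member of each complex pair of characteristic roots, as reported in the text.
complex_roots = [complex(1.4586, 0.6584),
                 complex(0.6992, 0.2019),
                 complex(0.0611, 0.6291)]

# The modulus of a + bi is sqrt(a^2 + b^2); abs() computes it for complex numbers.
moduli = [abs(z) for z in complex_roots]

# Roots of the coefficient matrix must all have modulus below one for stability.
unstable = any(m > 1.0 for m in moduli)

# GMM criteria for the three equations, summed over the system.
q_three_lags = 22.8287 + 30.5398 + 17.5810
q_two_lags = 30.4526 + 34.2590 + 20.5416
testing_down = q_two_lags - q_three_lags
```

    The first pair has modulus of roughly 1.60, which is what drives the conclusion that the estimated system is unstable.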

    19.7 SUMMARY AND CONCLUSIONS

    This chapter has surveyed a particular type of regression model, the dynamic regression. The signature feature of the dynamic model is effects that are delayed or that persist through time. In a static regression setting, effects embodied in coefficients are assumed to take place all at once. In the dynamic model, the response to an innovation is distributed through several periods. The first three sections of this chapter examined several different forms of single equation models that contained lagged effects. The progression, which mirrors the current literature, is from tightly structured lag "models" (which were sometimes formulated to respond to a shortage of data rather than to correspond to an underlying theory) to unrestricted models with multiple period lag structures. We also examined several hybrids of these two forms, models that allow long lags but build some regular structure into the lag weights. Thus, our model of the formation of expectations of inflation is reasonably flexible, but does assume a specific behavioral mechanism. We then examined several methodological issues. In this context as elsewhere, there is a preference in the methods toward forming broad unrestricted models and using familiar inference tools to reduce them to the final appropriate specification. The second half of the chapter was devoted to a type of seemingly unrelated


    regressions model. The vector autoregression, or VAR, has been a major tool in recent research. After developing the econometric framework, we examined two applications, one in macroeconomics centered on monetary policy and one from microeconomics.

    Key Terms and Concepts

    • Autocorrelation
    • Autoregression
    • Autoregressive distributed lag
    • Autoregressive form
    • Autoregressive model
    • Characteristic equation
    • Common factor
    • Distributed lag
    • Dynamic regression model
    • Elasticity
    • Equilibrium
    • Equilibrium error
    • Equilibrium multiplier
    • Equilibrium relationship
    • Error correction
    • Exogeneity
    • Expectation
    • Finite lags
    • General-to-simple method
    • Granger noncausality
    • Impact multiplier
    • Impulse response
    • Infinite lag model
    • Infinite lags
    • Innovation
    • Invertible
    • Lagged variables
    • Lag operator
    • Lag weight
    • Mean lag
    • Median lag
    • Moving-average form
    • One period ahead forecast
    • Partial adjustment
    • Phillips curve
    • Polynomial in lag operator
    • Polynomial lag model
    • Random walk with drift
    • Rational lag
    • Simple-to-general approach
    • Specification
    • Stability
    • Stationary
    • Strong exogeneity
    • Structural model
    • Structural VAR
    • Superconsistent
    • Univariate autoregression
    • Vector autoregression (VAR)
    • Vector moving average (VMA)

    Exercises

    1. Obtain the mean lag and the long- and short-run multipliers for the following distributed lag models.
       a. $y_t = 0.55(0.02 x_t + 0.15 x_{t-1} + 0.43 x_{t-2} + 0.23 x_{t-3} + 0.17 x_{t-4}) + e_t$.
       b. The model in Exercise 5.
       c. The model in Exercise 6. (Do for either $x$ or $z$.)

    2. Explain how to estimate the parameters of the following model:
       \[ y_t = \alpha + \beta x_t + \gamma y_{t-1} + \delta y_{t-2} + e_t, \qquad e_t = \rho e_{t-1} + u_t . \]
       Is there any problem with ordinary least squares? Let $y_t$ be consumption and let $x_t$ be disposable income. Using the method you have described, fit the previous model to the data in Appendix Table F5.1. Report your results.

    3. Show how to estimate a polynomial distributed lag model with lags of six periods and a third-order polynomial.

    4. Expand the rational lag model
       \[ y_t = \frac{0.6 + 2L}{1 - 0.6L + 0.5L^2}\, x_t + e_t . \]
       What are the coefficients on $x_t$, $x_{t-1}$, $x_{t-2}$, $x_{t-3}$, and $x_{t-4}$?

    5. Suppose that the model of Exercise 4 were specified as
       \[ y_t = \alpha + \frac{\beta + \gamma L}{1 - \delta_1 L - \delta_2 L^2}\, x_t + e_t . \]
       Describe a method of estimating the parameters. Is ordinary least squares consistent?

    6. Describe how to estimate the parameters of the model
       \[ y_t = \alpha + \beta \frac{x_t}{1 - \gamma L} + \delta \frac{z_t}{1 - \phi L} + \varepsilon_t, \]
       where $\varepsilon_t$ is a serially uncorrelated, homoscedastic classical disturbance.

    7. We are interested in the long run multiplier in the model
       \[ y_t = \alpha_0 + \sum_{j=0}^{6} \beta_j x_{t-j} + \varepsilon_t . \]
       Assume that $x_t$ is an autoregressive series, $x_t = r x_{t-1} + v_t$, where $|r| < 1$.
       a. What is the long run multiplier in this model?
       b. How would you estimate the long-run multiplier in this model?
       c. Suppose that the preceding is the true model but you linearly regress $y_t$ only on a constant and the first 5 lags of $x_t$. How does this affect your estimate of the long run multiplier?
       d. Same as c. for 4 lags instead of 5.
       e. Using the macroeconomic data in Appendix F5.1, let $y_t$ be the log of real investment and $x_t$ be the log of real output. Carry out the computations suggested and report your findings. Specifically, how does the omission of a lagged value affect estimates of the short-run and long-run multipliers in the unrestricted lag model?


    20 TIME-SERIES MODELS

    20.1 INTRODUCTION

    For forecasting purposes, a simple model that describes the behavior of a variable (or a set of variables) in terms of past values, without the benefit of a well-developed theory, may well prove quite satisfactory. Researchers have observed that the large simultaneous-equations macroeconomic models constructed in the 1960s frequently have poorer forecasting performance than fairly simple, univariate time-series models based on just a few parameters and compact specifications. It is just this observation that has raised to prominence the univariate time-series forecasting models pioneered by Box and Jenkins (1984).

    In this chapter, we introduce some of the tools employed in the analysis of time-series data.¹ Section 20.2 describes stationary stochastic processes. We encountered this body of theory in Chapters 12, 16, and 19, where we discovered that certain assumptions were required to ascribe familiar properties to a time series of data. We continue that discussion by defining several characteristics of a stationary time series. The recent literature in macroeconometrics has seen an explosion of studies of nonstationary time series. Nonstationarity mandates a revision of the standard inference tools we have used thus far. In Section 20.3, on nonstationarity and unit roots, we discuss some of these tools. Section 20.4, on cointegration, discusses some extensions of regression models that are made necessary when strongly trended, nonstationary variables appear in them.

    Some of the concepts to be discussed here were introduced in Section 12.2. Section 12.2 also contains a cursory introduction to the nature of time-series processes. It will be useful to review that material before proceeding with the rest of this chapter. Finally, Sections 15.9.1 on estimation and 15.9.2 and 19.4.3 on stability of dynamic models will be especially useful for the latter sections of this chapter.

    1 Each topic discussed here is the subject of a vast literature with articles and book-length treatments at all levels. For example, two survey papers on the subject of unit roots in economic time-series data, Diebold and Nerlove (1990) and Campbell and Perron (1991), cite between them over 200 basic sources on the subject. The literature on unit roots and cointegration is almost surely the most rapidly moving target in econometrics. Stock's (1994) survey adds hundreds of references to those in the aforementioned surveys and brings the literature up to date as of then. Useful basic references on the subjects of this chapter are Box and Jenkins (1984); Judge et al. (1985); Mills (1990); Granger and Newbold (1996); Granger and Watson (1984); Hendry, Pagan, and Sargan (1984); Geweke (1984); and especially Harvey (1989, 1990); Enders (1995); Hamilton (1994); and Patterson (2000). There are also many survey style and pedagogical articles on these subjects. The aforementioned paper by Diebold and Nerlove is a useful tour guide through some of the literature. We recommend Dickey, Bell, and Miller (1986) and Dickey, Jansen, and Thorton (1991) as well. The latter is an especially clear introduction at a very basic level of the fundamental tools for empirical researchers.


    20.2 STATIONARY STOCHASTIC PROCESSES

    The essential building block for the models to be discussed in this chapter is the white noise time-series process, $\{\varepsilon_t\}$, $t = -\infty, \ldots, +\infty$, where each element in the sequence has $E[\varepsilon_t] = 0$, $E[\varepsilon_t^2] = \sigma_\varepsilon^2$, and $\mathrm{Cov}[\varepsilon_t, \varepsilon_s] = 0$ for all $s \neq t$. Each element in the series is a random draw from a population with zero mean and constant variance. It is occasionally assumed that the draws are independent or normally distributed, although for most of our analysis, neither assumption will be essential. A univariate time-series model describes the behavior of a variable in terms of its own past values. Consider, for example, the autoregressive disturbance models introduced in Chapter 12,

    \[ u_t = \rho u_{t-1} + \varepsilon_t . \tag{20-1} \]

    Autoregressive disturbances are generally the residual variation in a regression model built up from what may be an elaborate underlying theory, $y_t = \beta' x_t + u_t$. The theory usually stops short of stating what enters the disturbance. But the presumption that some time-series process generates $x_t$ should extend equally to $u_t$. There are two ways to interpret this simple series. As stated above, $u_t$ equals a multiple, $\rho$, of its previous value plus an "innovation," $\varepsilon_t$. Alternatively, by manipulating the series, we showed that $u_t$ could be interpreted as an aggregation of the entire history of the $\varepsilon_t$'s. Occasionally, statistical evidence is convincing that a more intricate process is at work in the disturbance. Perhaps a second-order autoregression,

    \[ u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + \varepsilon_t, \tag{20-2} \]

    better explains the movement of the disturbances in the regression. The model may not arise naturally from an underlying behavioral theory. But in the face of certain kinds of statistical evidence, one might conclude that the more elaborate model would be preferable.² This section will describe several alternatives to the AR(1) model that we have relied on in most of the preceding applications.
    20.2.1 AUTOREGRESSIVE MOVING-AVERAGE PROCESSES

    The variable $y_t$ in the model

    \[ y_t = \mu + \gamma y_{t-1} + \varepsilon_t \tag{20-3} \]

    is said to be autoregressive (or self-regressive) because under certain assumptions, $E[y_t \mid y_{t-1}] = \mu + \gamma y_{t-1}$. A more general $p$th-order autoregression or AR($p$) process would be written

    \[ y_t = \mu + \gamma_1 y_{t-1} + \gamma_2 y_{t-2} + \cdots + \gamma_p y_{t-p} + \varepsilon_t . \tag{20-4} \]

    2 For example, the estimates of $\varepsilon_t$ computed after a correction for first-order autocorrelation may fail tests of randomness such as the LM (Section 12.7.1) test.


    The analogy to the classical regression is clear. Now, consider the first order moving average, or MA(1) specification,

    \[ y_t = \mu + \varepsilon_t - \theta \varepsilon_{t-1} . \tag{20-5} \]

    By writing $y_t = \mu + (1 - \theta L)\varepsilon_t$ or

    \[ \frac{y_t}{1 - \theta L} = \frac{\mu}{1 - \theta} + \varepsilon_t, \]

    ³ we find that

    \[ y_t = \frac{\mu}{1 - \theta} - \theta y_{t-1} - \theta^2 y_{t-2} - \cdots + \varepsilon_t . \]

    Once again, the effect is to represent $y_t$ as a function of its own past values. An extremely general model that encompasses (20-4) and (20-5) is the autoregressive moving average, or ARMA($p$, $q$), model:

    \[ y_t = \mu + \gamma_1 y_{t-1} + \gamma_2 y_{t-2} + \cdots + \gamma_p y_{t-p} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \cdots - \theta_q \varepsilon_{t-q} . \tag{20-6} \]

    Note the convention that the ARMA($p$, $q$) process has $p$ autoregressive (lagged dependent-variable) terms and $q$ lagged moving-average terms. Researchers have found that models of this sort with relatively small values of $p$ and $q$ have proved quite effective as forecasting models. The disturbances $\varepsilon_t$ are labeled the innovations in the model. The term is fitting because the only new information that enters the processes in period $t$ is this innovation. Consider, then, the AR(1) process

    \[ y_t = \mu + \gamma y_{t-1} + \varepsilon_t . \tag{20-7} \]

    Either by successive substitution or by using the lag operator, we obtain $(1 - \gamma L) y_t = \mu + \varepsilon_t$ or

    \[ y_t = \frac{\mu}{1 - \gamma} + \sum_{i=0}^{\infty} \gamma^i \varepsilon_{t-i} . \tag{20-8} \]

    ⁴ The observed series is a particular type of aggregation of the history of the innovations. The moving average, MA($q$) model,

    \[ y_t = \mu + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \cdots - \theta_q \varepsilon_{t-q} = \mu + D(L)\varepsilon_t, \tag{20-9} \]

    is yet another, particularly simple form of aggregation in that only information from the $q$ most recent periods is retained. The general result is that many time-series processes can be viewed either as regressions on lagged values with additive disturbances or as

    3 The lag operator is discussed in Section 19.2.2. Since $\mu$ is a constant, $(1 - \theta L)^{-1}\mu = \mu + \theta\mu + \theta^2\mu + \cdots = \mu/(1 - \theta)$. The lag operator may be set equal to one when it operates on a constant.

    4 See Section 19.3.2 for discussion of models with infinite lag structures.


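    aggregations of a history of innovations. They differ from one to the next in the form of that aggregation. More involved processes can be similarly represented in either an autoregressive or moving-average form. (We will turn to the mathematical requirements below.) Consider, for example, the ARMA(2, 1) process,

    \[ y_t = \mu + \gamma_1 y_{t-1} + \gamma_2 y_{t-2} + \varepsilon_t - \theta \varepsilon_{t-1}, \]

    which we can write as

    \[ (1 - \theta L)\varepsilon_t = y_t - \mu - \gamma_1 y_{t-1} - \gamma_2 y_{t-2} . \]

    If $|\theta| < 1$, then we can divide both sides of the equation by $(1 - \theta L)$ and obtain

    \[ \varepsilon_t = \sum_{i=0}^{\infty} \theta^i \left( y_{t-i} - \mu - \gamma_1 y_{t-i-1} - \gamma_2 y_{t-i-2} \right) . \]

    After some tedious manipulation, this equation produces the autoregressive form,

    \[ y_t = \frac{\mu}{1 - \theta} + \sum_{i=1}^{\infty} \pi_i y_{t-i} + \varepsilon_t, \]

    where

    \[ \pi_1 = \gamma_1 - \theta \quad \text{and} \quad \pi_j = -\left( \theta^j - \gamma_1 \theta^{j-1} - \gamma_2 \theta^{j-2} \right), \quad j = 2, 3, \ldots . \tag{20-10} \]

    Alternatively, by similar (yet more tedious) manipulation, we would be able to write

    \[ y_t = \frac{\mu}{1 - \gamma_1 - \gamma_2} + \left[ \frac{1 - \theta L}{1 - \gamma_1 L - \gamma_2 L^2} \right] \varepsilon_t = \frac{\mu}{1 - \gamma_1 - \gamma_2} + \sum_{i=0}^{\infty} \delta_i \varepsilon_{t-i} . \tag{20-11} \]

    In each case, the weights, $\pi_i$ in the autoregressive form and $\delta_i$ in the moving-average form, are complicated functions of the original parameters. But nonetheless, each is just an alternative representation of the same time-series process that produces the current value of $y_t$. This result is a fundamental property of certain time series. We will return to the issue after we formally define the assumption that we have used at several steps above that allows these transformations.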
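    The two sets of weights for the ARMA(2, 1) example can be generated numerically. The sketch below computes the autoregressive weights from the closed form in (20-10) and the moving-average weights from the recursion implied by (20-11), namely $\delta_0 = 1$, $\delta_1 = \gamma_1 - \theta$, and $\delta_j = \gamma_1 \delta_{j-1} + \gamma_2 \delta_{j-2}$ for $j \geq 2$. That recursion is our own restatement rather than something given in the text. The constant term is set aside, and the function names and parameter values are hypothetical.

```python
def ar_weights(g1, g2, theta, n):
    """pi_1..pi_n in y_t = sum_i pi_i y_{t-i} + eps_t for the ARMA(2,1) process."""
    weights = [g1 - theta]
    for j in range(2, n + 1):
        weights.append(-(theta ** j - g1 * theta ** (j - 1) - g2 * theta ** (j - 2)))
    return weights[:n]

def ma_weights(g1, g2, theta, n):
    """delta_0..delta_n in y_t = sum_i delta_i eps_{t-i} for the same process."""
    weights = [1.0, g1 - theta]
    for j in range(2, n + 1):
        weights.append(g1 * weights[j - 1] + g2 * weights[j - 2])
    return weights[:n + 1]

# Hypothetical parameter values with |theta| < 1 and a stationary AR part.
g1, g2, theta = 0.5, 0.2, 0.4
pis = ar_weights(g1, g2, theta, 8)
deltas = ma_weights(g1, g2, theta, 8)
```

    Because the two lag polynomials are inverses of one another, multiplying $1 - \sum_i \pi_i L^i$ by $\sum_i \delta_i L^i$ should return one, with zero coefficients on every positive power of $L$, which gives a simple numerical check on both sets of weights.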
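    20.2.2 STATIONARITY AND INVERTIBILITY

    At several points in the preceding, we have alluded to the notion of stationarity, either directly or indirectly by making certain assumptions about the parameters in the model. In Section 12.3.2, we characterized an AR(1) disturbance process $u_t = \rho u_{t-1} + \varepsilon_t$ as stationary if $|\rho| < 1$ and $\varepsilon_t$ is white noise. Then

    \[ E[u_t] = 0 \ \text{for all } t, \qquad \mathrm{Var}[u_t] = \frac{\sigma_\varepsilon^2}{1 - \rho^2}, \qquad \mathrm{Cov}[u_t, u_s] = \frac{\rho^{|t-s|} \sigma_\varepsilon^2}{1 - \rho^2} . \tag{20-12} \]

    If $|\rho| \geq 1$, then the variance and covariances are undefined.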


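    In the following, we use $\varepsilon_t$ to denote the white noise innovations in the process. The ARMA($p$, $q$) process will be denoted as in (20-6).

    DEFINITION 20.1 Covariance Stationarity
    A stochastic process $y_t$ is weakly stationary or covariance stationary if it satisfies the following requirements:⁵
    1. $E[y_t]$ is independent of $t$.
    2. $\mathrm{Var}[y_t]$ is a finite, positive constant, independent of $t$.
    3. $\mathrm{Cov}[y_t, y_s]$ is a finite function of $|t - s|$, but not of $t$ or $s$.

    The third requirement is that the covariance between observations in the series is a function only of how far apart they are in time, not the time at which they occur. These properties clearly hold for the AR(1) process immediately above. Whether they apply for the other models we have examined remains to be seen. We define the autocovariance at lag $k$ as

    \[ \lambda_k = \mathrm{Cov}[y_t, y_{t-k}] . \]

    Note that $\lambda_{-k} = \mathrm{Cov}[y_t, y_{t+k}] = \lambda_k$. Stationarity implies that autocovariances are a function of $k$, but not of $t$. For example, in (20-12), we see that the autocovariances of the AR(1) process $y_t = \mu + \gamma y_{t-1} + \varepsilon_t$ are

    \[ \mathrm{Cov}[y_t, y_{t-k}] = \frac{\gamma^k \sigma_\varepsilon^2}{1 - \gamma^2}, \quad k = 0, 1, \ldots . \tag{20-13} \]

    If $|\gamma| < 1$, then this process is stationary. For any MA($q$) series, $y_t = \mu + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \cdots - \theta_q \varepsilon_{t-q}$,

    \[ E[y_t] = \mu + E[\varepsilon_t] - \theta_1 E[\varepsilon_{t-1}] - \cdots - \theta_q E[\varepsilon_{t-q}] = \mu, \]
    \[ \mathrm{Var}[y_t] = \left( 1 + \theta_1^2 + \cdots + \theta_q^2 \right) \sigma_\varepsilon^2, \tag{20-14} \]
    \[ \mathrm{Cov}[y_t, y_{t-1}] = \left( -\theta_1 + \theta_1 \theta_2 + \theta_2 \theta_3 + \cdots + \theta_{q-1} \theta_q \right) \sigma_\varepsilon^2, \]

    and so on until

    \[ \mathrm{Cov}[y_t, y_{t-(q-1)}] = \left[ -\theta_{q-1} + \theta_1 \theta_q \right] \sigma_\varepsilon^2, \qquad \mathrm{Cov}[y_t, y_{t-q}] = -\theta_q \sigma_\varepsilon^2, \]

    5 Strong stationarity requires that the joint distribution of all sets of observations $(y_t, y_{t-1}, \ldots)$ be invariant to when the observations are made. For practical purposes in econometrics, this statement is a theoretical fine point. Although weak stationarity suffices for our applications, we would not normally analyze weakly stationary time series that were not strongly stationary as well. Indeed, we often go even beyond this step and assume joint normality.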


    and for lags greater than q the autocovariances are zero It follows therefore that nite moving average processes are stationary regardless of the values of the parameters The MA 1 process yt t t 1 is an important special case that has Var yt 1 2 e2 1 e2 and k 0 for k 1 For the AR 1 process the stationarity requirement is that 1 which in turn implies that the variance of the moving average representation in 20 8 is nite Consider the AR 2 process yt 1 yt 1 2 yt 2 t Write this equation as C L yt t where C L 1 1 L 2 L2 Then if it is possible we invert this result to produce yt C L 1 t Whether the inversion of the polynomial in the lag operator leads to a convergent series depends on the values of 1 and 2 If so then the moving average representation will be


    yt
    i 0

    i t i

    so that


    Var yt
    i 0

    i2 2

    Whether this result is nite or not depends on whether the series of i s is exploding or converging For the AR 2 case the series converges if 2 1 1 2 1 and 2 1 1 6 For the more general case the autoregressive process is stationary if the roots of the characteristic equation C z 1 1 z 2 z2 p zp 0 have modulus greater than one or lie outside the unit circle 7 It follows that if a stochastic process is stationary it has an in nite moving average representation and if not it does not The AR 1 process is the simplest case The characteristic equation is C z 1 z 0
    [6] This requirement restricts $(\gamma_1, \gamma_2)$ to within a triangle with points at $(2, -1)$, $(-2, -1)$, and $(0, 1)$.
    [7] The roots may be complex. (See Sections 15.9.2 and 19.4.3.) They are of the form $a \pm bi$, where $i = \sqrt{-1}$. The unit circle refers to the two-dimensional set of values of $a$ and $b$ defined by $a^2 + b^2 = 1$, which defines a circle centered at the origin with radius 1.


    and its single root is $1/\gamma$. This root lies outside the unit circle if $|\gamma| < 1$, which we saw earlier. Finally, consider the inversion of the moving-average process in (20-9) and (20-10). Whether this inversion is possible depends on the coefficients in $D(L)$ in the same fashion that stationarity hinges on the coefficients in $C(L)$. This counterpart to stationarity of an autoregressive process is called invertibility. For it to be possible to invert a moving-average process to produce an autoregressive representation, the roots of $D(L) = 0$ must be outside the unit circle. Notice, for example, that in (20-5), the inversion of the moving-average process is possible only if $|\theta| < 1$. Since the characteristic equation for the MA(1) process is $1 - \theta L = 0$, the root is $1/\theta$, which must be larger than one in absolute value. If the roots of the characteristic equation of a moving-average process all lie outside the unit circle, then the series is said to be invertible. Note that invertibility has no bearing on the stationarity of a process. All moving-average processes with finite coefficients are stationary. Whether an ARMA process is stationary or not depends only on the AR part of the model.
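The root conditions above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names are illustrative and do not come from the text. The same routine serves for stationarity (pass the $\gamma$'s of $C(L)$) and for invertibility (pass the $\theta$'s of $D(L)$), since both polynomials have the form $1 - c_1 z - \cdots - c_p z^p$.

```python
import numpy as np

def roots_outside_unit_circle(coefs):
    """True if all roots of 1 - c_1 z - ... - c_p z^p = 0 have modulus > 1.

    coefs = (c_1, ..., c_p): the gammas of C(L) for the AR part, or the
    thetas of D(L) for the MA part (written with minus signs, as in the text).
    """
    coefs = np.asarray(coefs, dtype=float)
    # numpy.roots expects coefficients ordered from the highest power down.
    poly = np.concatenate((-coefs[::-1], [1.0]))
    roots = np.roots(poly)
    return bool(np.all(np.abs(roots) > 1.0))

def ar2_in_triangle(g1, g2):
    """The equivalent AR(2) conditions: |g2| < 1, g1 + g2 < 1, g2 - g1 < 1."""
    return abs(g2) < 1 and g1 + g2 < 1 and g2 - g1 < 1
```

For an AR(2) process the two checks should agree; for example, both accept $(\gamma_1, \gamma_2) = (0.5, 0.3)$ and both reject $(0.5, 0.6)$.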
    20.2.3 AUTOCORRELATIONS OF A STATIONARY STOCHASTIC PROCESS

    The function $\lambda_k = \operatorname{Cov}[y_t, y_{t-k}]$ is called the autocovariance function of the process $y_t$. The autocorrelation function, or ACF, is obtained by dividing by the variance $\lambda_0$ to obtain $\rho_k = \lambda_k/\lambda_0$, $-1 \le \rho_k \le 1$.

    For a stationary process, the ACF will be a function of $k$ and the parameters of the process. The ACF is a useful device for describing a time-series process in much the same way that the moments are used to describe the distribution of a random variable. One of the characteristics of a stationary stochastic process is an autocorrelation function that either abruptly drops to zero at some finite lag or eventually tapers off to zero. The AR(1) process provides the simplest example, since $\rho_k = \gamma^k$, which is a geometric series that either declines monotonically from $\rho_0 = 1$ if $\gamma$ is positive or with a damped sawtooth pattern if $\gamma$ is negative. Note as well that for the process $y_t = \gamma y_{t-1} + \varepsilon_t$, $\rho_k = \gamma\rho_{k-1}$, $k \ge 1$, which bears a noteworthy resemblance to the process itself. For higher-order autoregressive series, the autocorrelations may decline monotonically or may progress in the fashion of a damped sine wave.[8] Consider, for example, the second-order autoregression, where we assume without loss of generality that $\mu = 0$
    [8] The behavior is a function of the roots of the characteristic equation. This aspect is discussed in Section 15.9 and especially 15.9.3.


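    (since we are examining second moments in deviations from the mean): $y_t = \gamma_1 y_{t-1} + \gamma_2 y_{t-2} + \varepsilon_t$. If the process is stationary, then $\operatorname{Var}[y_t] = \operatorname{Var}[y_{t-s}]$ for all $s$. Also, $\operatorname{Var}[y_t] = \operatorname{Cov}[y_t, y_t]$, and $\operatorname{Cov}[\varepsilon_t, y_{t-s}] = 0$ if $s > 0$. These relationships imply that $\lambda_0 = \gamma_1\lambda_1 + \gamma_2\lambda_2 + \sigma_\varepsilon^2$. Now, using additional lags, we find that $\lambda_1 = \gamma_1\lambda_0 + \gamma_2\lambda_1$ and $\lambda_2 = \gamma_1\lambda_1 + \gamma_2\lambda_0$. These three equations provide the solution

    $$\lambda_0 = \sigma_\varepsilon^2\,\frac{(1 - \gamma_2)/(1 + \gamma_2)}{(1 - \gamma_2)^2 - \gamma_1^2}. \qquad (20\text{-}15)$$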

    The variance is unchanging, so we can divide throughout by $\lambda_0$ to obtain the relationships for the autocorrelations, $\rho_1 = \gamma_1\rho_0 + \gamma_2\rho_1$. Since $\rho_0 = 1$, $\rho_1 = \gamma_1/(1 - \gamma_2)$. Using the same procedure for additional lags, we find that $\rho_2 = \gamma_1\rho_1 + \gamma_2$, so $\rho_2 = \gamma_1^2/(1 - \gamma_2) + \gamma_2$. Generally, then, for lags of two or more, $\rho_k = \gamma_1\rho_{k-1} + \gamma_2\rho_{k-2}$. Once again, the autocorrelations follow the same difference equation as the series itself. The behavior of this function depends on $\gamma_1$, $\gamma_2$, and $k$, although not in an obvious way. The inherent behavior of the autocorrelation function can be deduced from the characteristic equation.[9] For the second-order process we are examining, the autocorrelations are of the form $\rho_k = \phi_1(1/z_1)^k + \phi_2(1/z_2)^k$, where the two roots are[10]

    $$\frac{1}{z} = \frac{1}{2}\left[\gamma_1 \pm \sqrt{\gamma_1^2 + 4\gamma_2}\right].$$

    If the two roots are real, then we know that their reciprocals will be less than one in absolute value, so that $\rho_k$ will be the sum of two terms that are decaying to zero. If the two roots are complex, then $\rho_k$ will be the sum of two terms that are oscillating in the form of a damped sine wave.
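The AR(2) recursion and the variance formula can be traced numerically. This is a small sketch in plain Python; the function names are hypothetical, and the parameter values in the usage note are chosen only for illustration.

```python
def ar2_acf(g1, g2, nlags):
    """Autocorrelations rho_0, ..., rho_nlags of a stationary AR(2) process.

    Uses rho_0 = 1, rho_1 = g1 / (1 - g2), and then the difference equation
    rho_k = g1 * rho_{k-1} + g2 * rho_{k-2} for k >= 2.
    """
    rho = [1.0, g1 / (1.0 - g2)]
    for k in range(2, nlags + 1):
        rho.append(g1 * rho[k - 1] + g2 * rho[k - 2])
    return rho[: nlags + 1]

def ar2_variance(g1, g2, sigma2=1.0):
    """lambda_0 for the AR(2) process, from the three Yule-Walker equations."""
    return sigma2 * (1.0 - g2) / ((1.0 + g2) * ((1.0 - g2) ** 2 - g1 ** 2))
```

With $(\gamma_1, \gamma_2) = (0.5, 0.3)$ the characteristic roots are real and the autocorrelations stay positive while decaying; with $(1.0, -0.5)$ the discriminant $\gamma_1^2 + 4\gamma_2$ is negative, the roots are complex, and the autocorrelations change sign as they decay.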
    [9] The set of results that we would use to derive this result are exactly those we used in Section 19.4.3 to analyze the stability of a dynamic equation, which makes sense, of course, since the equation linking the autocorrelations is a simple difference equation.
    [10] We used the device in Section 19.4.4 to find the characteristic roots. For a second-order equation, the quadratic is easy to manipulate.


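    Applications that involve autoregressions of order greater than two are relatively unusual. Nonetheless, higher-order models can be handled in the same fashion. For the AR($p$) process $y_t = \gamma_1 y_{t-1} + \gamma_2 y_{t-2} + \cdots + \gamma_p y_{t-p} + \varepsilon_t$, the autocovariances will obey the Yule-Walker equations

    $$\lambda_0 = \gamma_1\lambda_1 + \gamma_2\lambda_2 + \cdots + \gamma_p\lambda_p + \sigma_\varepsilon^2,$$
    $$\lambda_1 = \gamma_1\lambda_0 + \gamma_2\lambda_1 + \cdots + \gamma_p\lambda_{p-1},$$

    and so on. The autocorrelations will once again follow the same difference equation as the original series, $\rho_k = \gamma_1\rho_{k-1} + \gamma_2\rho_{k-2} + \cdots + \gamma_p\rho_{k-p}$. The ACF for a moving-average process is very simple to obtain. For the first-order process, $y_t = \varepsilon_t - \theta\varepsilon_{t-1}$, $\lambda_0 = (1 + \theta^2)\sigma_\varepsilon^2$, $\lambda_1 = -\theta\sigma_\varepsilon^2$, then $\lambda_k = 0$ for $|k| > 1$. Higher-order processes appear similarly. For the MA(2) process, by multiplying out the terms and taking expectations, we find that

    $$\lambda_0 = (1 + \theta_1^2 + \theta_2^2)\sigma_\varepsilon^2,$$
    $$\lambda_1 = (-\theta_1 + \theta_1\theta_2)\sigma_\varepsilon^2,$$
    $$\lambda_2 = -\theta_2\sigma_\varepsilon^2,$$
    $$\lambda_k = 0, \quad |k| > 2.$$

    The pattern for the general MA($q$) process $y_t = \varepsilon_t - \theta_1\varepsilon_{t-1} - \theta_2\varepsilon_{t-2} - \cdots - \theta_q\varepsilon_{t-q}$ is analogous. The signature of a moving-average process is an autocorrelation function that abruptly drops to zero at one lag past the order of the process. As we will explore below, this sharp distinction provides a statistical tool that will help us distinguish between these two types of processes empirically. The mixed process, ARMA($p, q$), is more complicated since it is a mixture of the two forms. For the ARMA(1, 1) process $y_t = \gamma y_{t-1} + \varepsilon_t - \theta\varepsilon_{t-1}$, the Yule-Walker equations are

    $$\lambda_0 = E[y_t(\gamma y_{t-1} + \varepsilon_t - \theta\varepsilon_{t-1})] = \gamma\lambda_1 + \sigma_\varepsilon^2 - \sigma_\varepsilon^2(\theta\gamma - \theta^2),$$
    $$\lambda_1 = \gamma\lambda_0 - \theta\sigma_\varepsilon^2,$$

    and $\lambda_k = \gamma\lambda_{k-1}$, $k > 1$. The general characteristic of ARMA processes is that when the moving-average component is of order $q$, then in the series of autocorrelations there will be an initial $q$ terms that are complicated functions of both the AR and MA parameters, but after $q$ periods, $\rho_k = \gamma_1\rho_{k-1} + \gamma_2\rho_{k-2} + \cdots + \gamma_p\rho_{k-p}$, $k > q$.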
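The "multiplying out the terms and taking expectations" step for the MA($q$) case amounts to summing products of coefficients on the shared disturbances. A minimal sketch follows, assuming the sign convention $y_t = \varepsilon_t - \theta_1\varepsilon_{t-1} - \cdots - \theta_q\varepsilon_{t-q}$ used above; the function name is illustrative.

```python
def ma_autocovariances(thetas, sigma2=1.0):
    """Autocovariances lambda_0, ..., lambda_q of an MA(q) process.

    The process is y_t = e_t - theta_1 e_{t-1} - ... - theta_q e_{t-q}.
    Autocovariances beyond lag q are zero and are not returned.
    """
    psi = [1.0] + [-t for t in thetas]  # coefficients on e_t, e_{t-1}, ..., e_{t-q}
    q = len(thetas)
    return [
        sigma2 * sum(psi[i] * psi[i + k] for i in range(q + 1 - k))
        for k in range(q + 1)
    ]
```

For the MA(2) case this reproduces $\lambda_0 = (1 + \theta_1^2 + \theta_2^2)\sigma_\varepsilon^2$, $\lambda_1 = (-\theta_1 + \theta_1\theta_2)\sigma_\varepsilon^2$, and $\lambda_2 = -\theta_2\sigma_\varepsilon^2$.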


    20.2.4 PARTIAL AUTOCORRELATIONS OF A STATIONARY STOCHASTIC PROCESS


    The autocorrelation function $\mathrm{ACF}(k)$ gives the gross correlation between $y_t$ and $y_{t-k}$. But as we saw in our analysis of the classical regression model in Section 3.4, a gross correlation such as this one can mask a completely different underlying relationship. In this setting, we observe, for example, that a correlation between $y_t$ and $y_{t-2}$ could arise primarily because both variables are correlated with $y_{t-1}$. Consider the AR(1) process $y_t = \gamma y_{t-1} + \varepsilon_t$. The second gross autocorrelation is $\rho_2 = \gamma^2$. But in the same spirit, we might ask what is the correlation between $y_t$ and $y_{t-2}$ net of the intervening effect of $y_{t-1}$? In this model, if we remove the effect of $y_{t-1}$ from $y_t$, then only $\varepsilon_t$ remains, and this disturbance is uncorrelated with $y_{t-2}$. We would conclude that the partial autocorrelation between $y_t$ and $y_{t-2}$ in this model is zero.

    DEFINITION 20.2 Partial Autocorrelation Coefficient
    The partial correlation between $y_t$ and $y_{t-k}$ is the simple correlation between $y_{t-k}$ and $y_t$ minus that part explained linearly by the intervening lags. That is,

    $$\rho_k^* = \operatorname{Corr}[y_t - E^*(y_t \mid y_{t-1}, \ldots, y_{t-k+1}),\; y_{t-k}],$$

    where $E^*(y_t \mid y_{t-1}, \ldots, y_{t-k+1})$ is the minimum mean-squared error predictor of $y_t$ by $y_{t-1}, \ldots, y_{t-k+1}$.

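    The function $E^*(\cdot)$ might be the linear regression if the conditional mean happened to be linear, but it might not. The optimal linear predictor is the linear regression, however, so what we have is

    $$\rho_k^* = \operatorname{Corr}[y_t - \beta_1 y_{t-1} - \beta_2 y_{t-2} - \cdots - \beta_{k-1} y_{t-k+1},\; y_{t-k}],$$

    where $\boldsymbol\beta = [\beta_1, \beta_2, \ldots, \beta_{k-1}]' = \{\operatorname{Var}[y_{t-1}, y_{t-2}, \ldots, y_{t-k+1}]\}^{-1} \operatorname{Cov}[y_t, (y_{t-1}, y_{t-2}, \ldots, y_{t-k+1})]'$. This equation will be recognized as a vector of regression coefficients. As such, what we are computing here (of course) is the correlation between a vector of residuals and $y_{t-k}$. There are various ways to formalize this computation [see, e.g., Enders (1995, pp. 82-85)]. One intuitively appealing approach is suggested by the equivalent definition (which is also a prescription for computing it), as follows.

    DEFINITION 20.3 Partial Autocorrelation Coefficient
    The partial correlation between $y_t$ and $y_{t-k}$ is the last coefficient in the linear projection of $y_t$ on $[y_{t-1}, y_{t-2}, \ldots, y_{t-k}]$,

    $$\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_{k-1} \\ \rho_k^* \end{bmatrix} = \begin{bmatrix} \lambda_0 & \lambda_1 & \cdots & \lambda_{k-2} & \lambda_{k-1} \\ \lambda_1 & \lambda_0 & \cdots & \lambda_{k-3} & \lambda_{k-2} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \lambda_{k-1} & \lambda_{k-2} & \cdots & \lambda_1 & \lambda_0 \end{bmatrix}^{-1} \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_k \end{bmatrix}.$$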
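Definition 20.3 is directly computable. The sketch below, assuming NumPy, solves the projection system lag by lag and keeps the last coefficient. It works with autocorrelations rather than autocovariances, which is an equivalent scaling because dividing both sides of the system by $\lambda_0$ leaves the solution unchanged. The function name is hypothetical.

```python
import numpy as np

def pacf_from_acf(rho, kmax):
    """Partial autocorrelations rho*_1, ..., rho*_kmax from rho_0, ..., rho_kmax.

    For each k, solve R b = r, where R is the k x k matrix with entries
    rho_{|i-j|} and r = (rho_1, ..., rho_k)'. The last element of b is rho*_k.
    """
    rho = np.asarray(rho, dtype=float)
    out = []
    for k in range(1, kmax + 1):
        R = np.array([[rho[abs(i - j)] for j in range(k)] for i in range(k)])
        b = np.linalg.solve(R, rho[1:k + 1])
        out.append(float(b[-1]))
    return out
```

Feeding in the AR(1) autocorrelations $\rho_k = \gamma^k$ should return $\gamma$ at lag 1 and (numerically) zero afterward, in line with the AR(1) discussion above.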


    As before, there are some distinctive patterns for particular time-series processes. Consider first the autoregressive processes, $y_t = \gamma_1 y_{t-1} + \gamma_2 y_{t-2} + \cdots + \gamma_p y_{t-p} + \varepsilon_t$. We are interested in the last coefficient in the projection of $y_t$ on $y_{t-1}$, then on $[y_{t-1}, y_{t-2}]$, and so on. The first of these is the simple regression coefficient of $y_t$ on $y_{t-1}$, so

    $$\rho_1^* = \frac{\operatorname{Cov}[y_t, y_{t-1}]}{\operatorname{Var}[y_{t-1}]} = \frac{\lambda_1}{\lambda_0} = \rho_1.$$

    The first partial autocorrelation coefficient for any process equals the first autocorrelation coefficient.
    Without doing the messy algebra, we also observe that for the AR($p$) process, $\rho_1^*$ is a mixture of all the $\gamma$ coefficients. Of course, if $p$ equals 1, then $\rho_1^* = \rho_1 = \gamma$. For the higher-order processes, the partial autocorrelations are likewise mixtures of the autoregressive coefficients until we reach $\rho_p^*$. In view of the form of the AR($p$) model, the last coefficient in the linear projection on $p$ lagged values, $\rho_p^*$, is $\gamma_p$. Also, we can see the signature pattern of the AR($p$) process: any additional partial autocorrelations must be zero, because they will be simply $\rho_k^* = \operatorname{Corr}[\varepsilon_t, y_{t-k}] = 0$ if $k > p$. Combining results thus far, we have the characteristic pattern for an autoregressive process. The ACF, $\rho_k$, will gradually decay to zero, either monotonically if the characteristic roots are real or in a sinusoidal pattern if they are complex. The PACF, $\rho_k^*$, will be irregular out to lag $p$, when it abruptly drops to zero and remains there. The moving-average process has the mirror image of this pattern. We have already examined the ACF for the MA($q$) process; it has $q$ irregular spikes, then it falls to zero and stays there. For the PACF, write the model as

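    $$y_t = (1 - \theta_1 L - \theta_2 L^2 - \cdots - \theta_q L^q)\varepsilon_t.$$

    If the series is invertible, which we will assume throughout, then we have

    $$\frac{y_t}{1 - \theta_1 L - \cdots - \theta_q L^q} = \varepsilon_t,$$

    or

    $$y_t = \pi_1 y_{t-1} + \pi_2 y_{t-2} + \cdots + \varepsilon_t = \sum_{i=1}^{\infty} \pi_i y_{t-i} + \varepsilon_t.$$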

    The autoregressive form of the MA($q$) process has an infinite number of terms, which means that the PACF will not fall off to zero the way that the PACF of the AR process does. Rather, the PACF of an MA process will resemble the ACF of an AR process. For example, for the MA(1) process $y_t = \varepsilon_t - \theta\varepsilon_{t-1}$, the AR representation is $y_t = -\theta y_{t-1} - \theta^2 y_{t-2} - \cdots + \varepsilon_t$, whose coefficients decline geometrically in the familiar manner of an AR(1) process. Thus, the PACF of an MA(1) process decays geometrically like the ACF of an AR(1) process: $\rho_k^* = -\theta^k(1 - \theta^2)/(1 - \theta^{2(k+1)})$, so that $|\rho_k^*| < |\theta|^k$. The ARMA($p, q$) is a mixture of the two types of processes, so its ACF and PACF are likewise mixtures of the two forms discussed above. Generalities are difficult to


    draw, but normally, the ACF of an ARMA process will have a few distinctive spikes in the early lags corresponding to the number of MA terms, followed by the characteristic smooth pattern of the AR part of the model. High-order MA processes are relatively uncommon in general, and high-order AR processes (greater than two) seem primarily to arise in the form of the nonstationary processes described in the next section. For a stationary process, the workhorses of the applied literature are the (2, 0) and (1, 1) processes. For the ARMA(1, 1) process, both the ACF and the PACF will display a distinctive spike at lag 1 followed by an exponentially decaying pattern thereafter.
    20.2.5 MODELING UNIVARIATE TIME SERIES

    The preceding discussion is largely descriptive. There is no underlying economic theory that states why a compact ARMA($p, q$) representation should adequately describe the movement of a given economic time series. Nonetheless, as a methodology for building forecasting models, this set of tools and its empirical counterpart have proved as good as and even superior to much more elaborate specifications (perhaps to the consternation of the builders of large macroeconomic models).[11] Box and Jenkins (1984) pioneered a forecasting framework based on the preceding that has been used in a great many fields and that has, certainly in terms of numbers of applications, largely supplanted the use of large integrated econometric models. Box and Jenkins's approach to modeling a stochastic process can be motivated by the following.

    THEOREM 20.1 Wold's Decomposition Theorem
    Every zero-mean covariance stationary stochastic process can be represented in the form

    $$y_t = E^*[y_t \mid y_{t-1}, y_{t-2}, \ldots, y_{t-p}] + \sum_{i=0}^{\infty} \pi_i\varepsilon_{t-i},$$

    where $\varepsilon_t$ is white noise, $\pi_0 = 1$, and the weights are square summable, that is,

    $$\sum_{i=1}^{\infty} \pi_i^2 < \infty,$$

    $E^*[y_t \mid y_{t-1}, y_{t-2}, \ldots, y_{t-p}]$ is the optimal linear predictor of $y_t$ based on its lagged values, and the predictor $E_t^*$ is uncorrelated with $\varepsilon_{t-i}$.

    Thus, the theorem decomposes the process generating $y_t$ into

    $$E_t^* = E^*[y_t \mid y_{t-1}, y_{t-2}, \ldots, y_{t-p}] = \text{the linearly deterministic component}$$
    [11] This observation can be overstated. Even the most committed advocate of the Box-Jenkins methods would concede that an ARMA model of, for example, housing starts will do little to reveal the link between the interest rate policies of the Federal Reserve and their variable of interest. That is, the covariation of economic variables remains as interesting as ever.


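    and

    $$\sum_{i=0}^{\infty} \pi_i\varepsilon_{t-i} = \text{the linearly indeterministic component.}$$

    The theorem states that for any stationary stochastic process, for a given choice of $p$, there is a Wold representation of the stationary series

    $$y_t = \sum_{i=1}^{p} \gamma_i y_{t-i} + \sum_{i=0}^{\infty} \pi_i\varepsilon_{t-i}.$$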

    Note that for a specific ARMA($P, Q$) process, if $p \ge P$, then $\pi_i = 0$ for $i > Q$. For practical purposes, the problem with the Wold representation is that we cannot estimate the infinite number of parameters needed to produce the full right-hand side, and, of course, $P$ and $Q$ are unknown. The compromise, then, is to base an estimate of the representation on a model with a finite number of moving-average terms. We can seek the one that best fits the data in hand. It is important to note that neither the ARMA representation of a process nor the Wold representation is unique. In general terms, suppose that the process generating $y_t$ is

    $$\Gamma(L)y_t = \Theta(L)\varepsilon_t.$$

    We assume that $\Gamma(L)$ is finite but $\Theta(L)$ need not be. Let $\Phi(L)$ be some other polynomial in the lag operator with roots that are outside the unit circle. Then

    $$\frac{\Phi(L)}{\Gamma(L)}\Gamma(L)y_t = \frac{\Phi(L)}{\Gamma(L)}\Theta(L)\varepsilon_t$$

    or

    $$\Phi(L)y_t = \Pi(L)\varepsilon_t.$$

    The new representation is fully equivalent to the old one, but it might have a different number of autoregressive parameters, which is exactly the point of the Wold decomposition. The implication is that part of the model-building process will be to determine the lag structures. Further discussion on the methodology is given by Box and Jenkins (1984). The Box-Jenkins approach to modeling stochastic processes consists of the following steps:

    1. Satisfactorily transform the data so as to obtain a stationary series. This step will usually mean taking first differences, logs, or both to obtain a series whose autocorrelation function eventually displays the characteristic exponential decay of a stationary series.
    2. Estimate the parameters of the resulting ARMA model, generally by nonlinear least squares.
    3. Generate the set of residuals from the estimated model and verify that they satisfactorily resemble a white noise series. If not, respecify the model and return to step 2.
    4. The model can now be used for forecasting purposes.
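Steps 2 through 4 can be sketched for the special case of a pure AR($p$) model, where linear least squares suffices; a model with moving-average terms would need a nonlinear estimator, which this sketch does not attempt. It assumes NumPy, and the function names are hypothetical.

```python
import numpy as np

def fit_ar_ols(y, p):
    """Step 2 for a pure AR(p): regress y_t on [1, y_{t-1}, ..., y_{t-p}].

    Returns (b, residuals), where b[0] is the constant and b[1:] are the
    lag coefficients. The residuals are the input to the step 3 diagnostics.
    """
    y = np.asarray(y, dtype=float)
    T = len(y)
    columns = [np.ones(T - p)] + [y[p - i:T - i] for i in range(1, p + 1)]
    X = np.column_stack(columns)
    b, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return b, y[p:] - X @ b

def forecast_one_step(y, b):
    """Step 4: one-step-ahead forecast from the fitted AR coefficients."""
    y = np.asarray(y, dtype=float)
    p = len(b) - 1
    lags = y[-1:-p - 1:-1]  # y_T, y_{T-1}, ..., y_{T-p+1}
    return float(b[0] + np.dot(b[1:], lags))
```

Step 1 would be applied before calling `fit_ar_ols`, for example by passing `np.diff(np.log(x))` for a positive trending series `x`.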


    Space limitations prevent us from giving a full presentation of the set of techniques. Since this methodology has spawned a mini-industry of its own, however, there is no shortage of book-length analyses and prescriptions to which the reader may refer. Five to consider are the canonical source, Box and Jenkins (1984), Granger and Newbold (1986), Mills (1993), Enders (1995), and Patterson (2000). Some of the aspects of the estimation and analysis steps do have broader relevance for our work here, so we will continue to examine them in some detail.
    20.2.6 ESTIMATION OF THE PARAMETERS OF A UNIVARIATE TIME SERIES

    The broad problem of regression estimation with time-series data, which carries through to all the discussions of this chapter, is that the consistency and asymptotic normality results that we derived based on random sampling will no longer apply. For example, for a stationary series, we have assumed that $\operatorname{Var}[y_t] = \lambda_0$ regardless of $t$. But we have yet to establish that an estimated variance,

    $$c_0 = \frac{1}{T - 1}\sum_{t=1}^{T} (y_t - \bar y)^2,$$

    will converge to $\lambda_0$, or anything else for that matter. It is necessary to assume that the process is ergodic. (We first encountered this assumption in Section 12.4.1; see Definition 12.3.) Ergodicity is a crucial element of our theory of estimation. When a time series has this property (with stationarity), then we can consider estimation of parameters in a meaningful sense. If the process is stationary and ergodic then, by the Ergodic Theorem (Theorems 12.1 and 12.2), moments such as $\bar y$ and $c_0$ converge to their population counterparts $\mu$ and $\lambda_0$.[12] The essential component of the condition is one that we have met at many points in this discussion, that autocovariances must decline sufficiently rapidly as the separation in time increases. It is possible to construct theoretical examples of processes that are stationary but not ergodic, but for practical purposes, a stationarity assumption will be sufficient for us to proceed with estimation. For example, in our models of stationary processes, if we assume that $\varepsilon_t \sim N[0, \sigma^2]$, which is common, then the stationary processes are ergodic as well. Estimation of the parameters of a time-series process must begin with a determination of the type of process that we have in hand. (Box and Jenkins label this the identification step. But identification is a term of art in econometrics, so we will steer around that admittedly standard name.) For this purpose, the empirical estimates of the autocorrelation and partial autocorrelation functions are useful tools. The sample counterpart to the ACF is the correlogram,

    $$r_k = \frac{\sum_{t=k+1}^{T} (y_t - \bar y)(y_{t-k} - \bar y)}{\sum_{t=1}^{T} (y_t - \bar y)^2}.$$



    A plot of $r_k$ against $k$ provides a description of a process and can be used to help discern what type of process is generating the data. The sample PACF is the counterpart to the
    [12] The formal conditions for ergodicity are quite involved; see Davidson and MacKinnon (1993) or Hamilton (1994, Chapter 7).


    ACF, but net of the intervening lags; that is,

    $$r_k^* = \frac{\sum_{t=k+1}^{T} y_t^* y_{t-k}^*}{\sum_{t=k+1}^{T} (y_{t-k}^*)^2},$$

    where $y_t^*$ and $y_{t-k}^*$ are residuals from the regressions of $y_t$ and $y_{t-k}$ on $[1, y_{t-1}, y_{t-2}, \ldots, y_{t-k+1}]$. We have seen this at many points before; $r_k^*$ is simply the last linear least squares regression coefficient in the regression of $y_t$ on $[1, y_{t-1}, y_{t-2}, \ldots, y_{t-k+1}, y_{t-k}]$. Plots of the ACF and PACF of a series are usually presented together. Since the sample estimates of the autocorrelations and partial autocorrelations are not likely to be identically zero even when the population values are, we use diagnostic tests to discern whether a time series appears to be nonautocorrelated.[13] Individual sample autocorrelations will be approximately distributed with mean zero and variance $1/T$ under the hypothesis that the series is white noise. The Box-Pierce (1970) statistic

    $$Q = T\sum_{k=1}^{p} r_k^2$$

    is commonly used to test whether a series is white noise. Under the null hypothesis that the series is white noise, $Q$ has a limiting chi-squared distribution with $p$ degrees of freedom. A refinement that appears to have better finite-sample properties is the Ljung-Box (1979) statistic,

    $$Q' = T(T + 2)\sum_{k=1}^{p} \frac{r_k^2}{T - k}.$$

    The limiting distribution of $Q'$ is the same as that of $Q$. The process of finding the appropriate specification is essentially trial and error. An initial specification based on the sample ACF and PACF can be found. The parameters of the model can then be estimated by least squares. For pure AR($p$) processes, the estimation step is simple. The parameters can be estimated by linear least squares. If there are moving-average terms, then linear least squares is inconsistent, but the parameters of the model can be fit by nonlinear least squares. Once the model has been estimated, a set of residuals is computed to assess the adequacy of the specification. In an AR model, the residuals are just the deviations from the regression line. The adequacy of the specification can be examined by applying the foregoing techniques to the estimated residuals. If they appear satisfactorily to mimic a white noise process, then analysis can proceed to the forecasting step. If not, a new specification should be considered.
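The correlogram and the two portmanteau statistics are short computations. The following sketch assumes NumPy; the function names are illustrative. It returns only the statistics, since the chi-squared tail probability would need a statistics library and is left out here.

```python
import numpy as np

def sample_acf(y, nlags):
    """Sample autocorrelations r_1, ..., r_nlags (the correlogram)."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    denom = np.sum(d * d)
    # For lag k, pair (y_t - ybar) with (y_{t-k} - ybar) for t = k+1, ..., T.
    return np.array([np.sum(d[k:] * d[:len(d) - k]) / denom
                     for k in range(1, nlags + 1)])

def box_pierce(r, T):
    """Q = T * sum of squared sample autocorrelations."""
    return float(T * np.sum(np.square(r)))

def ljung_box(r, T):
    """Q' = T (T + 2) * sum over k of r_k^2 / (T - k)."""
    k = np.arange(1, len(r) + 1)
    return float(T * (T + 2) * np.sum(np.square(r) / (T - k)))
```

As a rough check against Table 20.1, with $T = 60$ monthly observations and $r_1 = 0.970$, `box_pierce([0.970], 60)` gives about 56.45, close to the 56.42 reported at lag 1; the small gap presumably reflects rounding of $r_1$ in the table.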
    Example 20.1 ACF and PACF for a Series of Bond Yields

    Appendix Table F20.1 lists 5 years of monthly averages of the yield on a Moody's Aaa rated corporate bond. The series is plotted in Figure 20.1. From the figure, it would appear that stationarity may not be a reasonable assumption. We will return to this question below. The ACF and PACF for the original series are shown in Table 20.1, with the diagnostic statistics discussed earlier. The plots appear to be consistent with an AR(2) process, although the ACF at longer lags seems a bit more persistent than might have been expected. Once again, this condition
    [13] The LM test discussed in Section 12.7.1 is one of these.


    may indicate that the series is not stationary. Maintaining that assumption for the present, we computed the residuals from the AR(2) model and subjected them to the same tests as the original series. The coefficients of the AR(2) model are 1.1566 and $-0.2083$, which also satisfy the restrictions for stationarity given in Section 20.2.2. Despite the earlier suggestions, the residuals do appear to resemble a white noise series (see Table 20.2).
    FIGURE 20.1 Monthly Data on Bond Yields. [Figure: yield, scaled by 10 (vertical axis, ticks from 0.65 to 1.00), plotted against month (horizontal axis, 1990.1 to 1994.12).]

    TABLE 20.1 ACF and PACF for Bond Yields

    Time-series identification for YIELD
    Box-Pierce statistic = 323.0587     Box-Ljung statistic = 317.4389
    Degrees of freedom   = 14           Degrees of freedom  = 14
    Significance level   = 0.0000       Significance level  = 0.0000
    * => |coefficient| > 2/sqrt(N) or > 95% significant

    Lag    Autocorrelation Function    Box-Pierce    Partial Autocorrelations
      1            0.970*                 56.42*             0.970*
      2            0.908*                105.93*             0.573*
      3            0.840*                148.29*             0.157
      4            0.775*                184.29*             0.043
      5            0.708*                214.35*             0.309*
      6            0.636*                238.65*             0.024
      7            0.567*                257.93*             0.037
      8            0.501*                272.97*             0.059
      9            0.439*                284.51*             0.068
     10            0.395*                293.85*             0.216
     11            0.370*                302.08*             0.180
     12            0.354*                309.58*             0.048
     13            0.339*                316.48*             0.162
     14            0.331*                323.06*             0.171


    TABLE 20.2 ACF and PACF for Residuals

    Time-series identification for U
    Box-Pierce statistic = 13.7712      Box-Ljung statistic = 16.1336
    Significance level   = 0.4669       Significance level  = 0.3053
    * => |coefficient| > 2/sqrt(N) or > 95% significant

    Lag    Autocorrelation Function    Box-Pierce    Partial Autocorrelations
      1            0.154                   1.38              0.154
      2            0.147                   2.64              0.170
      3            0.207                   5.13              0.179
      4            0.161                   6.64              0.183
      5            0.117                   7.43              0.068
      6            0.114                   8.18              0.094
      7            0.110                   8.89              0.066
      8            0.041                   8.99              0.125
      9            0.168                  10.63              0.258
     10            0.014                  10.64              0.035
     11            0.016                  10.66              0.015
     12            0.009                  10.66              0.089
     13            0.195                  12.87              0.166
     14            0.125                  13.77              0.132

    20.2.7 THE FREQUENCY DOMAIN

    For the analysis of macroeconomic flow data such as output and consumption, and aggregate economic index series such as the price level and the rate of unemployment, the tools described in the previous sections have proved quite satisfactory. The low frequency of observation (yearly, quarterly, or, occasionally, monthly) and very significant aggregation (both across time and of individuals) make these data relatively smooth and straightforward to analyze. Much contemporary economic analysis, especially financial econometrics, has dealt with more disaggregated, microlevel data, observed at far greater frequency. Some important examples are stock market data for which daily returns data are routinely available, and exchange rate movements, which have been tabulated on an almost continuous basis. In these settings, analysts have found that the tools of spectral analysis, and the frequency domain, have provided many useful results and have been applied to great advantage. This section introduces a small amount of the terminology of spectral analysis to acquaint the reader with a few basic features of the technique. For those who desire further detail, Fuller (1976), Granger and Newbold (1996), Hamilton (1994), Chatfield (1996), Shumway (1988), and Hatanaka (1996) (among many others with direct application in economics) are excellent introductions. Most of the following is based on Chapter 6 of Hamilton (1994). In this framework, we view an observed time series as a weighted sum of underlying series that have different cyclical patterns. For example, aggregate retail sales and construction data display several different kinds of cyclical variation, including a regular seasonal pattern and longer frequency variation associated with variation in the economy as a whole over the business cycle. The total variance of an observed time series may thus be viewed as a sum of the contributions of these underlying series, which vary


    at different frequencies. The standard application we consider is how spectral analysis is used to decompose the variance of a time series.
    20.2.7.a Theoretical Results

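    Let $\{y_t\}_{t=-\infty}^{\infty}$ define a zero-mean, stationary time-series process. The autocovariance at lag $k$ was defined in Section 20.2.2 as $\lambda_k = \lambda_{-k} = \operatorname{Cov}[y_t, y_{t+k}]$. We assume that the series $\lambda_k$ is absolutely summable; $\sum_{k=0}^{\infty} |\lambda_k|$ is finite. The autocovariance generating function for this time-series process is

    $$g_Y(z) = \sum_{k=-\infty}^{\infty} \lambda_k z^k.$$

    We evaluate this function at the complex value $z = \exp(-i\omega)$, where $i = \sqrt{-1}$ and $\omega$ is a real number, and divide by $2\pi$ to obtain the spectrum, or spectral density function, of the time-series process,

    $$h_Y(\omega) = \frac{1}{2\pi}\sum_{k=-\infty}^{\infty} \lambda_k e^{-i\omega k}. \qquad (20\text{-}16)$$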

    The spectral density function is a characteristic of the time-series process very much like the sequence of autocovariances (or the sequence of moments for a probability distribution). For a time-series process that has the set of autocovariances $\lambda_k$, the spectral density can be computed at any particular value of $\omega$. Several results can be combined to simplify $h_Y(\omega)$:

    1. Symmetry of the autocovariances, $\lambda_k = \lambda_{-k}$;
    2. DeMoivre's theorem, $\exp(\pm i\omega k) = \cos(\omega k) \pm i\sin(\omega k)$;
    3. Polar values, $\cos(0) = 1$, $\cos(\pi/2) = 0$, $\sin(0) = 0$, $\sin(\pi/2) = 1$;
    4. Symmetries of sin and cos functions, $\sin(-\omega) = -\sin(\omega)$ and $\cos(-\omega) = \cos(\omega)$.

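    One of the convenient consequences of result 2 is $\exp(i\omega k) + \exp(-i\omega k) = 2\cos(\omega k)$, which is always real. These equations can be combined to simplify the spectrum,

    $$h_Y(\omega) = \frac{1}{2\pi}\left[\lambda_0 + 2\sum_{k=1}^{\infty} \lambda_k\cos(\omega k)\right], \quad \omega \in [0, \pi]. \qquad (20\text{-}17)$$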

    This is a strictly real-valued, continuous function of $\omega$. Since the cosine function is cyclic with period $2\pi$, $h_Y(\omega) = h_Y(\omega + M2\pi)$ for any integer $M$, which implies that the entire spectrum is known if its values for $\omega$ from $-\pi$ to $\pi$ are known. Since $\cos(-\omega) = \cos(\omega)$, $h_Y(\omega) = h_Y(-\omega)$, so the values of the spectrum for $\omega$ from 0 to $\pi$ are the same as those from 0 to $-\pi$. There is also a correspondence between the spectrum and the autocovariances,

    $$\lambda_k = \int_{-\pi}^{\pi} h_Y(\omega)\cos(k\omega)\,d\omega,$$

    which we can interpret as indicating that the sequence of autocovariances and the spectral density function just produce two different ways of looking at the same


    time-series process (in the first case, in the "time domain," and in the second case, in the "frequency domain," hence the name for this analysis). The spectral density function is a function of the infinite sequence of autocovariances. For ARMA processes, however, the autocovariances are functions of the usually small numbers of parameters, so $h_Y(\omega)$ will generally simplify considerably. For the ARMA($p, q$) process defined in (20-6), $y_t = \gamma_1 y_{t-1} + \cdots + \gamma_p y_{t-p} + \varepsilon_t - \theta_1\varepsilon_{t-1} - \cdots - \theta_q\varepsilon_{t-q}$, or $\Gamma(L)y_t = \Theta(L)\varepsilon_t$, the autocovariance generating function is

    $$g_Y(z) = \sigma^2\,\frac{\Theta(z)\Theta(1/z)}{\Gamma(z)\Gamma(1/z)} = \sigma^2\,\Pi(z)\Pi(1/z),$$

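    where $\Pi(z)$ gives the sequence of coefficients in the infinite moving-average representation of the series, $\Theta(z)/\Gamma(z)$. See, for example, 201, where this result is derived for the ARMA(2, 1) process. In some cases, this result can be used explicitly to derive the spectral density function. The spectral density function can be obtained from this relationship through

    $$h_Y(\omega) = \frac{\sigma^2}{2\pi}\,\Pi(e^{-i\omega})\,\Pi(e^{i\omega}).$$

    Example 20.2 Spectral Density Function for an AR(1) Process

    For an AR(1) process with autoregressive parameter $\rho$, $y_t = \rho y_{t-1} + \varepsilon_t$, $\varepsilon_t \sim N[0, 1]$, the lag polynomials are $\Theta(z) = 1$ and $\Gamma(z) = 1 - \rho z$. The autocovariance generating function is

    $$g_Y(z) = \frac{\sigma^2}{(1 - \rho z)(1 - \rho/z)} = \frac{\sigma^2}{1 + \rho^2 - \rho(z + 1/z)} = \frac{\sigma^2}{1 + \rho^2}\sum_{i=0}^{\infty}\left(\frac{\rho}{1 + \rho^2}\right)^i\left(\frac{1 + z^2}{z}\right)^i.$$

    The spectral density function is

    $$h_Y(\omega) = \frac{\sigma^2}{2\pi}\,\frac{1}{[1 - \rho\exp(-i\omega)][1 - \rho\exp(i\omega)]} = \frac{\sigma^2}{2\pi}\,\frac{1}{1 + \rho^2 - 2\rho\cos(\omega)}.$$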
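The closed form for the AR(1) spectrum can be compared with the cosine-series form of the spectrum built from the autocovariances $\lambda_k = \sigma^2\rho^k/(1 - \rho^2)$. This is a sketch in plain Python with illustrative function names; the infinite sum is truncated at a finite number of lags, which is an approximation that is harmless when $|\rho|$ is well below one.

```python
import math

def ar1_spectrum(omega, rho, sigma2=1.0):
    """Closed-form spectral density of y_t = rho * y_{t-1} + e_t."""
    return sigma2 / (2.0 * math.pi * (1.0 + rho * rho - 2.0 * rho * math.cos(omega)))

def spectrum_from_autocov(omega, lam):
    """(1 / 2 pi) [lam_0 + 2 sum_k lam_k cos(omega k)], truncated at len(lam) - 1 lags."""
    tail = sum(lam[k] * math.cos(omega * k) for k in range(1, len(lam)))
    return (lam[0] + 2.0 * tail) / (2.0 * math.pi)

def ar1_autocov(rho, nlags, sigma2=1.0):
    """lam_k = sigma2 * rho^k / (1 - rho^2) for k = 0, ..., nlags."""
    lam0 = sigma2 / (1.0 - rho * rho)
    return [lam0 * rho ** k for k in range(nlags + 1)]
```

With $\rho = 0.5$ and 200 lags the two computations should agree to many decimal places at any $\omega$ in $[0, \pi]$. For positive $\rho$ the density is largest at $\omega = 0$, so most of the variance sits at low frequencies.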

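    For the general case suggested at the outset, $\Gamma(L)y_t = \Theta(L)\varepsilon_t$, there is a template we can use, which, if not simple, is at least transparent. Let $\alpha_i$ be the reciprocal of a root of the characteristic polynomial for the autoregressive part of the model, $i = 1, \ldots, p$, and let $\delta_j$, $j = 1, \ldots, q$, be the same for the moving-average part of the model. Then

    $$h_Y(\omega) = \frac{\sigma^2}{2\pi}\,\frac{\prod_{j=1}^{q}\left[1 + \delta_j^2 - 2\delta_j\cos(\omega)\right]}{\prod_{i=1}^{p}\left[1 + \alpha_i^2 - 2\alpha_i\cos(\omega)\right]}.$$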




    Some of the roots of either polynomial may be complex pairs, but in this case, the product for adjacent pairs $(a \pm bi)$ is real, so the function is always real valued. Note also that $(a \pm bi)^{-1} = (a \mp bi)/(a^2 + b^2)$. For purposes of our initial objective, decomposing the variance of the time series, our final useful theoretical result is

    $$\int_{-\pi}^{\pi} h_Y(\omega)\,d\omega = \lambda_0.$$

    Thus, the total variance can be viewed as the sum of the spectral densities over all possible frequencies. (More precisely, it is the area under the spectral density.) Once again exploiting the symmetry of the cosine function, we can rewrite this equation in the form

    $$2\int_{0}^{\pi} h_Y(\omega)\,d\omega = \lambda_0.$$

    Consider, then, integration over only some of the frequencies;

    $$\frac{2}{\lambda_0}\int_{0}^{\omega_j} h_Y(\omega)\,d\omega = \tau(\omega_j), \quad 0 < \omega_j \le \pi, \quad 0 < \tau(\omega_j) \le 1.$$

    Thus, $\tau(\omega_j)$ can be interpreted as the proportion of the total variance of the time series that is associated with frequencies less than or equal to $\omega_j$.
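The variance share $\tau(\omega_j)$ can be approximated by numerical integration. The sketch below does this for an AR(1) spectrum with unit disturbance variance, using a simple trapezoid rule in plain Python. The function names, the choice of the AR(1) example, and the number of grid points are assumptions made for illustration.

```python
import math

def ar1_spectrum(omega, rho):
    """Spectral density of an AR(1) process with sigma^2 = 1."""
    return 1.0 / (2.0 * math.pi * (1.0 + rho * rho - 2.0 * rho * math.cos(omega)))

def variance_share(omega_j, rho, n=20000):
    """tau(omega_j): twice the area under h_Y on [0, omega_j], divided by lambda_0."""
    step = omega_j / n
    area = 0.5 * (ar1_spectrum(0.0, rho) + ar1_spectrum(omega_j, rho))
    area += sum(ar1_spectrum(i * step, rho) for i in range(1, n))
    area *= step
    lam0 = 1.0 / (1.0 - rho * rho)  # variance of the AR(1) process
    return 2.0 * area / lam0
```

At $\omega_j = \pi$ the share should be one. For $\rho = 0.5$, frequencies up to $\pi/2$ account for roughly 80 percent of the variance, which reflects the concentration of a positively autocorrelated series at low frequencies.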
    20.2.7.b Empirical Counterparts

    We have in hand a sample of observations, $y_t$, $t = 1, \ldots, T$. The first task is to establish a correspondence between the frequencies $0 < \omega \le \pi$ and something of interest in the sample. The lowest frequency we could observe would be once in the entire sample period, so we map $\omega_1$ to $2\pi/T$. The highest would then be $\omega_T = 2\pi$, and the intervening values will be $2\pi j/T$, $j = 2, \ldots, T - 1$. It may be more convenient to think in terms of period rather than frequency. The number of periods per cycle will correspond to $T/j = 2\pi/\omega_j$. Thus, the lowest frequency, $\omega_1$, corresponds to the highest period, $T$ "dates" (months, quarters, years, etc.). There are a number of ways to estimate the population spectral density function. The obvious way is the sample counterpart to the population spectrum. The sample of $T$ observations provides the variance and $T - 1$ distinct sample autocovariances

    $$c_k = c_{-k} = \frac{1}{T}\sum_{t=k+1}^{T} (y_t - \bar y)(y_{t-k} - \bar y), \quad \bar y = \frac{1}{T}\sum_{t=1}^{T} y_t, \quad k = 0, 1, \ldots, T - 1,$$

    so we can compute the sample periodogram, which is

    $$\hat h_Y(\omega_j) = \frac{1}{2\pi}\left[c_0 + 2\sum_{k=1}^{T-1} c_k\cos(\omega_j k)\right].$$

    The sample periodogram is a natural estimator of the spectrum, but it has a statistical flaw. With the sample variance and the $T - 1$ autocovariances, we are estimating $T$ parameters with $T$ observations. The periodogram is, in the end, $T$ transformations of these $T$ estimates. As such, there are no "degrees of freedom"; the estimator does not improve as the sample size increases. A number of methods have been suggested for improving the behavior of the estimator. Two common ways are truncation and

    Greene 50240

    book

    June 27 2002

    21 11

    628

    CHAPTER 20 Time Series Models

    windowing see Chat eld 1996 pp 139 143 The truncated estimator of the periodogram is based on a subset of the rst L T autocovariances The choice of L is a problem because there is no theoretical guidance Chat eld 1996 suggests L approximately equal to 2 T is large enough to provide resolution while removing some of the sampling variation induced by the long lags in the untruncated estimator The second mechanism for improving the properties of the estimator is a set of weights called a lag window The revised estimator is hY 1 w 0 c0 2 2
    L

    wkck cos k
    k 1

    where the set of weights wk k 0 L is the lag window One choice for the weights is the Bartlett window which produces hY Bartlett 1 c0 2 2
    L

    w k L ck cos k
    k 1

    w k L 1

    k L 1

    Note that this result is the same set of weights used in the Newey-West robust covariance matrix estimator in Chapter 12, with essentially the same motivation. Two others that are commonly used are the Tukey window, which has $w_k = \frac{1}{2}[1 + \cos(\pi k/L)]$, and the Parzen window, $w_k = 1 - 6[(k/L)^2 - (k/L)^3]$ if $k \le L/2$, and $w_k = 2(1 - k/L)^3$ otherwise.

    If the series has been modeled as an ARMA process, we can instead compute the fully parametric estimator based on our sample estimates of the roots of the autoregressive and moving-average polynomials. This second estimator would be

    $$\hat{h}_{Y,\text{ARMA}}(\omega) = \frac{\hat{\sigma}_{\varepsilon}^2}{2\pi}\, \frac{\prod_{j=1}^{q} \left[ 1 + d_j^2 - 2 d_j \cos(\omega k) \right]}{\prod_{i=1}^{p} \left[ 1 + a_i^2 - 2 a_i \cos(\omega k) \right]}.$$

    Others have been suggested. [See Chatfield (1996, Chap. 7).] Finally, with the empirical estimate of the spectrum, the variance decomposition can be approximated by summing the values around the frequencies of interest.
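    As a hedged illustration of the lag-window estimator, the sketch below codes the Bartlett, Tukey, and Parzen weights as stated above and applies them to the first L autocovariances. The function names, the simulated series, and the choice L = round(2 sqrt(T)) (following the Chatfield suggestion quoted in the text) are assumptions made for the example.

```python
import numpy as np

def autocov(y, max_lag):
    """Sample autocovariances c_0, ..., c_{max_lag} with divisor T."""
    y = np.asarray(y, dtype=float)
    T = y.size
    d = y - y.mean()
    return np.array([np.dot(d[k:], d[:T - k]) / T for k in range(max_lag + 1)])

def bartlett(k, L):
    return 1.0 - np.asarray(k, dtype=float) / (L + 1.0)

def tukey(k, L):
    return 0.5 * (1.0 + np.cos(np.pi * np.asarray(k, dtype=float) / L))

def parzen(k, L):
    r = np.asarray(k, dtype=float) / L
    return np.where(r <= 0.5, 1.0 - 6.0 * (r**2 - r**3), 2.0 * (1.0 - r)**3)

def lag_window_spectrum(y, omega, L, window=bartlett):
    """(1/2pi) [w_0 c_0 + 2 sum_{k=1}^{L} w_k c_k cos(omega k)]."""
    c = autocov(y, L)
    k = np.arange(1, L + 1)
    w0 = float(window(0, L))
    wk = window(k, L)
    return (w0 * c[0] + 2.0 * np.sum(wk * c[1:] * np.cos(omega * k))) / (2.0 * np.pi)

# Hypothetical series of T = 100 observations.
rng = np.random.default_rng(0)
y = rng.standard_normal(100)
L = int(round(2.0 * np.sqrt(y.size)))   # truncation lag, about 2*sqrt(T)

estimates = {
    name: lag_window_spectrum(y, 0.5, L, window=w)
    for name, w in [("bartlett", bartlett), ("tukey", tukey), ("parzen", parzen)]
}
```

    All three windows give weight one to the variance term and taper the weights toward zero as k approaches L, so the long-lag autocovariances contribute little to the estimate.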
    Example 20.3 Spectral Analysis of the Growth Rate of Real GNP

    Appendix Table F20.2 lists quarterly observations on U.S. GNP and the implicit price deflator for GNP for 1950 through 1983. The GNP series, with its upward trend, is obviously nonstationary. We will analyze instead the quarterly growth rate, $100[\log(GNP_t/price_t) - \log(GNP_{t-1}/price_{t-1})]$. Figure 20.2 shows the resulting data. The differenced series has 135 observations. Figure 20.3 plots the sample periodogram, with frequencies scaled so that $\omega_j = (j/T)2\pi$. The figure shows the sample periodogram for j = 1, ..., 67 (since values of the spectrum for j = 68, ..., 134 are a mirror image of the first half, we have omitted them). Figure 20.3 shows peaks at several frequencies. The effect is more easily visualized in terms of the periods of these cyclical components. The second row of labels shows the periods, computed as quarters = T/(2j), where T = 67 quarters. There are distinct masses around 2 to 3 years that correspond roughly to the business cycle of this era. One might also expect seasonal effects in these quarterly data, and there are discernible spikes in the periodogram at about 0.3 year (one quarter). These spikes, however, are minor compared with the other effects in the figure. This is to be expected, because the data are seasonally adjusted already. Finally, there is a pronounced spike at about 6 years in the periodogram. The original data in Figure 20.2 do seem consistent with this result, with substantial recessions coming at intervals of 5 to 7 years from 1953 to 1980.

    To underscore these results, consider what we would obtain if we analyzed the original (log) real GNP series instead of the growth rates. Figure 20.4 shows the raw data. Although there does appear to be some short-run (high-frequency) variation (around a long-run trend, for example), the cyclical variation of this series is obviously dominated by the upward trend. If this series were viewed as a single periodic series, then we would surmise that the period of this cycle would be the entire sample interval. The frequency of the dominant part of this time series seems to be quite close to zero. The periodogram for this series, shown in Figure 20.5, is consistent with that suspicion. By far, the largest component of the spectrum is provided by frequencies close to zero.

    FIGURE 20.2 Growth Rate of U.S. Real GNP, Quarterly, 1953 to 1984. [Plot of GNPGRWTH against Quarter, 1950 to 1985.]

    FIGURE 20.3 Sample Periodogram. [Plot of SPECTRUM against j, with a second row of axis labels giving the period in quarters.]

    FIGURE 20.4 Quarterly Data on Real GNP. [Plot of REAL GNP against Quarter, 1950 to 1985.]

    FIGURE 20.5 Spectrum for Real GNP. [Plot of SPECTRUM against k.]

    A Computational Note. The computation in (20-16) or (20-17) is the discrete Fourier transform of the series of autocovariances. In principle, it involves an enormous amount of computation, on the order of $T^2$ sets of computations. For ordinary time series involving up to a few hundred observations, this work is not particularly onerous. (The preceding computations involving 135 observations took a total of perhaps 20 seconds of computing time.) For series involving multiple thousands of observations, such as daily market returns, or far more, such as in recorded exchange rates and forward premiums, the amount of computation could become prohibitive. However, the computation can be done using an important tool, the fast Fourier transform (FFT), that reduces the computational level to $O(T \log_2 T)$, which is many orders of magnitude less than $T^2$. The FFT is programmed in some econometric software packages, such as RATS and Matlab. [See Press et al. (1986) for further discussion.]
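    A short sketch of the point made in this note, under stated assumptions: it computes the sample periodogram at the frequencies $2\pi j/T$ twice, once directly from the autocovariances (order $T^2$ work) and once from NumPy's FFT of the demeaned series, using the identity $c_0 + 2\sum_k c_k \cos(\omega k) = (1/T)|\sum_t (y_t - \bar{y}) e^{-i\omega t}|^2$. The function names and the simulated 135-observation series are illustrative only.

```python
import numpy as np

def periodogram_direct(y):
    """Periodogram at omega_j = 2*pi*j/T, j = 0..T-1, from autocovariances."""
    y = np.asarray(y, dtype=float)
    T = y.size
    d = y - y.mean()
    c = np.array([np.dot(d[k:], d[:T - k]) / T for k in range(T)])
    omegas = 2.0 * np.pi * np.arange(T) / T
    k = np.arange(1, T)
    # T x (T-1) matrix of cosines: this is the O(T^2) step.
    return (c[0] + 2.0 * np.cos(np.outer(omegas, k)) @ c[1:]) / (2.0 * np.pi)

def periodogram_fft(y):
    """Same ordinates from the FFT of the demeaned series, O(T log T)."""
    y = np.asarray(y, dtype=float)
    d = y - y.mean()
    return np.abs(np.fft.fft(d)) ** 2 / (2.0 * np.pi * y.size)

# Hypothetical series with the same length as the growth-rate data.
rng = np.random.default_rng(0)
y = rng.standard_normal(135)

h_direct = periodogram_direct(y)
h_fft = periodogram_fft(y)
max_abs_diff = float(np.max(np.abs(h_direct - h_fft)))
```

    The ordinates for j = 68, ..., 134 repeat those for j = 67, ..., 1 in reverse order, which is the mirror image mentioned in Example 20.3.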

    20.3 NONSTATIONARY PROCESSES AND UNIT ROOTS

    Most economic variables that exhibit strong trends, such as GDP, consumption, or the price level, are not stationary and are thus not amenable to the analysis of the previous section. In many cases, stationarity can be achieved by simple differencing or some other transformation. But new statistical issues arise in analyzing nonstationary series that are understated by this superficial observation.

    20.3.1 INTEGRATED PROCESSES AND DIFFERENCING

    A process that figures prominently in recent work is the random walk with drift,

    $$y_t = \mu + y_{t-1} + \varepsilon_t.$$

    By direct substitution,

    $$y_t = \sum_{i=0}^{\infty} (\mu + \varepsilon_{t-i}).$$

    That is, $y_t$ is the simple sum of what will eventually be an infinite number of random variables, possibly with nonzero mean. If the innovations are being generated by the same zero-mean, constant-variance distribution, then the variance of $y_t$ would obviously be infinite. As such, the random walk is clearly a nonstationary process, even if $\mu$ equals zero. On the other hand, the first difference of $y_t$,

    $$z_t = y_t - y_{t-1} = \mu + \varepsilon_t,$$

    is simply the innovation plus the mean of $z_t$, which we have already assumed is stationary.

    The series $y_t$ is said to be integrated of order one, denoted I(1), because taking a first difference produces a stationary process. A nonstationary series is integrated of order d, denoted I(d), if it becomes stationary after being first differenced d times. A further generalization of the ARMA model discussed in Section 20.2.1 would be the series

    $$z_t = (1 - L)^d y_t = \Delta^d y_t.$$


    The resulting model is denoted an autoregressive integrated moving-average model, or ARIMA(p, d, q) [14]. In full, the model would be

    $$\Delta^d y_t = \mu + \gamma_1 \Delta^d y_{t-1} + \gamma_2 \Delta^d y_{t-2} + \cdots + \gamma_p \Delta^d y_{t-p} + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \cdots - \theta_q \varepsilon_{t-q},$$

    where $\Delta y_t = y_t - y_{t-1} = (1 - L) y_t$. This result may be written compactly as

    $$C(L)\left[(1 - L)^d y_t\right] = \mu + D(L)\varepsilon_t,$$

    where C(L) and D(L) are the polynomials in the lag operator and $(1 - L)^d y_t = \Delta^d y_t$ is the dth difference of $y_t$.

    An I(1) series in its raw (undifferenced) form will typically be constantly growing, or wandering about with no tendency to revert to a fixed mean. Most macroeconomic flows and stocks that relate to population size, such as output or employment, are I(1). An I(2) series is growing at an ever-increasing rate. The price-level data in Appendix Table F20.2 and shown below appear to be I(2). Series that are I(3) or greater are extremely unusual, but they do exist. Among the few manifestly I(3) series that could be listed, one would find, for example, the money stocks or price levels in hyperinflationary economies such as interwar Germany or Hungary after World War II.
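    The definition of an I(d) series can be illustrated numerically. The sketch below is only an illustration under assumed values (the drift of 0.2, the sample size, and the variable names are not from the text): it builds an I(1) series by cumulating drift plus white noise, builds an I(2) series by cumulating again, and then recovers the stationary component by differencing once and twice with `numpy.diff`.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 0.2                               # assumed drift
eps = rng.standard_normal(200)         # white-noise innovations

y1 = np.cumsum(mu + eps)               # I(1): random walk with drift
y2 = np.cumsum(y1)                     # I(2): cumulated random walk

d1 = np.diff(y1, n=1)                  # (1 - L) y1_t
d2 = np.diff(y2, n=2)                  # (1 - L)^2 y2_t

# Each differenced series is the stationary series mu + eps_t,
# apart from the observations lost at the start of the sample.
first_diff_ok = np.allclose(d1, mu + eps[1:])
second_diff_ok = np.allclose(d2, mu + eps[2:])
```

    One observation is lost for each round of differencing, which is why the comparison drops the first one or two innovations.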
    Example 20.4 A Nonstationary Series

    The nominal GDP and price deflator variables in Appendix Table F20.2 are strongly trended, so the mean is changing over time. Figures 20.6 through 20.8 plot the log of the GDP deflator series in Table F20.2 and its first and second differences. The original series and first differences are obviously nonstationary, but the second differencing appears to have rendered the series stationary. The first 10 autocorrelations of the log of the GDP deflator series are shown in Table 20.3. The autocorrelations of the original series show the signature of a strongly trended, nonstationary series. The first difference also exhibits nonstationarity, because the autocorrelations are still very large after a lag of 10 periods. The second difference appears to be stationary, with mild negative autocorrelation at the first lag, but essentially none after that. Intuition might suggest that further differencing would reduce the autocorrelation further, but it would be incorrect. We leave as an exercise to show that, in fact, for values of $\gamma$ less than about 0.5, first differencing of an AR(1) process actually increases autocorrelation.

    20.3.2 RANDOM WALKS, TRENDS, AND SPURIOUS REGRESSIONS

    In a seminal paper, Granger and Newbold (1974) argued that researchers had not paid sufficient attention to the warning of very high autocorrelation in the residuals from conventional regression models. Among their conclusions were that macroeconomic data, as a rule, were integrated and that in regressions involving the levels of such data, the standard significance tests were usually misleading. The conventional t and F tests would tend to reject the hypothesis of no relationship when, in fact, there might be none.

    [14] There are yet further refinements one might consider, such as removing seasonal effects from $z_t$ by differencing by quarter or month. See Harvey (1990) and Davidson and MacKinnon (1993). Some recent work has relaxed the assumption that d is an integer. The fractionally integrated series, or ARFIMA, has been used to model series in which the very long-run multipliers decay more slowly than would be predicted otherwise. See Section 20.3.5.

    FIGURE 20.6 Quarterly Data on log GDP Deflator. [Plot of Logprice against Quarter, 1950 to 1985.]

    FIGURE 20.7 First Difference of log GDP Deflator. [Plot of d log p against Quarter, 1950 to 1985.]

    FIGURE 20.8 Second Difference of log GNP Deflator. [Plot of d2 log p against Quarter, 1950 to 1985.]

    TABLE 20.3 Autocorrelations for ln GNP Deflator

           Autocorrelation Function   Autocorrelation Function         Autocorrelation Function
    Lag    Original Series, log Price First Difference of log Price    Second Difference of log Price
     1          1.000                      0.812                           -0.395
     2          1.000                      0.765                            0.112
     3          0.999                      0.776                            0.258
     4          0.999                      0.682                            0.101
     5          0.999                      0.631                            0.022
     6          0.998                      0.592                            0.076
     7          0.998                      0.523                            0.163
     8          0.997                      0.513                            0.052
     9          0.997                      0.488                            0.054
    10          0.997                      0.491                            0.062

    The general result at the center of these findings is that conventional linear regression, ignoring serial correlation, of one random walk on another is virtually certain to suggest a significant relationship, even if the two are, in fact, independent. Among their extreme conclusions, Granger and Newbold suggested that researchers use a critical t value of 11.2 rather than the standard normal value of 1.96 to assess the significance of a coefficient estimate. Phillips (1986) took strong issue with this conclusion. Based on a more general model and on an analytical rather than a Monte Carlo approach, he suggested that the normalized statistic $t_\beta/\sqrt{T}$ be used for testing purposes rather than $t_\beta$ itself. For the 50 observations used by Granger and Newbold, the appropriate critical value would be close to 15! If anything, Granger and Newbold were too optimistic.

    The random walk with drift,

    $$z_t = \mu + z_{t-1} + \varepsilon_t, \qquad (20\text{-}18)$$

    and the trend stationary process,

    $$z_t = \mu + \beta t + \varepsilon_t, \qquad (20\text{-}19)$$


    where in both cases $\varepsilon_t$ is a white noise process, appear to be reasonable characterizations of many macroeconomic time series [15]. Clearly both of these will produce strongly trended, nonstationary series [16], so it is not surprising that regressions involving such variables almost always produce significant relationships. The strong correlation would seem to be a consequence of the underlying trend, whether or not there really is any regression at work. But Granger and Newbold went a step further. The intuition is less clear if there is a pure random walk at work,

    $$z_t = z_{t-1} + \varepsilon_t, \qquad (20\text{-}20)$$

    but even here, they found that regression "relationships" appear to persist even in unrelated series.

    Each of these three series is characterized by a unit root. In each case, the data-generating process (DGP) can be written

    $$(1 - L) z_t = \alpha + v_t, \qquad (20\text{-}21)$$

    where $\alpha = \mu$, $\beta$, and 0, respectively, and $v_t$ is a stationary process. Thus, the characteristic equation has a single root equal to one, hence the name. The upshot of Granger and Newbold's and Phillips's findings is that the use of data characterized by unit roots has the potential to lead to serious errors in inferences.

    In all three settings, differencing or detrending would seem to be a natural first step. On the other hand, it is not going to be immediately obvious which is the correct way to proceed (the data are strongly trended in all three cases), and taking the incorrect approach will not necessarily improve matters. For example, first differencing in (20-18) or (20-20) produces a white noise series, but first differencing in (20-19) trades the trend for autocorrelation in the form of an MA(1) process. On the other hand, detrending, that is, computing the residuals from a regression on time, is obviously counterproductive in (20-18) and (20-20), even though the regression of $z_t$ on a trend will appear to be significant for the reasons we have been discussing, whereas detrending in (20-19) appears to be the right approach [17]. Since none of these approaches is likely to be obviously preferable at the outset, some means of choosing is necessary. Consider nesting all three models in a single equation,

    $$z_t = \mu + \beta t + z_{t-1} + \varepsilon_t.$$

    Now subtract $z_{t-1}$ from both sides of the equation and introduce the artificial parameter $\gamma$:

    $$z_t - z_{t-1} = \mu\gamma + \beta\gamma t + (\gamma - 1) z_{t-1} + \varepsilon_t = \alpha_0 + \alpha_1 t + (\gamma - 1) z_{t-1} + \varepsilon_t, \qquad (20\text{-}22)$$

    [15] The analysis to follow has been extended to more general disturbance processes, but that complicates matters substantially. In this case, in fact, our assumption does cost considerable generality, but the extension is beyond the scope of our work. Some references on the subject are Phillips and Perron (1988) and Davidson and MacKinnon (1993).

    [16] The constant term $\mu$ produces the deterministic trend in the random walk with drift. For convenience, suppose that the process starts at time zero. Then $z_t = \sum_{s=0}^{t} (\mu + \varepsilon_s) = \mu t + \sum_{s=0}^{t} \varepsilon_s$. Thus, $z_t$ consists of a deterministic trend plus a stochastic trend consisting of the sum of the innovations. The result is a variable with increasing variance around a linear trend.

    [17] See Nelson and Kang (1984).


    where, by hypothesis, $\gamma = 1$. Equation (20-22) provides the basis for a variety of tests for unit roots in economic data. In principle, a test of the hypothesis that $\gamma - 1$ equals zero gives confirmation of the random walk with drift, since if $\gamma$ equals 1 (and $\alpha_1$ equals zero), then (20-18) results. If $\gamma - 1$ is less than zero, then the evidence favors the trend stationary (or some other) model, and detrending (or some alternative) is the preferable approach. The practical difficulty is that standard inference procedures based on least squares and the familiar test statistics are not valid in this setting. The issue is discussed in the next section.
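    The spurious regression result discussed in this section can be illustrated with a small Monte Carlo experiment. The sketch below is an illustration under assumed settings, not a replication of Granger and Newbold's design: the sample size of 50 matches the figure quoted in the text, but the number of replications, the seed, and all function names are choices made for the example. It regresses one simulated series on another, independent one and records how often the conventional t ratio on the slope exceeds 1.96 in absolute value.

```python
import numpy as np

def slope_t_stat(y, x):
    """Conventional OLS t ratio on the slope in a regression of y on (1, x)."""
    T = y.size
    X = np.column_stack([np.ones(T), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    s2 = e @ e / (T - 2)
    cov = s2 * np.linalg.inv(X.T @ X)
    return b[1] / np.sqrt(cov[1, 1])

def rejection_rate(make_series, T=50, reps=500, crit=1.96, seed=0):
    """Share of replications in which |t| exceeds the nominal 5% value."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        y = make_series(rng, T)   # the two series are drawn independently
        x = make_series(rng, T)
        hits += int(abs(slope_t_stat(y, x)) > crit)
    return hits / reps

def random_walk(rng, T):
    return np.cumsum(rng.standard_normal(T))

def white_noise(rng, T):
    return rng.standard_normal(T)

rate_rw = rejection_rate(random_walk)   # independent random walks
rate_wn = rejection_rate(white_noise)   # independent stationary series
```

    With independent white noise series the rejection frequency should sit near the nominal 5 percent, while for independent random walks it should be far higher, which is the pattern the text describes.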
    20.3.3 TESTS FOR UNIT ROOTS IN ECONOMIC DATA

    The implications of unit roots in macroeconomic data are, at least potentially, profound. If a structural variable, such as real output, is truly I(1), then shocks to it will have permanent effects. If confirmed, then this observation would mandate some rather serious reconsideration of the analysis of macroeconomic policy. For example, the argument that a change in monetary policy could have a transitory effect on real output would vanish [18]. The literature is not without its skeptics, however. This result rests on a razor's edge. Although the literature is thick with tests that have failed to reject the hypothesis that $\gamma = 1$, many have also not rejected the hypothesis that $\gamma \ge 0.95$, and at 0.95 (or even at 0.99), the entire issue becomes moot [19].

    Consider the simple AR(1) model with zero-mean, white noise innovations,

    $$y_t = \gamma y_{t-1} + \varepsilon_t.$$

    The downward bias of the least squares estimator when $\gamma$ approaches one has been widely documented [20]. For $|\gamma| < 1$, however, the least squares estimator

    $$c = \frac{\sum_{t=2}^{T} y_t y_{t-1}}{\sum_{t=2}^{T} y_{t-1}^2}$$

    does have $\text{plim}\ c = \gamma$ and

    $$\sqrt{T}(c - \gamma) \xrightarrow{d} N[0, 1 - \gamma^2].$$

    Does the result hold up if $\gamma = 1$? The case is called the unit root case, since in the ARMA representation $C(L) y_t = \varepsilon_t$, the characteristic equation $1 - \gamma z = 0$ has one root equal to one. That the limiting variance appears to go to zero should raise suspicions. The literature on the questions dates back to Mann and Wald (1943) and Rubin (1950). But for econometric purposes, the literature has a focal point at the celebrated papers of Dickey and Fuller (1979, 1981). They showed that if $\gamma$ equals one, then

    $$T(c - \gamma) \xrightarrow{d} v,$$

    where v is a random variable with finite, positive variance, and in finite samples, E[c] < 1 [21].

    There are two important implications in the Dickey-Fuller results. First, the estimator of $\gamma$ is biased downward if $\gamma$ equals one. Second, the OLS estimator of $\gamma$ converges to its probability limit more rapidly than the estimators to which we are accustomed. That is, the variance of c under the null hypothesis is $O(1/T^2)$, not $O(1/T)$. (In a mean-squared-error sense, the OLS estimator is superconsistent.) It turns out that the implications of this finding for the regressions with trended data are considerable.

    We have already observed that in some cases, differencing or detrending is required to achieve stationarity of a series. Suppose, though, that the AR(1) model above is fit to an I(1) series, despite that fact. The upshot of the preceding discussion is that the conventional measures will tend to hide the true value of $\gamma$; the sample estimate is biased downward, and by dint of the very small true sampling variance, the conventional t test will tend, incorrectly, to reject the hypothesis that $\gamma = 1$. The practical solution to this problem devised by Dickey and Fuller was to derive, through Monte Carlo methods, an appropriate set of critical values for testing the hypothesis that $\gamma$ equals one in an AR(1) regression when there truly is a unit root. One of their general results is that the test may be carried out using a conventional t statistic, but the critical values for the test must be revised; the standard t table is inappropriate. A number of variants of this form of testing procedure have been developed. We will consider several of them.

    [18] The 1980s saw the appearance of literally hundreds of studies, both theoretical and applied, of unit roots in economic data. An important example is the seminal paper by Nelson and Plosser (1982). There is little question but that this observation is an early part of the radical paradigm shift that has occurred in empirical macroeconomics.

    [19] A large number of issues are raised in Maddala (1992, pp. 582-588).

    [20] See, for example, Evans and Savin (1981, 1984).
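    The two implications of the Dickey-Fuller results can be checked by simulation. The following sketch is an assumption-laden illustration (the sample size, number of replications, seed, and function name are arbitrary choices, and the output is not meant to reproduce the published tables): it generates pure random walks, fits the AR(1) regression without a constant, and collects the estimate c and the conventional t ratio for the hypothesis that the coefficient equals one.

```python
import numpy as np

def ar1_ols(y):
    """OLS of y_t on y_{t-1} (no constant): returns c and the t ratio for H0: gamma = 1."""
    y_lag, y_cur = y[:-1], y[1:]
    c = (y_lag @ y_cur) / (y_lag @ y_lag)
    e = y_cur - c * y_lag
    s2 = e @ e / (e.size - 1)
    se = np.sqrt(s2 / (y_lag @ y_lag))
    return c, (c - 1.0) / se

rng = np.random.default_rng(42)
T, reps = 100, 2000

# Each replication is a random walk, so the true gamma equals one.
draws = np.array([ar1_ols(np.cumsum(rng.standard_normal(T))) for _ in range(reps)])
c_hat, t_ratio = draws[:, 0], draws[:, 1]

mean_c = float(c_hat.mean())                 # average estimate; biased below one
q05_t = float(np.quantile(t_ratio, 0.05))    # simulated 5% point of the t ratio
```

    In runs of this kind the average estimate falls below one, and the lower 5 percent point of the t ratio lies well to the left of the standard normal value of -1.645, which is why the conventional table rejects the unit root too often.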
    20.3.4 THE DICKEY-FULLER TESTS
    The simplest version of the model to be analyzed is the random walk,

    $$y_t = \gamma y_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim N[0, \sigma^2], \qquad \text{Cov}[\varepsilon_t, \varepsilon_s] = 0 \ \ \forall\ t \ne s.$$

    Under the null hypothesis that $\gamma = 1$, there are two approaches to carrying out the test. The conventional t ratio

    $$DF_\tau = \frac{\hat{\gamma} - 1}{\text{Est.Std.Error}(\hat{\gamma})}$$

    with the revised set of critical values may be used for a one-sided test. Critical values for this test are shown in the top panel of Table 20.4. Note that in general, the critical value is considerably larger in absolute value than its counterpart from the t distribution. The second approach is based on the statistic

    $$DF_\gamma = T(\hat{\gamma} - 1).$$

    Critical values for this test are shown in the top panel of Table 20.5.

    The simple random walk model is inadequate for many series. Consider the rate of inflation from 1950.2 to 2000.4 (plotted in Figure 20.9) and the log of GDP over the same period (plotted in Figure 20.10). The first of these may be a random walk, but it is clearly drifting. The log GDP series, in contrast, has a strong trend. For the first of these, a random walk with drift may be specified,

    $$y_t = \mu + z_t, \quad z_t = \gamma z_{t-1} + \varepsilon_t \qquad \text{or} \qquad y_t = \mu(1 - \gamma) + \gamma y_{t-1} + \varepsilon_t.$$

    For the second type of series, we may specify the trend stationary form,

    $$y_t = \mu + \beta t + z_t, \quad z_t = \gamma z_{t-1} + \varepsilon_t \qquad \text{or} \qquad y_t = [\mu(1 - \gamma) + \gamma\beta] + \beta(1 - \gamma) t + \gamma y_{t-1} + \varepsilon_t.$$

    The tests for these forms may be carried out in the same fashion. For the model with drift only, the center panels of Tables 20.4 and 20.5 are used. When the trend is included, the lower panel of each table is used.

    [21] A full derivation of this result is beyond the scope of this book. For the interested reader, a fairly comprehensive treatment at an accessible level is given in Chapter 17 of Hamilton (1994, pp. 475-542).

    TABLE 20.4 Critical Values for the Dickey-Fuller $DF_\tau$ Test

                                                 Sample Size
                                        25       50      100   Infinity
    F ratio (D-F) (a)                  7.24     6.73     6.49     6.25
    F ratio (standard)                 3.42     3.20     3.10     3.00
    AR model (b) (random walk)
      0.01                            -2.66    -2.62    -2.60    -2.58
      0.025                           -2.26    -2.25    -2.24    -2.23
      0.05                            -1.95    -1.95    -1.95    -1.95
      0.10                            -1.60    -1.61    -1.61    -1.62
      0.975                            1.70     1.66     1.64     1.62
    AR model with constant (random walk with drift)
      0.01                            -3.75    -3.59    -3.50    -3.42
      0.025                           -3.33    -3.23    -3.17    -3.12
      0.05                            -2.99    -2.93    -2.90    -2.86
      0.10                            -2.64    -2.60    -2.58    -2.57
      0.975                            0.34     0.29     0.26     0.23
    AR model with constant and time trend (trend stationary)
      0.01                            -4.38    -4.15    -4.04    -3.96
      0.025                           -3.95    -3.80    -3.69    -3.66
      0.05                            -3.60    -3.50    -3.45    -3.41
      0.10                            -3.24    -3.18    -3.15    -3.13
      0.975                           -0.50    -0.58    -0.62    -0.66

    (a) From Dickey and Fuller (1981, p. 1063). Degrees of freedom are 2 and T - p - 3.
    (b) From Fuller (1976, p. 373 and 1996, Table 10.A.2).

    FIGURE 20.9 Rate of Inflation in the Consumer Price Index. [Plot titled "Rate of Inflation, 1950.2 to 2000.4": Chg.CPIU against Quarter, 1950 to 2002.]

    FIGURE 20.10 Log of Gross Domestic Product. [Plot titled "Log of GDP, 1950.1 to 2000.4": LogGDP against Quarter, 1950 to 2002.]

    Example 20.5 Tests for Unit Roots

    In Section 19.6.8, we examined Cecchetti and Rich's study of the effect of recent monetary policy on the U.S. economy. The data used in their study were the following variables:

    $\pi$ = one-period rate of inflation = the rate of change in the CPI,
    $y$ = log of real GDP,
    $i$ = nominal interest rate = the quarterly average yield on a 90-day T-bill,
    $\Delta m$ = change in the log of the money stock, M1,
    $i - \pi$ = ex post real interest rate,
    $\Delta m - \pi$ = real growth in the money stock.

    TABLE 20.5 Critical Values for the Dickey-Fuller $DF_\gamma$ Test

                                                 Sample Size
                                        25       50      100   Infinity
    AR model (a) (random walk)
      0.01                            -11.8    -12.8    -13.3    -13.8
      0.025                            -9.3     -9.9    -10.2    -10.5
      0.05                             -7.3     -7.7     -7.9     -8.1
      0.10                             -5.3     -5.5     -5.6     -5.7
      0.975                            1.78     1.69     1.65     1.60
    AR model with constant (random walk with drift)
      0.01                            -17.2    -18.9    -19.8    -20.7
      0.025                           -14.6    -15.7    -16.3    -16.9
      0.05                            -12.5    -13.3    -13.7    -14.1
      0.10                            -10.2    -10.7    -11.0    -11.3
      0.975                            0.65     0.53     0.47     0.41
    AR model with constant and time trend (trend stationary)
      0.01                            -22.5    -25.8    -27.4    -29.4
      0.025                           -20.0    -22.4    -23.7    -24.4
      0.05                            -17.9    -19.7    -20.6    -21.7
      0.10                            -15.6    -16.8    -17.5    -18.3
      0.975                           -1.53   -1.667    -1.74    -1.81

    (a) From Fuller (1976, p. 373 and 1996, Table 10.A.1).

    Data used in their analysis were from the period 1959.1 to 1997.4. As part of their analysis, they checked each of these series for a unit root and suggested that the hypothesis of a unit root could only be rejected for the last two variables. We will reexamine these data for the longer interval, 1950.2 to 2000.4. The data are in Appendix Table F5.1. Figures 20.11 to 20.14 show the behavior of the last four variables. The first two are shown above in Figures 20.9 and 20.10. Only the real output figure shows a strong trend, so we will use the random walk with drift for all the variables except this one.

    The Dickey-Fuller tests are carried out in Table 20.6. There are 202 observations used in each one. The first observation is lost when computing the rate of inflation and the change in the money stock, and one more is lost for the difference term in the regression. The critical values from interpolating to the second row, last column in each panel for 95 percent significance and a one-tailed test are -3.70 and -24.2, respectively, for $DF_\tau$ and $DF_\gamma$ for the output equation, which contains the time trend, and -3.14 and -16.8 for the other equations, which contain a constant but no trend. For the output equation (y), the test statistics are

    $$DF_\tau = \frac{0.9584940384 - 1}{0.017880922} = -2.32 > -3.44$$

    and

    $$DF_\gamma = 202(0.9584940384 - 1) = -8.38 > -21.2.$$

    Neither is less than the critical value, so we conclude (as have others) that there is a unit root in the log GDP process. The results of the other tests are shown in Table 20.6. Surprisingly, these results do differ sharply from those obtained by Cecchetti and Rich (2001) for $\pi$ and $\Delta m$. The sample period appears to matter; if we repeat the computation using Cecchetti and Rich's interval, 1959.4 to 1997.4, then $DF_\tau$ equals -3.51. This is borderline, but less contradictory. For $\Delta m$ we obtain a value of -4.204 for $DF_\tau$ when the sample is restricted to the shorter interval.
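    The arithmetic for the output equation can be reproduced in a few lines. The sketch below uses only the estimate, standard error, sample size, and comparison values quoted in the example; the variable names are introduced here for illustration.

```python
# Values quoted in Example 20.5 for the output equation (y).
gamma_hat = 0.9584940384     # estimated coefficient on y_{t-1}
std_err = 0.017880922        # its estimated standard error
T = 202                      # observations used in the regression

df_tau = (gamma_hat - 1.0) / std_err     # t-ratio form of the test
df_gamma = T * (gamma_hat - 1.0)         # normalized-coefficient form

# Comparison values used in the example's displayed inequalities.
crit_tau, crit_gamma = -3.44, -21.2

# The unit root is rejected only if a statistic falls below its critical value.
reject_unit_root = (df_tau < crit_tau) or (df_gamma < crit_gamma)
```

    Both statistics lie above (to the right of) their comparison values, so the one-sided test does not reject the unit root for log GDP.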

    FIGURE 20.11 T-Bill Rate. [Plot of T-bill Rate against Quarter, 1950 to 2002.]

    FIGURE 20.12 Change in the Money Stock. [Plot of the change in M1 against Quarter, 1950 to 2002.]

    FIGURE 20.13 Ex Post Real T-Bill Rate. [Plot of Real Interest Rate against Quarter, 1950 to 2002.]

    FIGURE 20.14 Change in the Real Money Stock. [Plot of the change in Real M1 against Quarter, 1950 to 2002.]

    TABLE 20.6 Unit Root Tests (Standard errors of estimates in parentheses)

    Variable          mu                beta                 gamma             DF_tau   DF_gamma   R^2     s       Conclusion
    pi                0.332 (0.0696)                         0.659 (0.0532)    -6.40    -68.88     0.432   0.643   Reject H0
    y                 0.320 (0.134)     0.00033 (0.00015)    0.958 (0.0179)    -2.35     -8.48     0.999   0.001   Do not reject H0
    i                 0.228 (0.109)                          0.961 (0.0182)    -2.14     -7.88     0.933   0.743   Do not reject H0
    Delta m           0.448 (0.0923)                         0.596 (0.0573)    -7.05    -81.61     0.351   0.929   Reject H0
    i - pi            0.615 (0.185)                          0.557 (0.0585)    -7.57    -89.49     0.311   2.395   Reject H0
    Delta m - pi      0.0700 (0.0833)                        0.490 (0.0618)    -8.25   -103.02     0.239   1.176   Reject H0

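    The Dickey-Fuller tests described above assume that the disturbances in the model as stated are white noise. An extension which will accommodate some forms of serial correlation is the augmented Dickey-Fuller test. The augmented Dickey-Fuller test is the same one as above, carried out in the context of the model

    $$y_t = \mu + \beta t + \gamma y_{t-1} + \gamma_1 \Delta y_{t-1} + \cdots + \gamma_p \Delta y_{t-p} + \varepsilon_t.$$

    The random walk form is obtained by imposing $\mu = 0$ and $\beta = 0$; the random walk with drift has $\beta = 0$; and the trend stationary model leaves both parameters free. The two test statistics are

    $$DF_\tau = \frac{\hat{\gamma} - 1}{\text{Est.Std.Error}(\hat{\gamma})},$$

    exactly as constructed before, and

    $$DF_\gamma = \frac{T(\hat{\gamma} - 1)}{1 - \hat{\gamma}_1 - \cdots - \hat{\gamma}_p}.$$

    The advantage of this formulation is that it can accommodate higher-order autoregressive processes in $\varepsilon_t$.

    An alternative formulation may prove convenient. By subtracting $y_{t-1}$ from both sides of the equation, we obtain

    $$\Delta y_t = \mu + \gamma^* y_{t-1} + \sum_{j=1}^{p-1} \phi_j \Delta y_{t-j} + \varepsilon_t,$$

    where

    $$\phi_j = -\sum_{k=j+1}^{p} \gamma_k \qquad \text{and} \qquad \gamma^* = \left( \sum_{i=1}^{p} \gamma_i \right) - 1.$$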
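    The mapping from autoregressive coefficients to $\gamma^*$ and $\phi_j$ can be verified numerically. The sketch below rests on an assumption that should be stated plainly: it takes the levels form to be an AR(p) process, $y_t = \mu + \gamma_1 y_{t-1} + \cdots + \gamma_p y_{t-p} + \varepsilon_t$, and checks that the differenced form with $\gamma^*$ and $\phi_j$ reproduces the same series. The coefficient values and the helper name `adf_reparameterize` are hypothetical.

```python
import numpy as np

def adf_reparameterize(gammas):
    """Map (gamma_1, ..., gamma_p) to gamma_star and (phi_1, ..., phi_{p-1})."""
    g = np.asarray(gammas, dtype=float)
    gamma_star = g.sum() - 1.0
    # phi_j = -(gamma_{j+1} + ... + gamma_p); g[j:] starts at gamma_{j+1}.
    phi = np.array([-g[j:].sum() for j in range(1, g.size)])
    return gamma_star, phi

# Hypothetical AR(3) in levels with a constant.
gammas = np.array([0.5, 0.3, 0.1])
mu, p, n = 0.2, 3, 50
rng = np.random.default_rng(3)
eps = rng.standard_normal(n)

y = np.zeros(n)
for t in range(p, n):
    # y[t-p:t][::-1] is (y_{t-1}, y_{t-2}, ..., y_{t-p}).
    y[t] = mu + gammas @ y[t - p:t][::-1] + eps[t]

gamma_star, phi = adf_reparameterize(gammas)

# Compare Delta y_t with the right-hand side of the differenced form.
max_gap = 0.0
for t in range(p, n):
    lagged_dy = np.array([y[t - j] - y[t - j - 1] for j in range(1, p)])
    rhs = mu + gamma_star * y[t - 1] + phi @ lagged_dy + eps[t]
    max_gap = max(max_gap, abs((y[t] - y[t - 1]) - rhs))
```

    Under that assumption the two forms agree to rounding error, and $\gamma^*$ equals zero exactly when the autoregressive coefficients sum to one.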


    The unit root test is carried out as before by testing the null hypothesis $\gamma^* = 0$ against $\gamma^* < 0$ [22]. The t test, $DF_\tau$, may be used. If the failure to reject the unit root is taken as evidence that a unit root is present, i.e., $\gamma^* = 0$, then the model specializes to the AR(p - 1) model in the first differences, which is an ARIMA(p - 1, 1, 0) model for $y_t$. For a model with a time trend,

    $$\Delta y_t = \mu + \beta t + \gamma^* y_{t-1} + \sum_{j=1}^{p-1} \phi_j \Delta y_{t-j} + \varepsilon_t,$$

    the test is carried out by testing the joint hypothesis that $\beta = \gamma^* = 0$. Dickey and Fuller (1981) present counterparts to the critical F statistics for testing the hypothesis. Some of their values are reproduced in the first row of Table 20.4. (Authors frequently focus on $\gamma^*$ and ignore the time trend, maintaining it only as part of the appropriate formulation. In this case, one may use the simple test of $\gamma^* = 0$ as before, with the $DF_\tau$ critical values.)

    The lag length, p, remains to be determined. As usual, we are well advised to test down to the right value instead of up. One can take the familiar approach and sequentially examine the t statistic on the last coefficient; the usual t test is appropriate. An alternative is to combine a measure of model fit, such as the regression $s^2$, with one of the information criteria. The Akaike and Schwartz (Bayesian) information criteria would produce the two information measures

    $$IC(p) = \ln\left( \frac{\mathbf{e}'\mathbf{e}}{T - p_{\max} - K^*} \right) + (p + K^*) \left( \frac{A^*}{T - p_{\max} - K^*} \right),$$

    $K^*$ = 1 for random walk, 2 for random walk with drift, 3 for trend stationary,
    $A^*$ = 2 for Akaike criterion, $\ln(T - p_{\max} - K^*)$ for Bayesian criterion,
    $p_{\max}$ = the largest lag length being considered.

    The remaining detail is to decide upon $p_{\max}$. The theory provides little guidance here. On the basis of a large number of simulations, Schwert (1989) found that $p_{\max}$ = integer part of $[12 \times (T/100)^{0.25}]$ gave good results.

    Many alternatives to the Dickey-Fuller tests have been suggested, in some cases to improve on the finite sample properties and in others to accommodate more general modeling frameworks. The Phillips (1987) and Phillips and Perron (1988) statistic may be computed for the same three functional forms,

    $$y_t = \delta_t + \gamma y_{t-1} + \gamma_1 \Delta y_{t-1} + \cdots + \gamma_p \Delta y_{t-p} + \varepsilon_t, \qquad (20\text{-}23)$$

    where $\delta_t$ may be 0, $\mu$, or $\mu + \beta t$. The procedure modifies the two Dickey-Fuller statistics we examined above:

    $$Z_\tau = \sqrt{\frac{c_0}{a}} \left( \frac{\hat{\gamma} - 1}{v} \right) - \frac{1}{2}(a - c_0) \frac{T v}{\sqrt{a s^2}},$$

    $$Z_\gamma = \frac{T(\hat{\gamma} - 1)}{1 - \hat{\gamma}_1 - \cdots - \hat{\gamma}_p} - \frac{1}{2} \left( \frac{T^2 v^2}{s^2} \right)(a - c_0),$$

    where

    $$s^2 = \frac{\sum_{t=1}^{T} e_t^2}{T - K},$$

    $$v^2 = \text{estimated asymptotic variance of } \hat{\gamma},$$

    $$c_j = \frac{1}{T} \sum_{t=j+1}^{T} e_t e_{t-j}, \quad j = 0, \ldots, p = j\text{th autocovariance of residuals},$$

    $$c_0 = [(T - K)/T] s^2,$$

    $$a = c_0 + 2 \sum_{j=1}^{L} \left( 1 - \frac{j}{L + 1} \right) c_j.$$

    [22] It is easily verified that one of the roots of the characteristic polynomial is $1/(\gamma_1 + \gamma_2 + \cdots + \gamma_p)$.
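    The quantity a is a Bartlett-weighted sum of the residual autocovariances. A minimal sketch of that one ingredient follows; it assumes the residuals are already in hand, and the function names and the toy residual vector are illustrative only. It does not compute the full $Z_\tau$ or $Z_\gamma$ statistics.

```python
import numpy as np

def residual_autocov(e, j):
    """c_j = (1/T) * sum_{t=j+1}^{T} e_t e_{t-j}."""
    e = np.asarray(e, dtype=float)
    T = e.size
    return float(np.dot(e[j:], e[:T - j]) / T)

def long_run_variance(e, L):
    """a = c_0 + 2 * sum_{j=1}^{L} (1 - j/(L+1)) c_j."""
    a = residual_autocov(e, 0)
    for j in range(1, L + 1):
        weight = 1.0 - j / (L + 1.0)      # Bartlett weight
        a += 2.0 * weight * residual_autocov(e, j)
    return a

# Toy residuals with strong negative first-order autocorrelation.
e = np.array([1.0, -1.0, 1.0, -1.0])
c0 = residual_autocov(e, 0)
c1 = residual_autocov(e, 1)
a0 = long_run_variance(e, L=0)   # no correction: equals c_0
a1 = long_run_variance(e, L=1)   # c_0 + 2 * (1/2) * c_1
```

    When the residuals are serially uncorrelated, a is close to $c_0$ and the correction terms in both statistics are small.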

    Note the Newey-West (Bartlett) weights in the computation of a. As before, the analyst must choose L. The test statistics are referred to the same Dickey-Fuller tables we have used before.

    Elliott, Rothenberg, and Stock (1996) have proposed a method they denote the ADF-GLS procedure, which is designed to accommodate more general formulations of $\varepsilon$; the process generating $\varepsilon_t$ is assumed to be an I(0) stationary process, possibly an ARMA(r, s). The null hypothesis, as before, is $\gamma = 1$ in (20-23), where $\delta_t = \mu$ or $\mu + \beta t$. The method proceeds as follows:

    Step 1. Linearly regress

    $$y^* = [y_1,\; y_2 - \bar r y_1,\; \ldots,\; y_T - \bar r y_{T-1}]'$$

    on

    $$X^* = [1,\; 1 - \bar r,\; \ldots,\; 1 - \bar r]' \quad \text{or} \quad X^* = \begin{bmatrix} 1 & 1 - \bar r & \cdots & 1 - \bar r \\ 1 & 2 - \bar r & \cdots & T - \bar r(T - 1) \end{bmatrix}'$$

    for the random walk with drift and trend stationary cases, respectively. (Note that the second column of the matrix is simply $\bar r + (1 - \bar r)t$.) Compute the residuals from this regression, $\tilde y_t = y_t - \hat\delta_t$. Here $\bar r = 1 - 7/T$ for the random walk model and $1 - 13.5/T$ for the model with a trend.

    Step 2. The Dickey-Fuller $DF_\tau$ test can now be carried out using the model

    $$\tilde y_t = \gamma \tilde y_{t-1} + \gamma_1 \Delta \tilde y_{t-1} + \cdots + \gamma_p \Delta \tilde y_{t-p} + \eta_t.$$

    If the model does not contain the time trend, then the t statistic for $(\gamma - 1)$ may be referred to the critical values in the center panel of Table 20.4. For the trend stationary model, the critical values are given in a table presented in Elliott et al. The 97.5 percent critical value for a one-tailed test from their table is -3.15.

    As in many such cases of a new technique, as researchers develop large and small modifications of these tests, the practitioner is likely to have some difficulty deciding how to proceed. The Dickey-Fuller procedures have stood the test of time as robust tools that appear to give good results over a wide range of applications. The Phillips-Perron tests are very general, but appear to have less than optimal small sample properties. Researchers continue to examine them and the others, such as the Elliott et al. method. Other tests are catalogued in Maddala and Kim (1998).
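    Step 1 of the ADF-GLS procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `gls_detrend` and its `trend` flag are hypothetical, NumPy is assumed to be available, and the constants 7 and 13.5 are the ones quoted in the text.

```python
import numpy as np

def gls_detrend(y, trend=False):
    """Quasi-difference y and the deterministic terms, regress, and
    return (y_tilde, delta_hat) where y_tilde = y - X @ delta_hat."""
    y = np.asarray(y, dtype=float)
    T = y.size
    # r_bar = 1 - 7/T (drift only) or 1 - 13.5/T (drift and trend)
    r_bar = 1.0 - (13.5 if trend else 7.0) / T
    t = np.arange(1, T + 1, dtype=float)
    X = np.column_stack([np.ones(T), t]) if trend else np.ones((T, 1))
    # First observation is kept as is; the rest are quasi-differenced
    y_star = y.copy()
    y_star[1:] = y[1:] - r_bar * y[:-1]
    X_star = X.copy()
    X_star[1:] = X[1:] - r_bar * X[:-1]
    delta_hat, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    return y - X @ delta_hat, delta_hat
```

    The returned series would then be passed to the Step 2 regression, which contains no constant or trend because the deterministic part has already been removed.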

    Example 20.6 Augmented Dickey-Fuller Test for a Unit Root in GDP

    The Dickey-Fuller (1981) JASA paper is a classic in the econometrics literature; it is probably the single most frequently cited paper in the field. It seems appropriate, therefore, to revisit at least some of their work. Dickey and Fuller apply their methodology to a model for the log of a quarterly series on output, the Federal Reserve Board Production Index. The model used is

    $$y_t = \mu + \beta t + \gamma y_{t-1} + \phi(y_{t-1} - y_{t-2}) + \varepsilon_t. \qquad (20\text{-}24)$$

    The test is carried out by testing the joint hypothesis that both $\beta$ and $\gamma^*$ are zero in the model

    $$y_t - y_{t-1} = \mu^* + \beta t + \gamma^* y_{t-1} + \phi(y_{t-1} - y_{t-2}) + \varepsilon_t.$$

    (If $\gamma^* = 0$, then $\mu^*$ will be zero also, by construction.) We will repeat the study with our data on real GNP from Appendix Table F5.1 using observations 1950.1 to 2000.4. We will use the augmented Dickey-Fuller test first. Thus, the first step is to determine the appropriate lag length for the augmented regression. Using Schwert's suggestion, we find that the maximum lag length should be allowed to reach $p_{\max}$ = {the integer part of $12[204/100]^{.25}$} = 14. The specification search uses observations 18 to 204, since as many as 17 coefficients will be estimated in the equation

    $$y_t = \mu + \beta t + \gamma y_{t-1} + \sum_{j=1}^{p} \gamma_j \Delta y_{t-j} + \varepsilon_t.$$
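    The lag-length bookkeeping used here is easy to reproduce. The sketch below implements Schwert's rule and the information measure IC(p) as they are stated in the text; the function names `schwert_pmax` and `info_criterion` and their argument names are hypothetical, chosen only for illustration.

```python
import math

def schwert_pmax(T):
    """Integer part of 12 * (T/100)^(1/4)."""
    return int(12 * (T / 100.0) ** 0.25)

def info_criterion(ee, T, pmax, p, K, criterion="aic"):
    """IC(p) = ln(e'e / n) + (p + K) * A / n, with n = T - pmax - K.

    ee        : sum of squared residuals from the regression with p lags
    K         : 1, 2 or 3 (random walk, with drift, trend stationary)
    criterion : "aic" uses A = 2, anything else uses A = ln(n)
    """
    n = T - pmax - K
    A = 2.0 if criterion == "aic" else math.log(n)
    return math.log(ee / n) + (p + K) * A / n

# Sample size of the example: 204 quarterly observations
pmax = schwert_pmax(204)
```

    With T = 204 and K = 3 the denominator T - pmax - K equals 187, which matches the 187 observations (18 to 204) used in the specification search.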

    In the sequence of 14 regressions with j = 14, 13, ..., the only statistically significant lagged difference is the first one, in the last regression, so it would appear that the model used by Dickey and Fuller would be chosen on this basis. The two information criteria produce a similar conclusion. Both of them decline monotonically from j = 14 all the way down to j = 1, so on this basis, we end the search with j = 1, and proceed to analyze Dickey and Fuller's model. The linear regression results for the equation in (20-24) are

    $$y_t = \underset{(0.125)}{0.368} + \underset{(0.000138)}{0.000391}\,t + \underset{(0.0167)}{0.952}\,y_{t-1} + \underset{(0.0647)}{0.36025}\,\Delta y_{t-1} + e_t, \quad s = 0.00912, \quad R^2 = 0.999647.$$

    The two test statistics are

    $$DF_\tau = \frac{0.95166 - 1}{0.016716} = -2.892$$

    and

    $$DF_\gamma = \frac{201(0.95166 - 1)}{1 - 0.36025} = -15.263.$$

    Neither statistic is less than the respective critical values, which are -3.70 and -24.5. On this basis, we conclude, as have many others, that there is a unit root in log GDP. For the Phillips and Perron statistic, we need several additional intermediate statistics. Following Hamilton (1994, page 512), we choose L = 4 for the long-run variance calculation. Other values we need are T = 201, $\hat\gamma$ = 0.9516613, $s^2$ = 0.00008311488, $v^2$ = 0.00027942647, and the first five autocovariances, $c_0$ = 0.000081469, $c_1$ = -0.00000351162, $c_2$ = 0.00000688053, $c_3$ = 0.000000597305, and $c_4$ = -0.00000128163. Applying these to the weighted sum produces a = 0.0000840722, which is only a minor correction to $c_0$. Collecting the results, we obtain the Phillips-Perron statistics, $Z_\tau$ = -2.89921 and $Z_\gamma$ = -15.44133. Since these are applied to the same critical values in the Dickey-Fuller tables, we reach the same conclusion as before: we do not reject the hypothesis of a unit root in log GDP.
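    The Phillips-Perron arithmetic in this example can be checked directly from the intermediate values. The sketch below is illustrative only: the function name `phillips_perron` is hypothetical, and the negative signs on the first and fourth autocovariances are an assumption, adopted because that sign pattern reproduces the reported weighted sum a. The resulting statistics should be close to, though not necessarily identical with, the reported values, since they depend on the sample-size convention used for T.

```python
import math

def phillips_perron(T, gamma_hat, lag_coefs, s2, v2, c, L):
    """Return (a, Z_tau, Z_gamma) from the intermediate statistics.

    c is the list of residual autocovariances c[0], ..., c[L];
    lag_coefs are the coefficients on the lagged differences.
    """
    c0 = c[0]
    # Newey-West (Bartlett) weighted long-run variance
    a = c0 + 2.0 * sum((1.0 - j / (L + 1.0)) * c[j] for j in range(1, L + 1))
    v = math.sqrt(v2)
    z_tau = (math.sqrt(c0 / a) * (gamma_hat - 1.0) / v
             - 0.5 * (a - c0) * T * v / math.sqrt(a * s2))
    z_gamma = (T * (gamma_hat - 1.0) / (1.0 - sum(lag_coefs))
               - 0.5 * (T ** 2 * v2 / s2) * (a - c0))
    return a, z_tau, z_gamma

# Values quoted in Example 20.6 (signs of c[1] and c[4] are assumed)
c = [0.000081469, -0.00000351162, 0.00000688053,
     0.000000597305, -0.00000128163]
a, z_tau, z_gamma = phillips_perron(
    T=201, gamma_hat=0.9516613, lag_coefs=[0.36025],
    s2=0.00008311488, v2=0.00027942647, c=c, L=4)
```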

    20.3.5 LONG-MEMORY MODELS

    The autocorrelations of an integrated series [I(1) or I(2)] display the characteristic pattern shown in Table 20.3 for the log of the GNP deflator. They remain persistently extremely high at long lags. In contrast, the autocorrelations of a stationary process typically decay at an exponential rate, so large values typically cease to appear after only a few lags. (See, e.g., the rightmost panel of Table 20.3.) Some processes appear to behave between these two benchmarks; they are clearly nonstationary, yet when differenced, they appear to show the characteristic alternating positive and negative autocorrelations, still out to long lags, that suggest "overdifferencing." But the undifferenced data show significant autocorrelations out to very long lags. Stock returns [Lo (1991)] and exchange rates [Cheung (1993)] provide some cases that have been studied. In a striking example, Ding, Granger, and Engle (1993) found significant autocorrelations out to lags of well over 2,000 days in the absolute values of daily stock market returns. [See also Granger and Ding (1996).] There is ample evidence of a lack of memory in stock market returns, but a spate of recent evidence, such as this, has been convincing that the volatility (the absolute value resembles the standard deviation) in stock returns has extremely long memory.

    Although it is clear that an extension of the standard models of stationary time series is needed to explain the persistence of the effects of shocks on, for example, GDP and the money stock, and models of unit roots and cointegration (see Section 20.4) do appear to be helpful, there remains something of a statistical balancing act in their construction. If "the root" differs from one in either direction, then an altogether different set of statistical tools is called for. For models of very long term autocorrelation, which likewise reflect persistent response to shocks, models of long-term memory have provided a very useful extension of the concept of nonstationarity.23 The basic building block in this class of models is the fractionally integrated white noise series,

    $$(1 - L)^d y_t = \varepsilon_t.$$

    This time series has an infinite moving-average representation if $|d| < \frac{1}{2}$, and in that range it is stationary. For $0 < d < \frac{1}{2}$, however, the sequence of autocorrelations, $\rho_k = \lambda_k/\lambda_0$, is not absolutely summable. For this simple model,

    $$\rho_k = \frac{\Gamma(k + d)\,\Gamma(1 - d)}{\Gamma(k - d + 1)\,\Gamma(d)}.$$

    The first 50 values of $\rho_k$ are shown in Figure 20.15 for d = 0.1, 0.25, 0.4, and 0.475. The Ding, Granger, and Engle computations display a pattern similar to that shown for 0.25 in the figure. [See Granger and Ding (1996, p. 66).] The natural extension of the model that allows for more intricate patterns in the data is the autoregressive, fractionally integrated, moving-average, or ARFIMA(p, d, q) model,

    $$(1 - L)^d[y_t - \gamma_1 y_{t-1} - \cdots - \gamma_p y_{t-p}] = \varepsilon_t - \theta_1\varepsilon_{t-1} - \cdots - \theta_q\varepsilon_{t-q}, \quad y_t = Y_t - \mu.$$

    23 These models appear to have originated in the physical sciences early in the 1950s, especially with Hurst (1951), whose name is given to the effect of very long term autocorrelation in observed time series. The pioneering work in econometrics is that of Taqqu (1975), Granger and Joyeux (1980), Granger (1981), Hosking (1981), and Geweke and Porter-Hudak (1983). An extremely thorough summary and an extensive bibliography are given in Baillie (1996).

    FIGURE 20.15 Autocorrelations for a Fractionally Integrated Time Series. [Plotted series: R1, R25, R4, R475. Horizontal axis: k, 0 to 50. Vertical axis: 0.0 to 1.0.]
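    The autocorrelations of the fractionally integrated white noise series can be tabulated from the gamma-function expression for the simple model. The following sketch is an illustration under stated assumptions: the function name `frac_autocorr` is hypothetical, d is assumed to lie strictly between 0 and 1/2, and log-gamma values from the Python standard library are used to avoid overflow at large k.

```python
import math

def frac_autocorr(k, d):
    """rho_k = Gamma(k+d) Gamma(1-d) / [Gamma(k-d+1) Gamma(d)], 0 < d < 1/2."""
    log_rho = (math.lgamma(k + d) + math.lgamma(1.0 - d)
               - math.lgamma(k - d + 1.0) - math.lgamma(d))
    return math.exp(log_rho)

# First 50 autocorrelations for the four values of d discussed in the text
table = {d: [frac_autocorr(k, d) for k in range(1, 51)]
         for d in (0.1, 0.25, 0.4, 0.475)}
```

    A quick check on the formula is that it reduces to d/(1 - d) at k = 1, so the first autocorrelation for d = 0.25 is one third.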

    Estimation of ARFIMA models is discussed in Baillie (1996) and the references cited there. A test for fractional integration effects is suggested by Geweke and Porter-Hudak (1983). The test is based on the slope in the linear regression of the logs of the first n(T) values from the sample periodogram of $y_t$, that is, $z_k = \log h_Y(\omega_k)$, on the corresponding functions of the first n(T) frequencies, $x_k = \log\{4\sin^2(\omega_k/2)\}$. Here n(T) is taken to be reasonably small; Geweke and Porter-Hudak suggest $n(T) = \sqrt{T}$. A conventional t test of the hypothesis that the slope equals zero is used to test the hypothesis.
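    The log-periodogram regression just described can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, NumPy is assumed, the periodogram is scaled by 1/(2 pi T) (one common convention; the scaling affects only the intercept), and d is reported as the negative of the estimated slope, which is the usual Geweke and Porter-Hudak convention rather than something stated in the text.

```python
import numpy as np

def periodogram(y, n):
    """First n periodogram ordinates at frequencies 2*pi*k/T, k = 1..n."""
    y = np.asarray(y, dtype=float)
    T = y.size
    omega = 2.0 * np.pi * np.arange(1, n + 1) / T
    dft = np.fft.fft(y - y.mean())[1:n + 1]
    return omega, np.abs(dft) ** 2 / (2.0 * np.pi * T)

def gph_estimate(omega, pgram):
    """Regress log periodogram on log(4 sin^2(omega/2)); return (d_hat, se)."""
    z = np.log(pgram)
    x = np.log(4.0 * np.sin(omega / 2.0) ** 2)
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ coef
    s2 = resid @ resid / (len(z) - 2)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return -coef[1], se   # d is minus the slope
```

    In use, one would call `periodogram(y, n)` with n near the square root of T and pass the result to `gph_estimate`; the ratio of the estimate to its standard error is the t statistic referred to in the text.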
    Example 20.7 Long-Term Memory in the Growth of Real GNP

    For the real GDP series analyzed in Example 20.6, we analyze the subseries 1950.3 to 1983.4, with T = 135, so we take n(T) = 12. The frequencies used for the periodogram are $2\pi k/135$, k = 1, ..., 12. The first 12 values from the periodogram are 0.05104, 0.4322, 0.7227, 0.3659, 1.353, 1.257, 0.05533, 1.388, 0.5955, 0.2043, 0.3040, 0.6381. The linear regression produces an estimate of d of 0.2505 with a standard error of 0.225. Thus, the hypothesis that d equals zero cannot be rejected. This result is not surprising; the first seven autocorrelations of the series are 0.491, 0.281, 0.044, 0.076, 0.120, 0.052, and 0.018. They are trivial thereafter, suggesting that the series is, in fact, stationary. This assumption, in itself, creates something of an ambiguity. The log of the real GNP series does indeed appear to be I(1). But the price level used to compute real GNP is fairly convincingly I(2), or at least I(1 + d) for some d greater than zero, judging from Figure 20.7. As such, the log of real GNP is the log of a variable that is probably at least I(1 + d). Although received results are not definitive, this result is probably not I(1).

    Models of long-term memory have been extended in many directions, and the results have been fully integrated with the unit root platform discussed earlier. Baillie's survey details many of the recently developed methods.

    Example 20.8 Long-Term Memory in Foreign Exchange Markets

    Cheung (1993) applied the long-term memory model to a study of end of week exchange rates for 16 years, 1974 to 1989. The time series studied were the dollar spot rates of the British pound (BP), Deutsche mark (DM), Swiss franc (SF), French franc (FF), and Japanese yen (JY). Testing and estimation were done using the 1974 to 1987 data. The final 2 years of the sample were held out for out of sample forecasting. Data were analyzed in the form of first differences of the logs so that observations are week-to-week percentage changes. Plots of the data did not suggest any obvious deviation from stationarity. As an initial assessment, the undifferenced data were subjected to augmented Dickey-Fuller tests for unit roots and the hypothesis could not be rejected. Thus, analysis proceeded using the first differences of the logs. The GPH test using $n(T) = \sqrt{T}$ for long memory in the first differences produced the following estimates of d, with estimated "p values" in parentheses. (The p value is the standard normal probability that N[0, 1] is greater than or equal to the ratio of the estimated d to its estimated standard error. These tests are one-sided tests. Values less than 0.05 indicate statistical significance by the usual conventions.)

    Currency   BP       DM       SF       JY       FF
    d          0.1869   0.2943   0.2870   0.2907   0.4240
    p value    0.106    0.025    0.028    0.026    0.003

    The unit root hypothesis is rejected in favor of the long memory model in four of the five cases. The author proceeded to estimate ARFIMA(p, d, q) models. The coefficients of the ARFIMA models (d is recomputed) are small in all cases save for the French franc, for which the estimated model is

    $$(1 - L)^{0.3664}[(FF_t - \overline{FF}) - 0.4776(FF_{t-1} - \overline{FF}) - 0.1227(FF_{t-2} - \overline{FF})] = e_t + 0.8657e_{t-1}.$$

    20.4 COINTEGRATION

    Studies in empirical macroeconomics almost always involve nonstationary and trending variables, such as income, consumption, money demand, the price level, trade flows, and exchange rates. Accumulated wisdom and the results of the previous sections suggest that the appropriate way to manipulate such series is to use differencing and other transformations (such as seasonal adjustment) to reduce them to stationarity and then to analyze the resulting series as VARs or with the methods of Box and Jenkins. But recent research and a growing literature has shown that there are more interesting and appropriate ways to analyze trending variables.

    In the fully specified regression model

    $$y_t = \beta x_t + \varepsilon_t,$$

    there is a presumption that the disturbances $\varepsilon_t$ are a stationary, white noise series.24 But this presumption is unlikely to be true if $y_t$ and $x_t$ are integrated series. Generally, if two series are integrated to different orders, then linear combinations of them will be integrated to the higher of the two orders. Thus, if $y_t$ and $x_t$ are I(1), that is, if both are trending variables, then we would normally expect $y_t - \beta x_t$ to be I(1) regardless of the value of $\beta$, not I(0) (i.e., not stationary). If $y_t$ and $x_t$ are each drifting upward with their own trend, then unless there is some relationship between those trends, the difference between them should also be growing, with yet another trend. There must be some kind of inconsistency in the model.

    On the other hand, if the two series are both I(1), then there may be a $\beta$ such that

    $$\varepsilon_t = y_t - \beta x_t$$

    is I(0). Intuitively, if the two series are both I(1), then this partial difference between them might be stable around a fixed mean. The implication would be that the series are drifting together at roughly the same rate. Two series that satisfy this requirement are said to be cointegrated, and the vector $[1, -\beta]$ (or any multiple of it) is a cointegrating vector. In such a case, we can distinguish between a long-run relationship between $y_t$ and $x_t$, that is, the manner in which the two variables drift upward together, and the short-run dynamics, that is, the relationship between deviations of $y_t$ from its long-run trend and deviations of $x_t$ from its long-run trend. If this is the case, then differencing of the data would be counterproductive, since it would obscure the long-run relationship between $y_t$ and $x_t$. Studies of cointegration and a related technique, error correction, are concerned with methods of estimation that preserve the information about both forms of covariation.25

    24 If there is autocorrelation in the model, then it has been removed through an appropriate transformation.
    Example 20.9 Cointegration in Consumption and Output

    Consumption and income provide one of the more familiar examples of the phenomenon described above. The logs of GDP and consumption for 1950.1 to 2000.4 are plotted in Figure 20.16. Both variables are obviously nonstationary. We have already verified that there is a unit root in the income data. We leave as an exercise for the reader to verify that the consumption variable is likewise I(1). Nonetheless, there is a clear relationship between consumption and output. To see where this discussion of relationships among variables is going, consider a simple regression of the log of consumption on the log of income, where both variables are manipulated in mean deviation form (so, the regression includes a constant). The slope in that regression is 1.056765. The residuals from the regression, $u_t = [\ln Cons^*, \ln GDP^*][1, -1.056765]'$ (where the "*" indicates mean deviations), are plotted in Figure 20.17. The trend is clearly absent from the residuals. But it remains to verify whether the series of residuals is stationary. In the ADF regression of the least squares residuals on a constant (random walk with drift), the lagged value, and the lagged first difference, the coefficient on $u_{t-1}$ is 0.838488 (0.0370205) and that on $u_{t-1} - u_{t-2}$ is -0.098522. (The constant differs trivially from zero because two observations are lost in computing the ADF regression.) With 202 observations, we find $DF_\tau$ = -4.63 and $DF_\gamma$ = -29.55. Both are well below the critical values, which suggests that the residual series does not contain a unit root. We conclude (at least it appears so) that even after accounting for the trend, although neither of the original variables is stationary, there is a linear combination of them that is. If this conclusion holds up after a more formal treatment of the testing procedure, we will state that log GDP and log consumption are cointegrated.
    Example 20.10 Several Cointegrated Series

    The theory of purchasing power parity specifies that in long-run equilibrium, exchange rates will adjust to erase differences in purchasing power across different economies. Thus, if $p_1$ and $p_0$ are the price levels in two countries and E is the exchange rate between the two currencies, then in equilibrium,

    $$v_t = E_t\frac{p_{1t}}{p_{0t}} = \mu, \text{ a constant.}$$

    25 See, for example, Engle and Granger (1987) and the lengthy literature cited in Hamilton (1994). A survey paper on VARs and cointegration is Watson (1994).

    FIGURE 20.16 Logs of Consumption and GDP. [Panel title: Cointegrated Variables: Logs of GDP and Consumption. Plotted series: LOGGDP, LOGCONS. Vertical axis: Logs. Horizontal axis: Quarter, 1949 to 2001.]

    FIGURE 20.17 Regression Residuals. [Panel title: Residuals from Consumption-Income Regression. Vertical axis: Residual. Horizontal axis: Quarter, 1950 to 2002.]

    The price levels in any two countries are likely to be strongly trended. But allowing for short-term deviations from equilibrium, the theory suggests that for a particular $\beta = (\ln\mu, -1, 1)$, in the model

    $$\ln E_t = \beta_1 + \beta_2\ln p_{1t} + \beta_3\ln p_{0t} + \varepsilon_t,$$

    $\varepsilon_t = \ln v_t$ would be a stationary series, which would imply that the logs of the three variables in the model are cointegrated.

    We suppose that the model involves M variables, $y_t = [y_{1t}, \ldots, y_{Mt}]'$, which individually may be I(0) or I(1), and a long-run equilibrium relationship,

    $$y_t'\gamma - x_t'\beta = 0.$$

    The "regressors" may include a constant, exogenous variables assumed to be I(0), and/or a time trend. The vector of parameters $\gamma$ is the cointegrating vector. In the short run, the system may deviate from its equilibrium, so the relationship is rewritten as

    $$y_t'\gamma - x_t'\beta = \varepsilon_t,$$

    where the equilibrium error $\varepsilon_t$ must be a stationary series. In fact, since there are M variables in the system, at least in principle, there could be more than one cointegrating vector. In a system of M variables, there can only be up to M - 1 linearly independent cointegrating vectors. A proof of this proposition is very simple, but useful at this point.

    Proof: Suppose that $\gamma_i$ is a cointegrating vector and that there are M linearly independent cointegrating vectors. Then, neglecting $x_t'\beta$ for the moment, for every $\gamma_i$, $y_t'\gamma_i$ is a stationary series $\nu_{ti}$. Any linear combination of a set of stationary series is stationary, so it follows that every linear combination of the cointegrating vectors is also a cointegrating vector. If there are M such $M \times 1$ linearly independent vectors, then they form a basis for the M-dimensional space, so any $M \times 1$ vector can be formed from these cointegrating vectors, including the columns of an $M \times M$ identity matrix. Thus, the first column of an identity matrix would be a cointegrating vector, or $y_{t1}$ is I(0). This result is a contradiction, since we are allowing $y_{t1}$ to be I(1). It follows that there can be at most M - 1 cointegrating vectors.

    The number of linearly independent cointegrating vectors that exist in the equilibrium system is called its cointegrating rank. The cointegrating rank may range from 1 to M - 1. If it exceeds one, then we will encounter an interesting identification problem. As a consequence of the observation in the preceding proof, we have the unfortunate result that, in general, if the cointegrating rank of a system exceeds one, then without out-of-sample, exact information, it is not possible to estimate behavioral relationships as cointegrating vectors. Enders (1995) provides a useful example.
    Example 20.11 Multiple Cointegrating Vectors

    We consider the logs of four variables, money demand m, the price level p, real income y, and an interest rate r. The basic relationship is

    $$m = \gamma_0 + \gamma_1 p + \gamma_2 y + \gamma_3 r + \varepsilon.$$

    The price level and real income are assumed to be I(1). The existence of long-run equilibrium in the money market implies a cointegrating vector $\boldsymbol\gamma_1$. If the Fed follows a certain feedback rule, increasing the money stock when nominal income (y + p) is low and decreasing it when nominal income is high (which might make more sense in terms of rates of growth), then there is a second cointegrating vector in which $\gamma_1 = \gamma_2$ and $\gamma_3 = 0$. Suppose that we label this vector $\boldsymbol\gamma_2$. The parameters in the money demand equation, notably the interest elasticity, are interesting quantities, and we might seek to estimate $\boldsymbol\gamma_1$ to learn the value of this quantity. But since every linear combination of $\boldsymbol\gamma_1$ and $\boldsymbol\gamma_2$ is a cointegrating vector, to this point we are only able to estimate a hash of the two cointegrating vectors.

    In fact, the parameters of this model are identifiable from sample information (in principle). We have specified two cointegrating vectors,

    $$\boldsymbol\gamma_1 = [1, \gamma_{10}, \gamma_{11}, \gamma_{12}, \gamma_{13}]$$

    and

    $$\boldsymbol\gamma_2 = [1, \gamma_{20}, \gamma_{21}, \gamma_{21}, 0].$$

    Although it is true that every linear combination of $\boldsymbol\gamma_1$ and $\boldsymbol\gamma_2$ is a cointegrating vector, only the original two vectors, as they are, have ones in the first position of both and a 0 in the last position of the second. (The equality restriction actually overidentifies the parameter matrix.) This result is, of course, exactly the sort of analysis that we used in establishing the identifiability of a simultaneous-equations system.
    20.4.1 COMMON TRENDS

    If two I(1) variables are cointegrated, then some linear combination of them is I(0). Intuition should suggest that the linear combination does not mysteriously create a well-behaved new variable; rather, something present in the original variables must be missing from the aggregated one. Consider an example. Suppose that two I(1) variables have a linear trend,

    $$y_{1t} = \alpha + \beta t + u_t,$$
    $$y_{2t} = \gamma + \delta t + v_t,$$

    where $u_t$ and $v_t$ are white noise. A linear combination of $y_{1t}$ and $y_{2t}$ with vector $(1, \theta)$ produces the new variable,

    $$z_t = (\alpha + \theta\gamma) + (\beta + \theta\delta)t + u_t + \theta v_t,$$

    which, in general, is still I(1). In fact, the only way the $z_t$ series can be made stationary is if $\theta = -\beta/\delta$. If so, then the effect of combining the two variables linearly is to remove the common linear trend, which is the basis of Stock and Watson's (1988) analysis of the problem. But their observation goes an important step beyond this one. The only way that $y_{1t}$ and $y_{2t}$ can be cointegrated to begin with is if they have a common trend of some sort. To continue, suppose that instead of the linear trend t, the terms on the right-hand side, $y_1$ and $y_2$, are functions of a random walk, $w_t = w_{t-1} + \eta_t$, where $\eta_t$ is white noise. The analysis is identical. But now suppose that each variable $y_{it}$ has its own random walk component $w_{it}$, i = 1, 2. Any linear combination of $y_{1t}$ and $y_{2t}$ must involve both random walks. It is clear that they cannot be cointegrated unless, in fact, $w_{1t} = w_{2t}$. That is, once again, they must have a common trend. Finally, suppose that $y_{1t}$ and $y_{2t}$ share two common trends,

    $$y_{1t} = \alpha + \beta t + \lambda w_t + u_t,$$
    $$y_{2t} = \gamma + \delta t + \pi w_t + v_t.$$

    We place no restriction on $\lambda$ and $\pi$. Then, a bit of manipulation will show that it is not possible to find a linear combination of $y_{1t}$ and $y_{2t}$ that is stationary, even though they share common trends. The end result for this example is that if $y_{1t}$ and $y_{2t}$ are cointegrated, then they must share exactly one common trend.

    As Stock and Watson determined, the preceding is the crux of the cointegration of economic variables. A set of M variables that are cointegrated can be written as a stationary component plus linear combinations of a smaller set of common trends. If the cointegrating rank of the system is r, then there can be up to M - r linear trends and M - r common random walks. [See Hamilton (1994, p. 578).] (The two-variable case is special. In a two-variable system, there can be only one common trend in total.) The effect of the cointegration is to purge these common trends from the resultant variables.
    20.4.2 ERROR CORRECTION AND VAR REPRESENTATIONS

    Suppose that the two I(1) variables $y_t$ and $z_t$ are cointegrated and that the cointegrating vector is $[1, -\theta]$. Then all three variables, $\Delta y_t = y_t - y_{t-1}$, $\Delta z_t$, and $(y_t - \theta z_t)$ are I(0). The error correction model

    $$\Delta y_t = x_t'\beta + \gamma(\Delta z_t) + \lambda(y_{t-1} - \theta z_{t-1}) + \varepsilon_t$$

    describes the variation in $y_t$ around its long-run trend in terms of a set of I(0) exogenous factors $x_t$, the variation of $z_t$ around its long-run trend, and the error correction $(y_t - \theta z_t)$, which is the equilibrium error in the model of cointegration. There is a tight connection between models of cointegration and models of error correction. The model in this form is reasonable as it stands, but in fact, it is only internally consistent if the two variables are cointegrated. If not, then the third term, and hence the right-hand side, cannot be I(0), even though the left-hand side must be. The upshot is that the same assumption that we make to produce the cointegration implies (and is implied by) the existence of an error correction model.26 As we will examine in the next section, the utility of this representation is that it suggests a way to build an elaborate model of the long-run variation in $y_t$ as well as a test for cointegration. Looking ahead, the preceding suggests that residuals from an estimated cointegration model, that is, estimated equilibrium errors, can be included in an elaborate model of the long-run covariation of $y_t$ and $z_t$. Once again, we have the foundation of Engle and Granger's approach to analyzing cointegration.

    Consider the VAR representation of the model

    $$\mathbf{y}_t = \Gamma\mathbf{y}_{t-1} + \boldsymbol\varepsilon_t,$$

    where the vector $\mathbf{y}_t$ is $[y_t, z_t]'$. Now take first differences to obtain

    $$\mathbf{y}_t - \mathbf{y}_{t-1} = (\Gamma - I)\mathbf{y}_{t-1} + \boldsymbol\varepsilon_t,$$

    or

    $$\Delta\mathbf{y}_t = \Pi\mathbf{y}_{t-1} + \boldsymbol\varepsilon_t.$$

    If all variables are I(1), then all M variables on the left-hand side are I(0). Whether those on the right-hand side are I(0) remains to be seen. The matrix $\Pi$ produces linear combinations of the variables in $\mathbf{y}_t$. But as we have seen, not all linear combinations can be cointegrated. The number of such independent linear combinations is r < M. Therefore, although there must be a VAR representation of the model, cointegration implies a restriction on the rank of $\Pi$. It cannot have full rank; its rank is r. From another viewpoint, a different approach to discerning cointegration is suggested. Suppose that we estimate this model as an unrestricted VAR. The resultant coefficient matrix should be short-ranked. The implication is that if we fit the VAR model and impose short rank on the coefficient matrix as a restriction (how we could do that remains to be seen), then if the variables really are cointegrated, this restriction should not lead to a loss of fit. This implication is the basis of Johansen's (1988) and Stock and Watson's (1988) analysis of cointegration.

    26 The result in its general form is known as the Granger representation theorem. See Hamilton (1994, p. 582).
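    The rank restriction can be seen in a small numerical case. The sketch below is only an illustration: it assumes a two-variable VAR(1) whose matrix $\Pi$ is built as the outer product of a hypothetical vector of adjustment coefficients (called `alpha` here, a name not used in the text) and the cointegrating vector $[1, -\theta]$, with arbitrary illustrative values.

```python
import numpy as np

theta = 2.0
beta = np.array([[1.0], [-theta]])   # cointegrating vector [1, -theta]'
alpha = np.array([[-0.5], [0.1]])    # hypothetical adjustment coefficients

Pi = alpha @ beta.T                  # Pi = Gamma - I, built to have rank 1
Gamma = np.eye(2) + Pi               # implied VAR(1) coefficient matrix

rank_Pi = np.linalg.matrix_rank(Pi)
roots = np.sort(np.linalg.eigvals(Gamma).real)
```

    In this construction $\Pi$ has rank r = 1 < M = 2, and $\Gamma$ has one characteristic root equal to one (the common trend) and one root inside the unit circle (the stationary equilibrium error).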
    20.4.3 TESTING FOR COINTEGRATION

    A natural first step in the analysis of cointegration is to establish that it is indeed a characteristic of the data. Two broad approaches for testing for cointegration have been developed. The Engle and Granger (1987) method is based on assessing whether single-equation estimates of the equilibrium errors appear to be stationary. The second approach, due to Johansen (1988, 1991) and Stock and Watson (1988), is based on the VAR approach. As noted earlier, if a set of variables is truly cointegrated, then we should be able to detect the implied restrictions in an otherwise unrestricted VAR. We will examine these two methods in turn.

    Let $\mathbf{y}_t$ denote the set of M variables that are believed to be cointegrated. Step one of either analysis is to establish that the variables are indeed integrated to the same order. The Dickey-Fuller tests discussed in Section 20.3.4 can be used for this purpose. If the evidence suggests that the variables are integrated to different orders or not at all, then the specification of the model should be reconsidered.

    If the cointegration rank of the system is r, then there are r independent vectors, $\gamma_i = [1, -\theta_i]$, where each vector is distinguished by being normalized on a different variable. If we suppose that there are also a set of I(0) exogenous variables, including a constant, in the model, then each cointegrating vector produces the equilibrium relationship

    $$\mathbf{y}_t'\gamma_i = x_t'\beta + \varepsilon_t,$$

    which we may rewrite as

    $$y_{it} = Y_{it}'\theta_i + x_t'\beta + \varepsilon_t.$$

    We can obtain estimates of $\theta_i$ by least squares regression. If the theory is correct and if this OLS estimator is consistent, then residuals from this regression should estimate the equilibrium errors. There are two obstacles to consistency. First, since both sides of the equation contain I(1) variables, the problem of spurious regressions appears. Second, a moment's thought should suggest that what we have done is extract an equation from an otherwise ordinary simultaneous-equations model and propose to estimate its parameters by ordinary least squares. As we examined in Chapter 15, consistency is unlikely in that case. It is one of the extraordinary results of this body of theory that in this setting, neither of these considerations is a problem. In fact, as shown by a number of authors [see, e.g., Davidson and MacKinnon (1993)], not only is $c_i$, the OLS estimator of $\theta_i$, consistent, it is superconsistent in that its asymptotic variance is "$O(1/T^2)$" rather than $O(1/T)$ as in the usual case. Consequently, the problem of spurious regressions disappears as well. Therefore, the next step is to estimate the cointegrating vector(s), by OLS. Under all the assumptions thus far, the residuals from these regressions, $e_{it}$, are estimates of the equilibrium errors, $\varepsilon_{it}$. As such, they should be I(0). The natural approach would be to apply the familiar Dickey-Fuller tests to these residuals. The logic is sound, but the Dickey-Fuller tables are inappropriate for these estimated errors. Estimates of the appropriate critical values for the tests are given by Engle and Granger (1987), Engle and Yoo (1987), Phillips and Ouliaris (1990), and Davidson and MacKinnon (1993). If autocorrelation in the equilibrium errors is suspected, then an augmented Engle and Granger test can be based on the template

    $$\Delta e_{it} = \delta e_{i,t-1} + \phi_1(\Delta e_{i,t-1}) + \cdots + u_t.$$

    If the null hypothesis that $\delta = 0$ cannot be rejected (against the alternative $\delta < 0$), then we conclude that the variables are not cointegrated. (Cointegration can be rejected by this method. Failing to reject does not confirm it, of course. But having failed to reject the presence of cointegration, we will proceed as if our finding had been affirmative.)
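    The two-step logic of the residual-based test can be sketched with ordinary least squares alone. The code below is a minimal illustration under stated assumptions: the function name `engle_granger_tstat` is hypothetical, NumPy is assumed, the cointegrating regression contains a constant and a single regressor, and the function returns only the t ratio on the lagged residual. As the text stresses, that ratio must be compared with the special critical values cited above, not with the ordinary Dickey-Fuller tables, so no critical values are built in.

```python
import numpy as np

def engle_granger_tstat(y, x, lags=1):
    """OLS cointegrating regression of y on [1, x], then the augmented
    regression of the differenced residual on its lagged level and
    `lags` lagged differences. Returns (b, delta_hat, t_ratio)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b                          # estimated equilibrium errors
    de = np.diff(e)                        # de[i] = e[i+1] - e[i]
    lhs = de[lags:]                        # change in e at time t
    cols = [e[lags:-1]]                    # e at time t-1
    for j in range(1, lags + 1):
        cols.append(de[lags - j:len(de) - j])   # lagged changes
    W = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(W, lhs, rcond=None)
    resid = lhs - W @ coef
    s2 = resid @ resid / (len(lhs) - W.shape[1])
    cov = s2 * np.linalg.inv(W.T @ W)
    return b, coef[0], coef[0] / np.sqrt(cov[0, 0])

# Simulated cointegrated pair: x is a random walk, y = 1 + 2x + noise
rng = np.random.default_rng(0)
x_sim = np.cumsum(rng.normal(size=300))
y_sim = 1.0 + 2.0 * x_sim + rng.normal(size=300)
b_hat, delta_hat, t_ratio = engle_granger_tstat(y_sim, x_sim, lags=1)
```

    In the simulated case the slope estimate sits very close to its true value, which is the superconsistency property at work, and the t ratio on the lagged residual is strongly negative.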
    Example 20.9 (Continued) Consumption and Output

    In the example presented at the beginning of this discussion, we proposed precisely the sort of test suggested by Phillips and Ouliaris (1990) to determine if (log) consumption and (log) GDP are cointegrated. As noted, the logic of our approach is sound, but a few considerations remain. The Dickey-Fuller critical values suggested for the test are appropriate only in a few cases, and not when several trending variables appear in the equation. For the case of only a pair of trended variables, as we have here, one may use infinite sample values in the Dickey-Fuller tables for the trend stationary form of the equation. (The drift and trend would have been removed from the residuals by the original regression, which would have these terms either embedded in the variables or explicitly in the equation.) Finally, there remains an issue of how many lagged differences to include in the ADF regression. We have specified one, though further analysis might be called for. (A lengthy discussion of this set of issues appears in Hayashi (2000, pp. 645-648).) Thus, but for the possibility of this specification issue, the ADF approach suggested in the introduction does pass muster. The sample value found earlier was -4.63. The critical values from the table are -3.45 for 5 percent and -3.67 for 2.5 percent. Thus, we conclude (as have many other analysts) that log consumption and log GDP are cointegrated.

    The Johansen (1988, 1992) and Stock and Watson (1988) methods are similar, so we will describe only the first one. The theory is beyond the scope of this text, although the operational details are suggestive. To carry out the Johansen test, we first formulate the VAR

    $y_t = \Gamma_1 y_{t-1} + \Gamma_2 y_{t-2} + \cdots + \Gamma_p y_{t-p} + \varepsilon_t$.

    The order of the model, $p$, must be determined in advance. Now, let $z_t$ denote the vector of $M(p-1)$ variables,

    $z_t = [\Delta y_{t-1}, \Delta y_{t-2}, \ldots, \Delta y_{t-p+1}]$.

    That is, $z_t$ contains the lags 1 to $p-1$ of the first differences of all $M$ variables. Now, using the $T$ available observations, we obtain two $T \times M$ matrices of least squares residuals:

    $D$ = the residuals in the regressions of $\Delta y_t$ on $z_t$,
    $E$ = the residuals in the regressions of $y_{t-p}$ on $z_t$.


    We now require the $M$ squared canonical correlations between the columns in $D$ and those in $E$. To continue, we will digress briefly to define the canonical correlations. Let $d_1^*$ denote a linear combination of the columns of $D$, and let $e_1^*$ denote the same from $E$. We wish to choose these two linear combinations so as to maximize the correlation between them. This pair of variables are the first canonical variates, and their correlation $r_1^*$ is the first canonical correlation. In the setting of cointegration, this computation has some intuitive appeal. Now, with $d_1^*$ and $e_1^*$ in hand, we seek a second pair of variables $d_2^*$ and $e_2^*$ to maximize their correlation, subject to the constraint that this second variable in each pair be orthogonal to the first. This procedure continues for all $M$ pairs of variables. It turns out that the computation of all these is quite simple. We will not need to compute the coefficient vectors for the linear combinations. The squared canonical correlations are simply the ordered characteristic roots of the matrix

    $R^* = R_{DD}^{-1/2} R_{DE} R_{EE}^{-1} R_{ED} R_{DD}^{-1/2}$,

    where $R_{ij}$ is the (cross-)correlation matrix between variables in set $i$ and set $j$, for $i, j = D, E$. Finally, the null hypothesis that there are $r$ or fewer cointegrating vectors is tested using the test statistic

    TRACE TEST $= -T \sum_{i=r+1}^{M} \ln[1 - (r_i^*)^2]$.

    If the correlations based on actual disturbances had been observed instead of estimated, then we would refer this statistic to the chi-squared distribution with $M - r$ degrees of freedom. Alternative sets of appropriate tables are given by Johansen and Juselius (1990) and Osterwald-Lenum (1992). Large values give evidence against the hypothesis of $r$ or fewer cointegrating vectors.
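    The operational steps just listed (two sets of residuals, the squared canonical correlations, and the trace statistic) can be sketched as follows. This is a minimal illustration, not a replacement for a tested routine. It assumes no constant or trend in the VAR, it uses the product-moment matrices of the residuals in place of the correlation matrices named in the text, and the function names and the simulated data are hypothetical. The statistics must be referred to the tables cited above, not to the chi-squared distribution.

```python
import numpy as np

def _residuals(A, Z):
    """Least squares residuals from regressing each column of A on Z."""
    if Z.shape[1] == 0:                 # p = 1: no lagged differences to partial out
        return A
    coef, *_ = np.linalg.lstsq(Z, A, rcond=None)
    return A - Z @ coef

def johansen_trace(Y, p):
    """Return (squared canonical correlations, trace statistics for r = 0..M-1)."""
    Y = np.asarray(Y, dtype=float)
    n, M = Y.shape
    T = n - p                                        # usable observations
    dY = np.diff(Y, axis=0)                          # dY[k] = Y[k+1] - Y[k]
    dy_t = dY[p - 1:]                                # first difference at time t
    lagged = [dY[p - 1 - j:n - 1 - j] for j in range(1, p)]
    Z = np.column_stack(lagged) if lagged else np.empty((T, 0))
    y_lag = Y[:T]                                    # level at time t - p

    D = _residuals(dy_t, Z)
    E = _residuals(y_lag, Z)
    Sdd, See, Sde = D.T @ D / T, E.T @ E / T, D.T @ E / T
    # The roots of Sdd^{-1} Sde See^{-1} Sed equal those of the symmetric form.
    A = np.linalg.solve(Sdd, Sde) @ np.linalg.solve(See, Sde.T)
    r2 = np.sort(np.linalg.eigvals(A).real)[::-1]    # ordered, largest first
    r2 = np.clip(r2, 0.0, 1.0 - 1e-12)               # guard against rounding
    trace = np.array([-T * np.sum(np.log(1.0 - r2[r:])) for r in range(M)])
    return r2, trace

# Simulated (hypothetical) data: two I(1) series sharing one common trend.
rng = np.random.default_rng(0)
n = 400
w = np.cumsum(rng.normal(size=n))
Y = np.column_stack([w + rng.normal(size=n), 0.5 * w + rng.normal(size=n)])
r2, trace = johansen_trace(Y, p=2)
for r, stat in enumerate(trace):
    print(f"H0: at most {r} cointegrating vector(s): trace = {stat:.2f}")
```

    The entry `trace[r]` sums over the roots numbered r + 1 through M, so the statistics necessarily decline as r increases.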
    20.4.4 ESTIMATING COINTEGRATION RELATIONSHIPS

    Both of the testing procedures discussed above involve actually estimating the cointegrating vectors, so this additional section is actually superfluous. In the Engle and Granger framework, at a second step after the cointegration test, we can use the residuals from the static regression as an error correction term in a dynamic, first-difference regression, as shown in Section 20.4.2. One can then "test down" to find a satisfactory structure. In the Johansen test shown earlier, the characteristic vectors corresponding to the canonical correlations are the sample estimates of the cointegrating vectors. Once again, computation of an error correction model based on these first step results is a natural next step. We will explore these in an application.
    20.4.5 APPLICATION: GERMAN MONEY DEMAND

    The demand for money has provided a convenient and well targeted illustration of methods of cointegration analysis. The central equation of the model is

    $m_t - p_t = \mu + \beta y_t + \gamma i_t + \varepsilon_t$    (20-25)

    where $m_t$, $p_t$, and $y_t$ are the logs of nominal money demand, the price level, and output, and $i$ is the nominal interest rate (not the log of). The equation involves trending variables ($m_t$, $p_t$, $y_t$) and one which we found earlier appears to be a random walk with


    drift ($i_t$). As such, the usual form of statistical inference for estimation of the income elasticity and interest semielasticity based on stationary data is likely to be misleading. Beyer (1998) analyzed the demand for money in Germany over the period 1975 to 1994. A central focus of the study was whether the 1990 reunification produced a structural break in the long-run demand function. (The analysis extended an earlier study by the same author that was based on data which predated the reunification.) One of the interesting questions pursued in this literature concerns the stability of the long-term demand equation,

    $(m - p)_t - y_t = \mu + \gamma i_t + \varepsilon_t$.    (20-26)

    The left-hand side is the log of the inverse of the velocity of money, as suggested by Lucas (1988). An issue to be confronted in this specification is the exogeneity of the interest variable; exogeneity [in the Engle, Hendry, and Richard (1993) sense] of income is moot in the long-run equation as its coefficient is assumed (per Lucas) to equal one. Beyer explored this latter issue in the framework developed by Engle et al. (see Section 19.6.4) and through the Granger causality testing methods discussed in Section 19.6.5. The analytical platform of Beyer's study is a long-run function for the real money stock M3 (we adopt the author's notation)

    $(m - p)^* = \delta_0 + \delta_1 y + \delta_2 RS + \delta_3 RL + \delta_4 \Delta_4 p$    (20-27)

    where $RS$ is a short-term interest rate, $RL$ is a long-term interest rate, and $\Delta_4 p$ is the annual inflation rate (the data are quarterly). The first step is an examination of the data. Augmented Dickey-Fuller tests suggest that for these German data in this period, $m_t$ and $p_t$ are $I(2)$ while $(m_t - p_t)$, $y_t$, $\Delta_4 p_t$, $RS_t$, and $RL_t$ are all $I(1)$. Some of Beyer's results which produced these conclusions are shown in Table 20.7. Note that though both $m_t$ and $p_t$ appear to be $I(2)$, their simple difference (linear combination) is $I(1)$, that is, integrated to a lower order. That produces the long-run specification given by (20-27). The Lucas specification is layered onto this to produce the model for the long-run velocity

    $(m - p - y)^* = \delta_0^* + \delta_2^* RS + \delta_3^* RL + \delta_4^* \Delta_4 p$.    (20-28)

    TABLE 20.7 Augmented Dickey-Fuller Tests for Variables in the Beyer Model

    Variable               Spec    lag    DF       Crit. Value
    $m$                    TS      0      -1.82    -3.47
    $\Delta m$             RW      4      -1.61    -1.95
    $\Delta^2 m$           RW      3      -6.87    -1.95
    $p$                    TS      4      -2.09    -3.47
    $\Delta p$             RW+D    3      -2.14    -2.90
    $\Delta^2 p$           RW      2      -10.6    -1.95
    $\Delta_4 p$           RW+D    2      -2.66    -2.90
    $\Delta\Delta_4 p$     RW      2      -5.48    -1.95
    $y$                    TS      4      -1.83    -3.47
    $\Delta y$             RW+D    3      -2.91    -2.90
    $RS$                   TS      1      -2.33    -2.90
    $\Delta RS$            RW      0      -5.26    -1.95
    $RL$                   TS      1      -2.40    -2.90
    $\Delta RL$            RW      0      -6.01    -1.95
    $(m - p)$              RW+D    0      -1.65    -3.47
    $\Delta(m - p)$        RW+D    0      -8.50    -2.90

    20.4.5a Cointegration Analysis and a Long-Run Theoretical Model

    In order for (20-27) to be a valid model, there must be at least one cointegrating vector that transforms $z_t = [(m_t - p_t), y_t, RS_t, RL_t, \Delta_4 p_t]$ to stationarity. The Johansen trace test described in Section 20.4.3 was applied to the VAR consisting of these five $I(1)$ variables. A lag length of two was chosen for the analysis. The results of the trace test are a bit ambiguous; the hypothesis that $r = 0$ is rejected, albeit not strongly (sample value = 90.17 against a 95% critical value = 87.31), while the hypothesis that $r \le 1$ is not rejected (sample value = 60.15 against a 95% critical value of 62.99). (These borderline results follow from the result that Beyer's first three eigenvalues, the canonical correlations in the trace test statistic, are nearly equal. Variation in the test statistic results from variation in the correlations.) On this basis, it is concluded that the cointegrating rank equals one. The unrestricted cointegrating vector for the equation, with a time trend added, is found to be

    $(m - p) = 0.936y - 1.780\Delta_4 p + 1.601RS - 3.279RL + 0.002t$.    (20-29)

    (These are the coefficients from the first characteristic vector of the canonical correlation analysis in the Johansen computations detailed in Section 20.4.3.) An exogeneity test [we have not developed this in detail; see Beyer (1998, p. 59), Hendry and Ericsson (1991), and Engle and Hendry (1993)] confirms weak exogeneity of all four right-hand side variables in this specification. The final specification test is for the Lucas formulation and elimination of the time trend, both of which are found to pass, producing the cointegration vector

    $(m - p - y) = -1.832\Delta_4 p + 4.352RS - 10.89RL$.

    The conclusion drawn from the cointegration analysis is that a single equation model for the long-run money demand is appropriate and a valid way to proceed. A last step before this analysis is a series of Granger causality tests for feedback between changes in the money stock and the four right-hand side variables in (20-29) (not including the trend). (See Section 19.6.5.) The test results are generally favorable, with some mixed results for exogeneity of GDP.

    20.4.5b Testing for Model Instability

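    Let $z_t = [(m_t - p_t), y_t, \Delta_4 p_t, RS_t, RL_t]$ and let $Z^0_{t-1}$ denote the entire history of $z_t$ up to the previous period. The joint distribution for $z_t$, conditioned on $Z^0_{t-1}$ and a set of parameters $\Psi$, factors one level further into

    $f(z_t \mid Z^0_{t-1}, \Psi) = f[(m - p)_t \mid y_t, \Delta_4 p_t, RS_t, RL_t, Z^0_{t-1}, \Psi_1] \times g(y_t, \Delta_4 p_t, RS_t, RL_t \mid Z^0_{t-1}, \Psi_2)$.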



    The result of the exogeneity tests carried out earlier implies that the conditional distribution may be analyzed apart from the marginal distribution; that is the implication of the Engle, Hendry, and Richard results noted earlier. Note the partitioning of the parameter vector. Thus, the conditional model is represented by an error correction form that explains $\Delta(m - p)_t$ in terms of its own lags, the error correction term, and contemporaneous and lagged changes in the (now established) weakly exogenous


    variables, as well as other terms such as a constant term, trend, and certain dummy variables which pick up particular events. The error correction model specified is

    $\Delta(m - p)_t = \sum_{i=1}^{4} c_i \Delta(m - p)_{t-i} + \sum_{i=0}^{4} d_{1,i} \Delta(\Delta_4 p_{t-i}) + \sum_{i=0}^{4} d_{2,i} \Delta y_{t-i} + \sum_{i=0}^{4} d_{3,i} \Delta RS_{t-i} + \sum_{i=0}^{4} d_{4,i} \Delta RL_{t-i} + \lambda (m - p - y)_{t-1} + \gamma_1 RS_{t-1} + \gamma_2 RL_{t-1} + d_t'\phi + \omega_t$    (20-30)

    where $d_t$ is the set of additional variables, including the constant and five one-period dummy variables that single out specific events such as a currency crisis in September 1992 [Beyer (1998, page 62, fn. 4)]. The model is estimated by least squares, "stepwise simplified and reparameterized." (The number of parameters in the equation is reduced from 32 to 15.^27)

    The estimated form of (20-30) is an autoregressive distributed lag model. We proceed to use the model to solve for the long-run, steady-state growth path of the real money stock, (20-27). The annual growth rates $\Delta_4 m = g_m$, $\Delta_4 p = g_p$, $\Delta_4 y = g_y$ and (assumed) $\Delta_4 RS = g_{RS} = \Delta_4 RL = g_{RL} = 0$ are used for the solution

    $\frac{1}{4}(g_m - g_p) = \frac{c_4}{4}(g_m - g_p) - d_{1,1} g_p + \frac{d_{2,2}}{2} g_y + \gamma_1 RS + \gamma_2 RL + \lambda (m - p - y)$.^28

    This equation is solved for $(m - p)^*$ under the assumption that $g_m = (g_y + g_p)$,

    $(m - p)^* = \hat\delta_0 + \hat\delta_1 g_y + y + \hat\delta_2 \Delta_4 p + \hat\delta_3 RS + \hat\delta_4 RL$.

    Analysis then proceeds based on this estimated long-run relationship. The primary interest of the study is the stability of the demand equation pre- and postunification. A comparison of the parameter estimates from the same set of procedures using the period 1976 to 1989 shows them to be surprisingly similar: $[(1.22 - 3.67g_y), 1, -3.67, 3.67, -6.44]$ for the earlier period and $[(1.25 - 2.09g_y), 1, -3.625, 3.5, -7.25]$ for the later one. This suggests, albeit informally, that the function has not changed (at least by much). A variety of testing procedures for structural break, including the Andrews and Ploberger tests discussed in Section 7.4, led to the conclusion that in spite of the dramatic changes of 1990, the long-run money demand function had not materially changed in the sample period.

    20.5 SUMMARY AND CONCLUSIONS

    This chapter has completed our survey of techniques for the analysis of time-series data. While Chapter 19 was about extensions of regression modeling to time-series settings, most of the results in this chapter focus on the internal structure of the individual time series themselves. We began with the standard models for time-series processes. While
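    27 The equation ultimately used is $\Delta(m_t - p_t) = h[\Delta(m - p)_{t-4}, \Delta\Delta_4 p_t, \Delta_2 y_{t-2}, \Delta RS_{t-1} + \Delta RS_{t-3}, \Delta_2 RL_t, RS_{t-1}, RL_{t-1}, \Delta_4 p_{t-1}, (m - p - y)_{t-1}, d_t]$.

    28 The division of the coefficients is done because the intervening lags do not appear in the estimated equation.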


    the empirical distinction between, say, AR(p) and MA(q) series may seem ad hoc, the Wold decomposition assures that with enough care, a variety of models can be used to analyze a time series. Section 20.2 described what is arguably the fundamental tool of modern macroeconometrics, the tests for nonstationarity. Contemporary econometric analysis of macroeconomic data has added considerable structure and formality to trending variables, which are more common than not in that setting. The variants of the Dickey-Fuller tests for unit roots are an indispensable tool for the analyst of time-series data. Section 20.4 then considered the subject of cointegration. This modeling framework is a distinct extension of the regression modeling where this discussion began. Cointegrated relationships and equilibrium relationships form the basis of the time-series counterpart to regression relationships. But, in this case, it is not the conditional mean as such that is of interest. Here, both the long-run equilibrium and short-run relationships around trends are of interest and are studied in the data.

    Key Terms and Concepts

    Autoregressive integrated moving-average (ARIMA) process; Augmented Dickey-Fuller test; Autocorrelation; Autocorrelation function (ACF); Autocovariance at lag K; Autoregression; Autoregressive form; Autoregressive moving average; Box-Jenkins analysis; Canonical correlation; Characteristic equation; Cointegration; Cointegration rank; Cointegration relationship; Cointegrating vector; Common trend; Correlogram; Covariance stationary; Data generating process (DGP); Dickey-Fuller test; Equilibrium error; Ergodic; Error correction model; Fourier transform; Fractional integration; Frequency domain; Identification; Innovation; Integrated process; Integrated of order one; Invertibility; Lag window; Linearly deterministic component; Linearly indeterministic component; Moving average; Nonstationary process; Partial autocorrelation; Phillips-Perron test; Random walk; Random walk with drift; Sample periodogram; Spectral density function; Stationarity; Square summable; Superconsistent; Trend stationary; Unit root; Univariate time series; White noise; Wold decomposition; Yule-Walker equations

    Exercises

    1. Find the autocorrelations and partial autocorrelations for the MA(2) process $\varepsilon_t = v_t - \theta_1 v_{t-1} - \theta_2 v_{t-2}$.
    2. Carry out the ADF test for a unit root in the bond yield data of Example 20.1.
    3. Using the macroeconomic data in Appendix Table F5.1, estimate by least squares the parameters of the model $c_t = \beta_0 + \beta_1 y_t + \beta_2 c_{t-1} + \beta_3 c_{t-2} + \varepsilon_t$, where $c_t$ is the log of real consumption and $y_t$ is the log of real disposable income.
       a. Use the Breusch and Godfrey test to examine the residuals for autocorrelation.
       b. Is the estimated equation stable? What is the characteristic equation for the autoregressive part of this model? What are the roots of the characteristic equation, using your estimated parameters?
       c. What is your implied estimate of the short-run (impact) multiplier for change in $y_t$ on $c_t$? Compute the estimated long-run multiplier.
    4. Verify the result in (20-10).
    5. Show the Yule-Walker equations for an ARMA(1, 1) process.
    6. Carry out an ADF test for a unit root in the rate of inflation using the subset of the data in Table F5.1 since 1974:1. (This is the first quarter after the oil shock of 1973.)
    7. Estimate the parameters of the model in Example 15.1 using two-stage least squares. Obtain the residuals from the two equations. Do these residuals appear to be white noise series? Based on your findings, what do you conclude about the specification of the model?

    21 MODELS FOR DISCRETE CHOICE
    21.1 INTRODUCTION

    There are many settings in which the economic outcome we seek to model is a discrete choice among a set of alternatives, rather than a continuous measure of some activity. Consider, for example, modeling labor force participation, the decision of whether or not to make a major purchase, or the decision of which candidate to vote for in an election. For the first of these examples, intuition would suggest that factors such as age, education, marital status, number of children, and some economic data would be relevant in explaining whether an individual chooses to seek work or not in a given period. But something is obviously lacking if this example is treated as the same sort of regression model we used to analyze consumption or the costs of production or the movements of exchange rates. In this chapter, we shall examine a variety of what have come to be known as qualitative response (QR) models. There are numerous different types that apply in different situations. What they have in common is that they are models in which the dependent variable is an indicator of a discrete choice, such as a "yes or no" decision. In general, conventional regression methods are inappropriate in these cases.

    This chapter is a lengthy but far from complete survey of topics in estimating QR models. Almost none of these models can be consistently estimated with linear regression methods. Therefore, readers interested in the mechanics of estimation may want to review the material in Appendices D and E before continuing. In most cases, the method of estimation is maximum likelihood. The various properties of maximum likelihood estimators are discussed in Chapter 17. We shall assume throughout this chapter that the necessary conditions behind the optimality properties of maximum likelihood estimators are met and, therefore, we will not derive or establish these properties specifically for the QR models. Detailed proofs for most of these models can be found in surveys by Amemiya (1981), McFadden (1984), Maddala (1983), and Dhrymes (1984). Additional commentary on some of the issues of interest in the contemporary literature is given by Maddala and Flores-Lagunes (2001).

    21.2 DISCRETE CHOICE MODELS

    The general class of models we shall consider are those for which the dependent variable takes values 0, 1, 2, .... In a few cases, the values will themselves be meaningful, as in the following:

    1. Number of patents: y = 0, 1, 2, .... These are count data.

    In most of the cases we shall study, the values taken by the dependent variables are merely a coding for some qualitative outcome. Some examples are as follows:

    2. Labor force participation: We equate "no" with 0 and "yes" with 1. These decisions are qualitative choices. The 0/1 coding is a mere convenience.
    3. Opinions of a certain type of legislation: Let 0 represent "strongly opposed," 1 "opposed," 2 "neutral," 3 "support," and 4 "strongly support." These numbers are rankings, and the values chosen are not quantitative but merely an ordering. The difference between the outcomes represented by 1 and 0 is not necessarily the same as that between 2 and 1.
    4. The occupational field chosen by an individual: Let 0 be clerk, 1 engineer, 2 lawyer, 3 politician, and so on. These data are merely categories, giving neither a ranking nor a count.
    5. Consumer choice among alternative shopping areas: This case has the same characteristics as example 4, but the appropriate model is a bit different. These two examples will differ in the extent to which the choice is based on characteristics of the individual, which are probably dominant in the occupational choice, as opposed to attributes of the choices, which is likely the more important consideration in the choice of shopping venue.

    None of these situations lends itself readily to our familiar type of regression analysis. Nonetheless, in each case, we can construct models that link the decision or outcome to a set of factors, at least in the spirit of regression. Our approach will be to analyze each of them in the general framework of probability models:

    Prob(event $j$ occurs) = Prob($Y = j$) = $F$[relevant effects, parameters].    (21-1)

    The study of qualitative choice focuses on appropriate specification, estimation, and use of models for the probabilities of events, where in most cases, the "event" is an individual's choice among a set of alternatives.

    Example 21.1 Labor Force Participation Model

    In Example 4.3 we estimated an earnings equation for the subsample of 428 married women who participated in the formal labor market, taken from a full sample of 753 observations. The semilog earnings equation is of the form

    $\ln \text{earnings} = \beta_1 + \beta_2\,\text{age} + \beta_3\,\text{age}^2 + \beta_4\,\text{education} + \beta_5\,\text{kids} + \varepsilon$

    where earnings is hourly wage times hours worked, education is measured in years of schooling, and kids is a binary variable which equals one if there are children under 18 in the household. What of the other 325 individuals? The underlying labor supply model described a market in which labor force participation was the outcome of a market process whereby the demanders of labor services were willing to offer a wage based on expected marginal product, and individuals themselves made a decision whether or not to accept the offer depending on whether it exceeded their own reservation wage. The first of these depends on, among other things, education, while the second (we assume) depends on such variables as age, the presence of children in the household, other sources of income (husband's), and marginal tax rates on labor income. The sample we used to fit the earnings equation contains data on all these other variables. The models considered in this chapter would be appropriate for modeling the outcome $y_i = 1$ if in the labor force, and 0 if not.

    21.3 MODELS FOR BINARY CHOICE

    Models for explaining a binary (0/1) dependent variable typically arise in two contexts. In many cases, the analyst is essentially interested in a regression-like model of the sort considered in Chapters 2 to 9. With data on the variable of interest and a set of covariates, the analyst is interested in specifying a relationship between the former and the latter, more or less along the lines of the models we have already studied. The relationship between voting behavior and income is typical. In other cases, the binary choice model arises in the context of a model in which the nature of the observed data dictates the special treatment of a binary choice model. For example, in a model of the demand for tickets for sporting events, in which the variable of interest is number of tickets, it could happen that the observation consists only of whether the sports facility was filled to capacity (demand greater than or equal to capacity so Y = 1) or not (Y = 0). It will generally turn out that the models and techniques used in both cases are the same. Nonetheless, it is useful to examine both of them.

    21.3.1 THE REGRESSION APPROACH

    To focus ideas, consider the model of labor force participation suggested in Example 21.1.^1 The respondent either works or seeks work (Y = 1) or does not (Y = 0) in the period in which our survey is taken. We believe that a set of factors, such as age, marital status, education, and work history, gathered in a vector $x$, explain the decision, so that

    $\text{Prob}(Y = 1 \mid x) = F(x, \beta)$,
    $\text{Prob}(Y = 0 \mid x) = 1 - F(x, \beta)$.    (21-2)

    The set of parameters $\beta$ reflects the impact of changes in $x$ on the probability. For example, among the factors that might interest us is the marginal effect of marital status on the probability of labor force participation. The problem at this point is to devise a suitable model for the right-hand side of the equation. One possibility is to retain the familiar linear regression,

    $F(x, \beta) = x'\beta$.

    Since $E[y \mid x] = F(x, \beta)$, we can construct the regression model,

    $y = E[y \mid x] + (y - E[y \mid x]) = x'\beta + \varepsilon$.    (21-3)

    The linear probability model has a number of shortcomings. A minor complication arises because $\varepsilon$ is heteroscedastic in a way that depends on $\beta$. Since $x'\beta + \varepsilon$ must equal 0 or 1, $\varepsilon$ equals either $-x'\beta$ or $1 - x'\beta$, with probabilities $1 - F$ and $F$, respectively. Thus, you can easily show that

    $\text{Var}[\varepsilon \mid x] = x'\beta(1 - x'\beta)$.    (21-4)
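    For readers who want the intermediate step behind the variance in (21-4), the following short derivation may help. It assumes the notation of the linear probability model, writing $\varepsilon$ for the disturbance and $\beta$ for the coefficient vector, and uses only the two-point distribution of the disturbance with $F = x'\beta$.

```latex
\begin{aligned}
E[\varepsilon \mid x] &= (-x'\beta)(1 - x'\beta) + (1 - x'\beta)(x'\beta) = 0, \\
\operatorname{Var}[\varepsilon \mid x] &= (x'\beta)^2 (1 - x'\beta) + (1 - x'\beta)^2 (x'\beta) \\
&= x'\beta\,(1 - x'\beta)\,\bigl[\,x'\beta + (1 - x'\beta)\,\bigr] = x'\beta\,(1 - x'\beta).
\end{aligned}
```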

    We could manage this complication with an FGLS estimator in the fashion of Chapter 11. A more serious flaw is that without some ad hoc tinkering with the disturbances, we cannot be assured that the predictions from this model will truly look like probabilities.
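    A small simulation makes the flaw concrete. The sketch below is only illustrative: the data generating process, the coefficient values, and the sample size are hypothetical choices, not taken from the text. It fits the linear probability model by ordinary least squares and counts the fitted values that fall outside the unit interval, where the implied variance in (21-4) turns negative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(scale=2.0, size=n)
# Hypothetical data generating process: y = 1 when a latent index is positive.
y = (0.5 + x + rng.normal(size=n) > 0).astype(float)

X = np.column_stack([np.ones(n), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)      # linear probability model by OLS
fitted = X @ b                                 # predicted "probabilities"
implied_var = fitted * (1.0 - fitted)          # x'b (1 - x'b), as in (21-4)

n_outside = int(np.sum((fitted < 0) | (fitted > 1)))
print("OLS coefficients:", np.round(b, 3))
print("fitted values outside [0, 1]:", n_outside)
print("smallest implied variance:", round(float(implied_var.min()), 3))
```

    With a regressor that varies this widely, some fitted values exceed one or fall below zero, so the weights an FGLS estimator would need are not even defined for those observations.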
    1 Models for qualitative dependent variables can now be found in most disciplines in economics. A frequent use is in labor economics in the analysis of microlevel data sets.


    FIGURE 21.1 Model for a Probability. [Figure: $F(x'\beta)$, on a vertical axis from 0.00 to 1.00, plotted against $x'\beta$ on a horizontal axis from $-30$ to $30$.]

    We cannot constrain $x'\beta$ to the 0-1 interval. Such a model produces both nonsense probabilities and negative variances. For these reasons, the linear model is becoming less frequently used except as a basis for comparison to some other more appropriate models.^2

    Our requirement, then, is a model that will produce predictions consistent with the underlying theory in (21-1). For a given regressor vector, we would expect

    $\lim_{x'\beta \to +\infty} \text{Prob}(Y = 1 \mid x) = 1$,
    $\lim_{x'\beta \to -\infty} \text{Prob}(Y = 1 \mid x) = 0$.    (21-5)

    See Figure 21.1. In principle, any proper, continuous probability distribution defined over the real line will suffice. The normal distribution has been used in many analyses, giving rise to the probit model,

    $\text{Prob}(Y = 1 \mid x) = \int_{-\infty}^{x'\beta} \phi(t)\,dt = \Phi(x'\beta)$.    (21-6)

    The function $\Phi(\cdot)$ is a commonly used notation for the standard normal distribution.

    2 The linear model is not beyond redemption. Aldrich and Nelson (1984) analyze the properties of the model at length. Judge et al. (1985) and Fomby, Hill, and Johnson (1984) give interesting discussions of the ways we may modify the model to force internal consistency. But the fixes are sample dependent, and the resulting estimator, such as it is, may have no known sampling properties. Additional discussion of weighted least squares appears in Amemiya (1977) and Mullahy (1990). Finally, its shortcomings notwithstanding, the linear probability model is applied by Caudill (1988), Heckman and MaCurdy (1985), and Heckman and Snyder (1997).


    Partly because of its mathematical convenience, the logistic distribution,

    $\text{Prob}(Y = 1 \mid x) = \dfrac{e^{x'\beta}}{1 + e^{x'\beta}} = \Lambda(x'\beta)$,    (21-7)

    has also been used in many applications. We shall use the notation $\Lambda(\cdot)$ to indicate the logistic cumulative distribution function. This model is called the logit model for reasons we shall discuss in the next section. Both of these distributions have the familiar bell shape of symmetric distributions. Other models which do not assume symmetry, such as the Weibull model

    $\text{Prob}(Y = 1 \mid x) = \exp[-\exp(-x'\beta)]$

    and complementary log log model,

    $\text{Prob}(Y = 1 \mid x) = 1 - \exp[-\exp(x'\beta)]$,

    have also been employed. Still other distributions have been suggested,^3 but the probit and logit models are still the most common frameworks used in econometric applications.

    The question of which distribution to use is a natural one. The logistic distribution is similar to the normal except in the tails, which are considerably heavier. (It more closely resembles a t distribution with seven degrees of freedom.) Therefore, for intermediate values of $x'\beta$ (say, between $-1.2$ and $+1.2$), the two distributions tend to give similar probabilities. The logistic distribution tends to give larger probabilities to Y = 1 when $x'\beta$ is extremely small (and smaller probabilities to Y = 1 when $x'\beta$ is very large) than the normal distribution. It is difficult to provide practical generalities on this basis, however, since they would require knowledge of $\beta$. We should expect different predictions from the two models, however, if the sample contains (1) very few responses (Ys equal to 1) or very few nonresponses (Ys equal to 0) and (2) very wide variation in an important independent variable, particularly if (1) is also true. There are practical reasons for favoring one or the other in some cases for mathematical convenience, but it is difficult to justify the choice of one distribution or another on theoretical grounds. Amemiya (1981) discusses a number of related issues, but as a general proposition, the question is unresolved. In most applications, the choice between these two seems not to make much difference. However, as seen in the example below, the symmetric and asymmetric distributions can give substantively different results, and here, the guidance on how to choose is unfortunately sparse.

    The probability model is a regression:

    $E[y \mid x] = 0[1 - F(x'\beta)] + 1[F(x'\beta)] = F(x'\beta)$.    (21-8)
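    The comparison of the two symmetric distributions can be checked numerically. The sketch below evaluates the standard normal and logistic distribution functions at a few values of the index using only the Python standard library; the grid of index values is an arbitrary illustration, and the function names are hypothetical. Note that the two are compared at the same index value, without rescaling either distribution.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    """Logistic distribution function, exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Probabilities that Y = 1 implied by each model at the same index value.
rows = [(z, normal_cdf(z), logistic_cdf(z))
        for z in (-4.0, -2.0, -1.2, 0.0, 1.2, 2.0, 4.0)]
for z, p_normal, p_logistic in rows:
    print(f"index {z:5.1f}  normal = {p_normal:.4f}  logistic = {p_logistic:.4f}")
```

    In both tails the logistic probabilities are pulled toward one half relative to the normal ones, which is the heavier-tail property described above.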

    Whatever distribution is used, it is important to note that the parameters of the model, like those of any nonlinear regression model, are not necessarily the marginal effects we are accustomed to analyzing. In general,

    $\dfrac{\partial E[y \mid x]}{\partial x} = \left\{ \dfrac{dF(x'\beta)}{d(x'\beta)} \right\} \beta = f(x'\beta)\beta$,    (21-9)

    3 See, for example, Maddala (1983, pp. 27-32), Aldrich and Nelson (1984), and Greene (2001).


    where $f(\cdot)$ is the density function that corresponds to the cumulative distribution, $F(\cdot)$. For the normal distribution, this result is

    $\dfrac{\partial E[y \mid x]}{\partial x} = \phi(x'\beta)\beta$,    (21-10)

    where $\phi(t)$ is the standard normal density. For the logistic distribution,

    $\dfrac{d\Lambda(x'\beta)}{d(x'\beta)} = \dfrac{e^{x'\beta}}{(1 + e^{x'\beta})^2} = \Lambda(x'\beta)[1 - \Lambda(x'\beta)]$.    (21-11)

    Thus, in the logit model,

    $\dfrac{\partial E[y \mid x]}{\partial x} = \Lambda(x'\beta)[1 - \Lambda(x'\beta)]\beta$.    (21-12)

    It is obvious that these values will vary with the values of $x$. In interpreting the estimated model, it will be useful to calculate this value at, say, the means of the regressors and, where necessary, other pertinent values. For convenience, it is worth noting that the same scale factor applies to all the slopes in the model.

    For computing marginal effects, one can evaluate the expressions at the sample means of the data or evaluate the marginal effects at every observation and use the sample average of the individual marginal effects. The functions are continuous with continuous first derivatives, so Theorem D.12 (the Slutsky theorem) and, assuming that the data are "well behaved," a law of large numbers (Theorems D.4 and D.5) apply; in large samples these will give the same answer. But that is not so in small or moderate-sized samples. Current practice favors averaging the individual marginal effects when it is possible to do so.

    Another complication for computing marginal effects in a binary choice model arises because $x$ will often include dummy variables; for example, a labor force participation equation will often contain a dummy variable for marital status. Since the derivative is with respect to a small change, it is not appropriate to apply (21-10) for the effect of a change in a dummy variable, or change of state. The appropriate marginal effect for a binary independent variable, say $d$, would be

    Marginal effect $= \text{Prob}[Y = 1 \mid \bar{x}_{(d)}, d = 1] - \text{Prob}[Y = 1 \mid \bar{x}_{(d)}, d = 0]$,

    where $\bar{x}_{(d)}$ denotes the means of all the other variables in the model. Simply taking the derivative with respect to the binary variable as if it were continuous provides an approximation that is often surprisingly accurate. In Example 21.3, the difference in the two probabilities for the probit model is (0.5702 - 0.1057) = 0.4645, whereas the derivative approximation reported below is 0.468. Nonetheless, it might be optimistic to rely on this outcome. We will revisit this computation in the examples and discussion to follow.
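    The three computations described above (effects at the sample means, the average of the individual effects, and the discrete change for a dummy variable) can be sketched as follows. Everything specific here is an assumption made for illustration: the coefficient vector is hypothetical rather than estimated, the regressors are simulated, and the helper names `density` and `marginal_effects` are not from the text.

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal distribution function for a scalar index."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    return 1.0 / (1.0 + np.exp(-z))

def density(z, model):
    """Scale factor f(z): normal density for probit, L(z)[1 - L(z)] for logit."""
    z = np.asarray(z, dtype=float)
    if model == "probit":
        return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    lam = logistic_cdf(z)
    return lam * (1.0 - lam)

def marginal_effects(beta, X, model="probit"):
    """Return (effects at the sample means, average of individual effects)."""
    at_means = density(X.mean(axis=0) @ beta, model) * beta
    averaged = density(X @ beta, model).mean() * beta
    return at_means, averaged

rng = np.random.default_rng(2)
n = 200
# Columns: constant, a continuous regressor, a dummy variable.
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.integers(0, 2, size=n)])
beta = np.array([-0.5, 0.8, 0.6])      # hypothetical values, not estimates

at_means, averaged = marginal_effects(beta, X, "probit")

# Discrete change for the dummy, holding the other variables at their means.
xbar = X.mean(axis=0)
base = xbar[0] * beta[0] + xbar[1] * beta[1]
dummy_effect = normal_cdf(base + beta[2]) - normal_cdf(base)

print("at means:        ", np.round(at_means[1:], 4))
print("averaged:        ", np.round(averaged[1:], 4))
print("dummy, discrete: ", round(dummy_effect, 4))
```

    The entry for the constant term is carried along only for convenience and has no interpretation as a marginal effect. Because one scale factor multiplies the whole coefficient vector, the ratio of any two effects equals the ratio of the corresponding coefficients.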
    21.3.2 LATENT REGRESSION: INDEX FUNCTION MODELS

    Discrete dependent-variable models are often cast in the form of index function models. We view the outcome of a discrete choice as a reflection of an underlying regression. As an often-cited example, consider the decision to make a large purchase. The theory states that the consumer makes a marginal benefit/marginal cost calculation based on the utilities achieved by making the purchase and by not making the purchase and by


    using the money for something else We model the difference between bene t and cost as an unobserved variable y such that y x We assume that has mean zero and has either a standardized logistic with known variance 2 3 see 21 7 or a standard normal distribution with variance one see 21 6 We do not observe the net bene t of the purchase only whether it is made or not Therefore our observation is y 1 y 0 if y 0 if y 0

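In this formulation, $\mathbf{x}'\boldsymbol\beta$ is called the index function.

Two aspects of this construction merit our attention. First, the assumption of known variance of $\varepsilon$ is an innocent normalization. Suppose the variance of $\varepsilon$ is scaled by an unrestricted parameter $\sigma^2$. The latent regression will be $y^* = \mathbf{x}'\boldsymbol\beta + \sigma\varepsilon$. But $(y^*/\sigma) = \mathbf{x}'(\boldsymbol\beta/\sigma) + \varepsilon$ is the same model with the same data. The observed data will be unchanged; $y$ is still 0 or 1, depending only on the sign of $y^*$, not on its scale. This means that there is no information about $\sigma$ in the data, so it cannot be estimated.

Second, the assumption of zero for the threshold is likewise innocent if the model contains a constant term (and not if it does not).⁴ Let $a$ be the supposed nonzero threshold and $\alpha$ be an unknown constant term and, for the present, let $\mathbf{x}$ and $\boldsymbol\beta$ contain the rest of the index, not including the constant term. Then, the probability that $y$ equals one is

$$\text{Prob}(y^* > a \mid \mathbf{x}) = \text{Prob}(\alpha + \mathbf{x}'\boldsymbol\beta + \varepsilon > a \mid \mathbf{x}) = \text{Prob}[(\alpha - a) + \mathbf{x}'\boldsymbol\beta + \varepsilon > 0 \mid \mathbf{x}].$$

Since $\alpha$ is unknown, the difference $(\alpha - a)$ remains an unknown parameter. With the two normalizations,

$$\text{Prob}(y^* > 0 \mid \mathbf{x}) = \text{Prob}(\varepsilon > -\mathbf{x}'\boldsymbol\beta \mid \mathbf{x}).$$

If the distribution is symmetric, as are the normal and logistic, then

$$\text{Prob}(y^* > 0 \mid \mathbf{x}) = \text{Prob}(\varepsilon < \mathbf{x}'\boldsymbol\beta \mid \mathbf{x}) = F(\mathbf{x}'\boldsymbol\beta),$$

which provides an underlying structural model for the probability.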
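The scale normalization can be checked with a small simulation. This is a sketch under assumed values: the coefficients `beta0`, `beta1` and the scale `sigma` are hypothetical. It shows that rescaling the latent variable by any positive constant leaves every observed outcome unchanged, which is why the scale cannot be estimated.

```python
import random

random.seed(1)
beta0, beta1 = 0.3, -1.2   # hypothetical constant and slope
sigma = 2.5                # hypothetical disturbance scale (not identified)

outcomes = []
outcomes_rescaled = []
for _ in range(1000):
    x = random.gauss(0.0, 1.0)
    eps = random.gauss(0.0, 1.0)
    # Latent regression with disturbance scale sigma.
    y_star = beta0 + beta1 * x + sigma * eps
    # Dividing by sigma gives the same model with coefficients beta/sigma.
    y_star_rescaled = y_star / sigma
    outcomes.append(1 if y_star > 0 else 0)
    outcomes_rescaled.append(1 if y_star_rescaled > 0 else 0)

# The observed 0/1 data depend only on the sign of y*, not on its scale.
identical = outcomes == outcomes_rescaled
print(identical, sum(outcomes))
```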
Example 21.2 Structural Equations for a Probit Model

Nakosteen and Zimmer (1980) analyze a model of migration based on the following structure.⁵ For individual $i$, the market wage that can be earned at the present location is

$$y_p^* = \mathbf{x}_p'\boldsymbol\beta + \varepsilon_p.$$

Variables in the equation include age, sex, race, growth in employment, and growth in per capita income. If the individual migrates to a new location, then his or her market wage would be

$$y_m^* = \mathbf{x}_m'\boldsymbol\gamma + \varepsilon_m.$$

Migration, however, entails costs that are related both to the individual and to the labor market:

$$C^* = \mathbf{z}'\boldsymbol\alpha + u.$$

Costs of moving are related to whether the individual is self-employed and whether that person recently changed his or her industry of employment. They migrate if the benefit $y_m^* - y_p^*$ is greater than the cost $C^*$. The net benefit of moving is

$$M^* = y_m^* - y_p^* - C^* = \mathbf{x}_m'\boldsymbol\gamma - \mathbf{x}_p'\boldsymbol\beta - \mathbf{z}'\boldsymbol\alpha + (\varepsilon_m - \varepsilon_p - u) = \mathbf{w}'\boldsymbol\delta + \varepsilon.$$

Since $M^*$ is unobservable, we cannot treat this equation as an ordinary regression. The individual either moves or does not. After the fact, we observe only $y_m^*$ if the individual has moved or $y_p^*$ if he or she has not. But we do observe that $M = 1$ for a move and $M = 0$ for no move. If the disturbances are normally distributed, then the probit model we analyzed earlier is produced. Logistic disturbances produce the logit model instead.

⁴ Unless there is some compelling reason, binomial probability models should not be estimated without constant terms.

⁵ A number of other studies have also used variants of this basic formulation. Some important examples are Willis and Rosen (1979) and Robinson and Tomes (1982). The study by Tunali (1986) examined in Example 21.5 is another example. The now standard approach, in which "participation" equals one if wage offer $(\mathbf{x}_w'\boldsymbol\beta_w + \varepsilon_w)$ minus reservation wage $(\mathbf{x}_r'\boldsymbol\beta_r + \varepsilon_r)$ is positive, is also used in Fernandez and Rodriguez-Poo (1997). Brock and Durlauf (2000) describe a number of models and situations involving individual behavior that give rise to binary choice models.
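21.3.3 RANDOM UTILITY MODELS

An alternative interpretation of data on individual choices is provided by the random utility model. Suppose that in the Nakosteen-Zimmer framework, $y_m$ and $y_p$ represent the individual's utility of two choices, which we might denote $U^a$ and $U^b$. For another example, $U^a$ might be the utility of rental housing and $U^b$ that of home ownership. The observed choice between the two reveals which one provides the greater utility, but not the unobservable utilities. Hence, the observed indicator equals 1 if $U^a > U^b$ and 0 if $U^a \le U^b$. A common formulation is the linear random utility model,

$$U^a = \mathbf{x}'\boldsymbol\beta_a + \varepsilon_a \quad\text{and}\quad U^b = \mathbf{x}'\boldsymbol\beta_b + \varepsilon_b. \tag{21-13}$$

Then, if we denote by $Y = 1$ the consumer's choice of alternative $a$, we have

$$\begin{aligned}
\text{Prob}[Y = 1 \mid \mathbf{x}] &= \text{Prob}[U^a > U^b] \\
&= \text{Prob}[\mathbf{x}'\boldsymbol\beta_a + \varepsilon_a - \mathbf{x}'\boldsymbol\beta_b - \varepsilon_b > 0 \mid \mathbf{x}] \\
&= \text{Prob}[\mathbf{x}'(\boldsymbol\beta_a - \boldsymbol\beta_b) + \varepsilon_a - \varepsilon_b > 0 \mid \mathbf{x}] \\
&= \text{Prob}[\mathbf{x}'\boldsymbol\beta + \varepsilon > 0 \mid \mathbf{x}]
\end{aligned} \tag{21-14}$$

once again.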

21.4 ESTIMATION AND INFERENCE IN BINARY CHOICE MODELS

With the exception of the linear probability model, estimation of binary choice models is usually based on the method of maximum likelihood. Each observation is treated as a single draw from a Bernoulli distribution (binomial with one draw). The model with success probability $F(\mathbf{x}'\boldsymbol\beta)$ and independent observations leads to the joint probability, or likelihood function,

$$\text{Prob}(Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n \mid \mathbf{X}) = \prod_{y_i = 0}[1 - F(\mathbf{x}_i'\boldsymbol\beta)] \prod_{y_i = 1} F(\mathbf{x}_i'\boldsymbol\beta), \tag{21-15}$$

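where $\mathbf{X}$ denotes $[\mathbf{x}_i]_{i=1,\ldots,n}$. The likelihood function for a sample of $n$ observations can be conveniently written as

$$L(\boldsymbol\beta \mid \text{data}) = \prod_{i=1}^{n} [F(\mathbf{x}_i'\boldsymbol\beta)]^{y_i}[1 - F(\mathbf{x}_i'\boldsymbol\beta)]^{1 - y_i}. \tag{21-16}$$

Taking logs, we obtain⁶

$$\ln L = \sum_{i=1}^{n} \left\{ y_i \ln F(\mathbf{x}_i'\boldsymbol\beta) + (1 - y_i) \ln[1 - F(\mathbf{x}_i'\boldsymbol\beta)] \right\}. \tag{21-17}$$

The likelihood equations are

$$\frac{\partial \ln L}{\partial \boldsymbol\beta} = \sum_{i=1}^{n} \left[ \frac{y_i f_i}{F_i} + (1 - y_i)\frac{-f_i}{1 - F_i} \right] \mathbf{x}_i = \mathbf{0}, \tag{21-18}$$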

where $f_i$ is the density, $dF_i/d(\mathbf{x}_i'\boldsymbol\beta)$. [In (21-18) and later, we will use the subscript $i$ to indicate that the function has an argument $\mathbf{x}_i'\boldsymbol\beta$.] The choice of a particular form for $F_i$ leads to the empirical model.

Unless we are using the linear probability model, the likelihood equations in (21-18) will be nonlinear and require an iterative solution. All of the models we have seen thus far are relatively straightforward to analyze. For the logit model, by inserting (21-7) and (21-11) in (21-18), we get, after a bit of manipulation, the likelihood equations

$$\frac{\partial \ln L}{\partial \boldsymbol\beta} = \sum_{i=1}^{n} (y_i - \Lambda_i)\mathbf{x}_i = \mathbf{0}. \tag{21-19}$$

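Note that if $\mathbf{x}_i$ contains a constant term, the first-order conditions imply that the average of the predicted probabilities must equal the proportion of ones in the sample.⁷ This implication also bears some similarity to the least squares normal equations if we view the term $y_i - \Lambda_i$ as a residual.⁸ For the normal distribution, the log-likelihood is

$$\ln L = \sum_{y_i = 0} \ln[1 - \Phi(\mathbf{x}_i'\boldsymbol\beta)] + \sum_{y_i = 1} \ln \Phi(\mathbf{x}_i'\boldsymbol\beta). \tag{21-20}$$

The first-order conditions for maximizing $L$ are

$$\frac{\partial \ln L}{\partial \boldsymbol\beta} = \sum_{y_i = 0} \frac{-\phi_i}{1 - \Phi_i}\mathbf{x}_i + \sum_{y_i = 1} \frac{\phi_i}{\Phi_i}\mathbf{x}_i = \sum_{y_i = 0} \lambda_{0i}\mathbf{x}_i + \sum_{y_i = 1} \lambda_{1i}\mathbf{x}_i.$$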

⁶ If the distribution is symmetric, as the normal and logistic are, then $1 - F(\mathbf{x}'\boldsymbol\beta) = F(-\mathbf{x}'\boldsymbol\beta)$. There is a further simplification. Let $q = 2y - 1$. Then $\ln L = \sum_i \ln F(q_i\mathbf{x}_i'\boldsymbol\beta)$. See (21-21).

⁷ The same result holds for the linear probability model. Although regularly observed in practice, the result has not been verified for the probit model.

⁸ This sort of construction arises in many models. The first derivative of the log-likelihood with respect to the constant term produces the generalized residual in many settings. See, for example, Chesher, Lancaster, and Irish (1985) and the equivalent result for the tobit model in Section 20.3.5.

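Using the device suggested in footnote 6, we can reduce this to

$$\frac{\partial \ln L}{\partial \boldsymbol\beta} = \sum_{i=1}^{n} \left[ \frac{q_i\,\phi(q_i\mathbf{x}_i'\boldsymbol\beta)}{\Phi(q_i\mathbf{x}_i'\boldsymbol\beta)} \right] \mathbf{x}_i = \sum_{i=1}^{n} \lambda_i\mathbf{x}_i = \mathbf{0}, \tag{21-21}$$

where $q_i = 2y_i - 1$.

The actual second derivatives for the logit model are quite simple:

$$\mathbf{H} = \frac{\partial^2 \ln L}{\partial \boldsymbol\beta\,\partial \boldsymbol\beta'} = -\sum_{i} \Lambda_i(1 - \Lambda_i)\mathbf{x}_i\mathbf{x}_i'. \tag{21-22}$$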
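The logit score in (21-19) and Hessian in (21-22) are all that a Newton iteration needs. The sketch below is a minimal illustration, assuming a model with a constant and a single regressor so that the 2 x 2 system can be solved by hand; the function name `fit_logit_newton`, the simulated data, and the "true" coefficients are hypothetical. It also checks the property that, with a constant term, the average fitted probability equals the sample proportion of ones.

```python
import math
import random

def logistic_cdf(z):
    """Lambda(z) = e^z / (1 + e^z), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logit_newton(xs, ys, tol=1e-10, max_iter=50):
    """Newton's method for a logit model with a constant and one regressor."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0            # score: sum of (y_i - Lambda_i) x_i
        a00 = a01 = a11 = 0.0    # -H: sum of Lambda_i (1 - Lambda_i) x_i x_i'
        for x, y in zip(xs, ys):
            p = logistic_cdf(b0 + b1 * x)
            r = y - p
            w = p * (1.0 - p)
            g0 += r
            g1 += r * x
            a00 += w
            a01 += w * x
            a11 += w * x * x
        det = a00 * a11 - a01 * a01
        # Newton step: beta_new = beta - H^{-1} g = beta + (-H)^{-1} g.
        d0 = (a11 * g0 - a01 * g1) / det
        d1 = (-a01 * g0 + a00 * g1) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

random.seed(2)
true_b0, true_b1 = -0.5, 1.0   # hypothetical values used only to simulate data
xs = [random.gauss(0.0, 1.0) for _ in range(500)]
ys = [1 if random.random() < logistic_cdf(true_b0 + true_b1 * x) else 0
      for x in xs]

b0_hat, b1_hat = fit_logit_newton(xs, ys)
mean_fitted = sum(logistic_cdf(b0_hat + b1_hat * x) for x in xs) / len(xs)
share_of_ones = sum(ys) / len(ys)
print(b0_hat, b1_hat, mean_fitted, share_of_ones)
```

Because the Hessian does not depend on $y_i$, the same loop is also the method of scoring for this model.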

Since the second derivatives do not involve the random variable $y_i$, Newton's method is also the method of scoring for the logit model. Note that the Hessian is always negative definite, so the log-likelihood is globally concave. Newton's method will usually converge to the maximum of the log-likelihood in just a few iterations unless the data are especially badly conditioned. The computation is slightly more involved for the probit model. A useful simplification is obtained by using the variable $\lambda(y_i, \mathbf{x}_i'\boldsymbol\beta) = \lambda_i$ that is defined in (21-21). The second derivatives can be obtained using the result that for any $z$, $d\phi(z)/dz = -z\phi(z)$. Then, for the probit model,

$$\mathbf{H} = \frac{\partial^2 \ln L}{\partial \boldsymbol\beta\,\partial \boldsymbol\beta'} = \sum_{i=1}^{n} -\lambda_i(\lambda_i + \mathbf{x}_i'\boldsymbol\beta)\mathbf{x}_i\mathbf{x}_i'. \tag{21-23}$$

This matrix is also negative definite for all values of $\boldsymbol\beta$. The proof is less obvious than for the logit model.⁹ It suffices to note that the scalar part in the summation is $\text{Var}[\varepsilon \mid \varepsilon \le \mathbf{x}'\boldsymbol\beta] - 1$ when $y = 1$ and $\text{Var}[\varepsilon \mid \varepsilon \ge \mathbf{x}'\boldsymbol\beta] - 1$ when $y = 0$. The unconditional variance is one. Since truncation always reduces variance (see Theorem 22.3), in both cases the variance is between zero and one, so the value is negative.¹⁰

The asymptotic covariance matrix for the maximum likelihood estimator can be estimated by using the inverse of the Hessian evaluated at the maximum likelihood estimates. There are also two other estimators available. The Berndt, Hall, Hall, and Hausman estimator [see (17-18) and Example 17.4] would be

$$\mathbf{B} = \sum_{i=1}^{n} g_i^2\,\mathbf{x}_i\mathbf{x}_i',$$

where $g_i = (y_i - \Lambda_i)$ for the logit model [see (21-19)] and $g_i = \lambda_i$ for the probit model [see (21-21)]. The third estimator would be based on the expected value of the Hessian. As we saw earlier, the Hessian for the logit model does not involve $y_i$, so $\mathbf{H} = E[\mathbf{H}]$. But because $\lambda_i$ is a function of $y_i$ [see (21-21)], this result is not true for the probit model. Amemiya (1981) showed that for the probit model,

$$E\left[\frac{\partial^2 \ln L}{\partial \boldsymbol\beta\,\partial \boldsymbol\beta'}\right]_{\text{probit}} = \sum_{i=1}^{n} \lambda_{0i}\lambda_{1i}\,\mathbf{x}_i\mathbf{x}_i'. \tag{21-24}$$

Once again, the scalar part of the expression is always negative [see (21-23) and note that $\lambda_{0i}$ is always negative and $\lambda_{1i}$ is always positive]. The estimator of the asymptotic covariance matrix for the maximum likelihood estimator is then the negative inverse of whatever matrix is used to estimate the expected Hessian. Since the actual Hessian is generally used for the iterations, this option is the usual choice. As we shall see below, though, for certain hypothesis tests, the BHHH estimator is a more convenient choice.

⁹ See, for example, Amemiya (1985, pp. 273-274) and Maddala (1983, p. 63).

¹⁰ See Johnson and Kotz (1993) and Heckman (1979). We will make repeated use of this result in Chapter 22.

In some studies [e.g., Boyes, Hoffman, and Low (1989), Greene (1992)], the mix of ones and zeros in the observed sample of the dependent variable is deliberately skewed in favor of one outcome or the other to achieve a more balanced sample than random sampling would produce. The sampling is said to be choice based. In the studies noted, the dependent variable measured the occurrence of loan default, which is a relatively uncommon occurrence. To enrich the sample, observations with $y = 1$ (default) were oversampled. Intuition should suggest (correctly) that the bias in the sample should be transmitted to the parameter estimates, which will be estimated so as to mimic the sample, not the population, which is known to be different. Manski and Lerman (1977) derived the weighted endogenous sampling maximum likelihood (WESML) estimator for this situation. The estimator requires that the true population proportions, $\omega_1$ and $\omega_0$, be known. Let $p_1$ and $p_0$ be the sample proportions of ones and zeros. Then the estimator is obtained by maximizing a weighted log-likelihood,

$$\ln L = \sum_{i=1}^{n} w_i \ln F(q_i\mathbf{x}_i'\boldsymbol\beta),$$

where $w_i = y_i(\omega_1/p_1) + (1 - y_i)(\omega_0/p_0)$. Note that $w_i$ takes only two different values. The derivatives and the Hessian are likewise weighted. A final correction is needed after estimation; the appropriate estimator of the asymptotic covariance matrix is the sandwich estimator discussed in the next section, $\mathbf{H}^{-1}\mathbf{B}\mathbf{H}^{-1}$ (with weighted $\mathbf{B}$ and $\mathbf{H}$), instead of $\mathbf{B}$ or $\mathbf{H}$ alone. (The weights are not squared in computing $\mathbf{B}$.)¹¹
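The WESML weights are simple to compute once the population proportions are known. The sketch below assumes a hypothetical choice-based sample in which defaults are half of the sample but 10 percent of the population; the helper name `wesml_weights` is illustrative only.

```python
def wesml_weights(ys, omega1):
    """Weights w_i = y_i*(omega1/p1) + (1 - y_i)*(omega0/p0)."""
    n = len(ys)
    p1 = sum(ys) / n        # sample proportion of ones
    p0 = 1.0 - p1           # sample proportion of zeros
    omega0 = 1.0 - omega1   # population proportion of zeros
    return [y * (omega1 / p1) + (1 - y) * (omega0 / p0) for y in ys]

# Hypothetical choice-based sample: 50 defaults (y = 1) and 50 non-defaults,
# with defaults assumed to be 10 percent of the population.
ys = [1] * 50 + [0] * 50
weights = wesml_weights(ys, omega1=0.10)

# Only two distinct weights occur, and they sum to the sample size.
print(sorted(set(weights)), sum(weights))
```

Oversampled observations are weighted down and undersampled ones are weighted up, so the weighted sample mimics the population mix.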
21.4.1 ROBUST COVARIANCE MATRIX ESTIMATION

The probit maximum likelihood estimator is often labeled a quasi-maximum likelihood estimator (QMLE) in view of the possibility that the normal probability model might be misspecified. White's (1982a) robust "sandwich" estimator for the asymptotic covariance matrix of the QMLE (see Section 17.9 for discussion),

$$\text{Est.Asy.Var}[\hat{\boldsymbol\beta}] = [\hat{\mathbf{H}}]^{-1}\hat{\mathbf{B}}[\hat{\mathbf{H}}]^{-1},$$

has been used in a number of recent studies based on the probit model [e.g., Fernandez and Rodriguez-Poo (1997), Horowitz (1993), and Blundell, Laisney, and Lechner (1993)]. If the probit model is correctly specified, then $\text{plim}(1/n)\hat{\mathbf{B}} = \text{plim}(1/n)(-\hat{\mathbf{H}})$ and either single matrix will suffice, so the robustness issue is moot (of course). On the other hand, the probit (Q-) maximum likelihood estimator is not consistent in the presence of any form of heteroscedasticity, unmeasured heterogeneity, omitted variables (even if they are orthogonal to the included ones), nonlinearity of the functional form of the index, or an error in the distributional assumption [with some narrow exceptions as described by Ruud (1986)]. Thus, in almost any case, the sandwich estimator provides an appropriate asymptotic covariance matrix for an estimator that is biased in an unknown direction. White raises this issue explicitly, although it seems to receive little attention in the literature: "it is the consistency of the QMLE for the parameters of interest in a wide range of situations which insures its usefulness as the basis for robust estimation techniques" (1982a, p. 4). His very useful result is that if the quasi-maximum likelihood estimator converges to a probability limit, then the sandwich estimator can, under certain circumstances, be used to estimate the asymptotic covariance matrix of that estimator. But there is no guarantee that the QMLE will converge to anything interesting or useful. Simply computing a robust covariance matrix for an otherwise inconsistent estimator does not give it redemption. Consequently, the virtue of a robust covariance matrix in this setting is unclear.

¹¹ WESML and the choice-based sampling estimator are not the free lunch they may appear to be. That which the biased sampling does, the weighting undoes. It is common for the end result to be very large standard errors, which might be viewed as unfortunate, insofar as the purpose of the biased sampling was to balance the data precisely to avoid this problem.
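21.4.2 MARGINAL EFFECTS

The predicted probabilities, $F(\mathbf{x}'\hat{\boldsymbol\beta}) = \hat{F}$, and the estimated marginal effects, $f(\mathbf{x}'\hat{\boldsymbol\beta}) \times \hat{\boldsymbol\beta} = \hat{f}\hat{\boldsymbol\beta}$, are nonlinear functions of the parameter estimates. To compute standard errors, we can use the linear approximation approach (delta method) discussed in Section 5.2.4. For the predicted probabilities,

$$\text{Asy.Var}[\hat{F}] = [\partial \hat{F}/\partial \hat{\boldsymbol\beta}]'\,\mathbf{V}\,[\partial \hat{F}/\partial \hat{\boldsymbol\beta}],$$

where $\mathbf{V} = \text{Asy.Var}[\hat{\boldsymbol\beta}]$. The estimated asymptotic covariance matrix of $\hat{\boldsymbol\beta}$ can be any of the three described earlier. Let $z = \mathbf{x}'\hat{\boldsymbol\beta}$. Then the derivative vector is

$$[\partial \hat{F}/\partial \hat{\boldsymbol\beta}] = [d\hat{F}/dz][\partial z/\partial \hat{\boldsymbol\beta}] = \hat{f}\mathbf{x}.$$

Combining terms gives

$$\text{Asy.Var}[\hat{F}] = \hat{f}^2\,\mathbf{x}'\mathbf{V}\mathbf{x},$$

which depends, of course, on the particular $\mathbf{x}$ vector used. This result is useful when a marginal effect is computed for a dummy variable. In that case, the estimated effect is

$$\Delta\hat{F} = \hat{F} \mid (d = 1) - \hat{F} \mid (d = 0). \tag{21-25}$$

The asymptotic variance would be

$$\text{Asy.Var}[\Delta\hat{F}] = [\partial \Delta\hat{F}/\partial \hat{\boldsymbol\beta}]'\,\mathbf{V}\,[\partial \Delta\hat{F}/\partial \hat{\boldsymbol\beta}], \tag{21-26}$$

where

$$[\partial \Delta\hat{F}/\partial \hat{\boldsymbol\beta}] = \hat{f}_1 \begin{pmatrix} \bar{\mathbf{x}}_{(d)} \\ 1 \end{pmatrix} - \hat{f}_0 \begin{pmatrix} \bar{\mathbf{x}}_{(d)} \\ 0 \end{pmatrix}.$$

For the other marginal effects, let $\hat{\boldsymbol\gamma} = \hat{f}\hat{\boldsymbol\beta}$. Then

$$\text{Asy.Var}[\hat{\boldsymbol\gamma}] = \left[\frac{\partial \hat{\boldsymbol\gamma}}{\partial \hat{\boldsymbol\beta}'}\right] \mathbf{V} \left[\frac{\partial \hat{\boldsymbol\gamma}}{\partial \hat{\boldsymbol\beta}'}\right]'.$$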

TABLE 21.1 Estimated Probability Models

              Linear                Logistic              Probit                Weibull
Variable      Coefficient  Slope    Coefficient  Slope    Coefficient  Slope    Coefficient  Slope
Constant      -1.498                -13.021               -7.452                -10.631
GPA            0.464       0.464      2.826      0.534     1.626       0.533      2.293      0.477
TUCE           0.010       0.010      0.095      0.018     0.052       0.017      0.041      0.009
PSI            0.379       0.379      2.379      0.499     1.426       0.468      1.562      0.325
f(x'b)                     1.000                 0.189                 0.328                 0.208

The last row reports the scale factor, the density evaluated at the means of the variables.

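The matrix of derivatives is

$$\hat{f}\left(\frac{\partial \hat{\boldsymbol\beta}}{\partial \hat{\boldsymbol\beta}'}\right) + \hat{\boldsymbol\beta}\left(\frac{d\hat{f}}{dz}\right)\left(\frac{\partial z}{\partial \hat{\boldsymbol\beta}'}\right) = \hat{f}\,\mathbf{I} + \left(\frac{d\hat{f}}{dz}\right)\hat{\boldsymbol\beta}\mathbf{x}'.$$

For the probit model, $df/dz = -z\phi$, so

$$\text{Asy.Var}[\hat{\boldsymbol\gamma}] = \phi^2\,[\mathbf{I} - (\mathbf{x}'\hat{\boldsymbol\beta})\hat{\boldsymbol\beta}\mathbf{x}']\,\mathbf{V}\,[\mathbf{I} - (\mathbf{x}'\hat{\boldsymbol\beta})\hat{\boldsymbol\beta}\mathbf{x}']'.$$

For the logit model, $\hat{f} = \hat{\Lambda}(1 - \hat{\Lambda})$, so

$$\frac{d\hat{f}}{dz} = (1 - 2\hat{\Lambda})\left(\frac{d\hat{\Lambda}}{dz}\right) = (1 - 2\hat{\Lambda})\hat{\Lambda}(1 - \hat{\Lambda}).$$

Collecting terms, we obtain

$$\text{Asy.Var}[\hat{\boldsymbol\gamma}] = [\hat{\Lambda}(1 - \hat{\Lambda})]^2\,[\mathbf{I} + (1 - 2\hat{\Lambda})\hat{\boldsymbol\beta}\mathbf{x}']\,\mathbf{V}\,[\mathbf{I} + (1 - 2\hat{\Lambda})\hat{\boldsymbol\beta}\mathbf{x}']'.$$

As before, the value obtained will depend on the $\mathbf{x}$ vector used.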
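As a concrete instance of the delta method for a predicted probability, the sketch below evaluates $\hat{f}^2\,\mathbf{x}'\mathbf{V}\mathbf{x}$ for a probit model. The estimates `beta_hat`, the covariance matrix `V_hat`, and the point `x0` are hypothetical values chosen for illustration, not results from the text.

```python
import math

def norm_pdf(z):
    """Standard normal density, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def predicted_prob_se(x, beta, V):
    """Return (F_hat, standard error) using Asy.Var[F_hat] = f_hat^2 * x'Vx."""
    k = len(x)
    z = sum(x[j] * beta[j] for j in range(k))
    x_V_x = sum(x[j] * V[j][m] * x[m] for j in range(k) for m in range(k))
    return norm_cdf(z), norm_pdf(z) * math.sqrt(x_V_x)

beta_hat = [0.2, 0.5]                       # hypothetical probit estimates
V_hat = [[0.04, -0.01], [-0.01, 0.09]]      # hypothetical Asy.Var[beta_hat]
x0 = [1.0, 1.0]                             # constant and one regressor

F_hat, se_F = predicted_prob_se(x0, beta_hat, V_hat)
print(F_hat, se_F)
```

Changing `x0` changes both the probability and its standard error, which is the dependence on the $\mathbf{x}$ vector noted above.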
Example 21.3 Probability Models

The data listed in Appendix Table F21.1 were taken from a study by Spector and Mazzeo (1980), which examined whether a new method of teaching economics, the Personalized System of Instruction (PSI), significantly influenced performance in later economics courses. The dependent variable used in our application is GRADE, which indicates whether a student's grade in an intermediate macroeconomics course was higher than that in the principles course. The other variables are GPA, their grade point average; TUCE, the score on a pretest that indicates entering knowledge of the material; and PSI, the binary variable indicator of whether the student was exposed to the new teaching method. (Spector and Mazzeo's specific equation was somewhat different from the one estimated here.)

Table 21.1 presents four sets of parameter estimates. The slope parameters and derivatives were computed for four probability models: linear, probit, logit, and Weibull. The last three sets of estimates are computed by maximizing the appropriate log-likelihood function. Estimation is discussed in the next section, so standard errors are not presented here. The scale factor given in the last row is the density function evaluated at the means of the variables. Also, note that the slope given for PSI is the derivative, not the change in the function with PSI changed from zero to one with other variables held constant.

If one looked only at the coefficient estimates, then it would be natural to conclude that the four models had produced radically different estimates. But a comparison of the columns of slopes shows that this conclusion is clearly wrong. The models are very similar; in fact, the logit and probit models' results are nearly identical.

The data used in this example are only moderately unbalanced between 0s and 1s for the dependent variable (21 and 11). As such, we might expect similar results for the probit and logit models.¹² One indicator is a comparison of the coefficients. In view of the different variances of the distributions, one for the normal and $\pi^2/3$ for the logistic, we might expect to obtain comparable estimates by multiplying the probit coefficients by $\pi/\sqrt{3} \approx 1.8$. Amemiya (1981) found, through trial and error, that scaling by 1.6 instead produced better results. This proportionality result is frequently cited. The result in (21-9) may help to explain the finding. The index $\mathbf{x}'\boldsymbol\beta$ is not the random variable. (See Section 21.3.2.) The marginal effect in the probit model for, say, $x_k$ is $\phi(\mathbf{x}'\boldsymbol\beta_p)\beta_{pk}$, whereas that for the logit is $\Lambda(1 - \Lambda)\beta_{lk}$. (The subscripts $p$ and $l$ are for probit and logit.) Amemiya suggests that his approximation works best at the center of the distribution, where $F = 0.5$, or $\mathbf{x}'\boldsymbol\beta = 0$ for either distribution. Suppose it is. Then $\phi(0) = 0.3989$ and $\Lambda(0)[1 - \Lambda(0)] = 0.25$. If the marginal effects are to be the same, then $0.3989\,\beta_{pk} = 0.25\,\beta_{lk}$, or $\beta_{lk} = 1.6\,\beta_{pk}$, which is the regularity observed by Amemiya. Note, though, that as we depart from the center of the distribution, the relationship will move away from 1.6. Since the logistic density descends more slowly than the normal, for unbalanced samples such as ours, the ratio of the logit coefficients to the probit coefficients will tend to be larger than 1.6. The ratios for the ones in Table 21.1 are closer to 1.7 than 1.6.

The computation of the derivatives of the conditional mean function is useful when the variable in question is continuous and often produces a reasonable approximation for a dummy variable. Another way to analyze the effect of a dummy variable on the whole distribution is to compute Prob(Y = 1) over the range of $\mathbf{x}'\boldsymbol\beta$ (using the sample estimates) and with the two values of the binary variable. Using the coefficients from the probit model in Table 21.1, we have the following probabilities as a function of GPA (at the mean of TUCE):

$$\text{PSI} = 0\text{: } \text{Prob}(\text{GRADE} = 1) = \Phi[-7.452 + 1.626\,\text{GPA} + 0.052(21.938)],$$
$$\text{PSI} = 1\text{: } \text{Prob}(\text{GRADE} = 1) = \Phi[-7.452 + 1.626\,\text{GPA} + 0.052(21.938) + 1.426].$$

Figure 21.2 shows these two functions plotted over the range of GPA observed in the sample, 2.0 to 4.0. The marginal effect of PSI is the difference between the two functions, which ranges from only about 0.06 at GPA = 2 to about 0.50 at GPA of 3.5. This effect shows that the probability that a student's grade will increase after exposure to PSI is far greater for students with high GPAs than for those with low GPAs. At the sample mean of GPA of 3.117, the effect of PSI on the probability is 0.465. The simple derivative calculation of (21-9) is given in Table 21.1; the estimate is 0.468. But, of course, this calculation does not show the wide range of differences displayed in Figure 21.2.

Table 21.2 presents the estimated coefficients and marginal effects for the probit and logit models in Table 21.1. In both cases, the asymptotic covariance matrix is computed from the negative inverse of the actual Hessian of the log-likelihood. The standard errors for the estimated marginal effect of PSI are computed using (21-25) and (21-26) since PSI is a binary variable. In comparison, the simple derivatives produce estimates and standard errors of (0.449, 0.181) for the logit model and (0.464, 0.188) for the probit model. These differ only slightly from the results given in the table.
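The probabilities behind this comparison can be recomputed from the rounded probit coefficients in Table 21.1. The sketch below is only an approximation of the calculation described above: because the inputs are rounded to three decimals, the results may differ slightly in the third decimal from the values reported in the text. The final line evaluates the ratio of the two densities at the center of the distributions, the source of Amemiya's factor of about 1.6.

```python
import math

def norm_pdf(z):
    """Standard normal density, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Rounded probit coefficients from Table 21.1 (constant, GPA, TUCE, PSI).
b_const, b_gpa, b_tuce, b_psi = -7.452, 1.626, 0.052, 1.426
mean_tuce = 21.938

def prob_grade(gpa, psi):
    """Prob(GRADE = 1) at the mean of TUCE for a given GPA and PSI status."""
    return norm_cdf(b_const + b_gpa * gpa + b_tuce * mean_tuce + b_psi * psi)

for gpa in (2.0, 2.5, 3.0, 3.117, 3.5, 4.0):
    p0, p1 = prob_grade(gpa, 0), prob_grade(gpa, 1)
    print(f"GPA {gpa:5.3f}: without PSI {p0:.3f}, with PSI {p1:.3f}, "
          f"difference {p1 - p0:.3f}")

# Effect of PSI at the sample mean of GPA.
effect_at_mean = prob_grade(3.117, 1) - prob_grade(3.117, 0)

# phi(0) / {Lambda(0)[1 - Lambda(0)]}: logit-to-probit coefficient ratio
# implied by equal marginal effects at the center of the distributions.
ratio_at_center = norm_pdf(0.0) / 0.25
print(effect_at_mean, ratio_at_center)
```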

FIGURE 21.2 Effect of PSI on Predicted Probabilities. [Figure: Prob(GRADE = 1) plotted against GPA from 2.0 to 4.0, with one curve labeled "With PSI" and one labeled "Without PSI"; at the mean GPA of 3.117 the marked probabilities are 0.571 and 0.106.]

TABLE 21.2 Estimated Coefficients and Standard Errors (Standard Errors in Parentheses)

                 Logistic                                              Probit
Variable         Coefficient       t Ratio  Slope           t Ratio   Coefficient       t Ratio  Slope           t Ratio
Constant         -13.021 (4.931)   -2.641                             -7.452 (2.542)    -2.931
GPA                2.826 (1.263)    2.238   0.534 (0.237)   2.252      1.626 (0.694)     2.343   0.533 (0.232)   2.294
TUCE               0.095 (0.142)    0.672   0.018 (0.026)   0.685      0.052 (0.084)     0.617   0.017 (0.027)   0.626
PSI                2.379 (1.065)    2.234   0.456 (0.181)   2.521      1.426 (0.595)     2.397   0.464 (0.170)   2.727
log likelihood   -12.890                                              -12.819

¹² One might be tempted in this case to suggest an asymmetric distribution for the model, such as the Weibull distribution. However, the asymmetry in the model, to the extent that it is present at all, refers to the values of $\varepsilon$, not to the observed sample of values of the dependent variable.

21.4.3 HYPOTHESIS TESTS

For testing hypotheses about the coefficients, the full menu of procedures is available. The simplest method for a single restriction would be based on the usual t tests, using the standard errors from the information matrix. Using the normal distribution of the estimator, we would use the standard normal table rather than the t table for critical points. For more involved restrictions, it is possible to use the Wald test. For a set of restrictions $\mathbf{R}\boldsymbol\beta = \mathbf{q}$, the statistic is

$$W = (\mathbf{R}\hat{\boldsymbol\beta} - \mathbf{q})'\{\mathbf{R}(\text{Est.Asy.Var}[\hat{\boldsymbol\beta}])\mathbf{R}'\}^{-1}(\mathbf{R}\hat{\boldsymbol\beta} - \mathbf{q}).$$

For example, for testing the hypothesis that a subset of the coefficients, say the last $M$, are zero, the Wald statistic uses $\mathbf{R} = [\mathbf{0} \mid \mathbf{I}_M]$ and $\mathbf{q} = \mathbf{0}$. Collecting terms, we find that the test statistic for this hypothesis is

$$W = \hat{\boldsymbol\beta}_M'\mathbf{V}_M^{-1}\hat{\boldsymbol\beta}_M, \tag{21-27}$$

where the subscript $M$ indicates the subvector or submatrix corresponding to the $M$ variables and $\mathbf{V}$ is the estimated asymptotic covariance matrix of $\hat{\boldsymbol\beta}$.
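With a single restriction ($M = 1$), the statistic in (21-27) reduces to the squared t ratio. A minimal sketch, using the probit estimate and standard error for PSI reported in Table 21.2 and the tabled 95 percent critical value of the chi-squared distribution with one degree of freedom:

```python
# Probit estimate and standard error for PSI, taken from Table 21.2.
b_psi, se_psi = 1.426, 0.595

# With M = 1, V_M is the squared standard error, so W = b^2 / se^2.
wald = b_psi ** 2 / se_psi ** 2

# 95 percent critical value of chi-squared with 1 degree of freedom.
CHI2_1_95 = 3.84
reject = wald > CHI2_1_95
print(wald, reject)
```

For $M > 1$, the same computation requires the inverse of the corresponding $M \times M$ block of the covariance matrix, which is not reported in the table.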


Likelihood ratio and Lagrange multiplier statistics can also be computed. The likelihood ratio statistic is

$$LR = -2[\ln \hat{L}_R - \ln \hat{L}_U],$$

where $\hat{L}_R$ and $\hat{L}_U$ are the likelihood functions evaluated at the restricted and unrestricted estimates, respectively. A common test, which is similar to the F test that all the slopes in a regression are zero, is the likelihood ratio test that all the slope coefficients in the probit or logit model are zero. For this test, the constant term remains unrestricted. In this case, the restricted log-likelihood is the same for both probit and logit models,

$$\ln L_0 = n[P \ln P + (1 - P)\ln(1 - P)], \tag{21-28}$$

where $P$ is the proportion of the observations that have dependent variable equal to 1.

It might be tempting to use the likelihood ratio test to choose between the probit and logit models. But there is no restriction involved, and the test is not valid for this purpose. To underscore the point, there is nothing in its construction to prevent the chi-squared statistic for this "test" from being negative.

The Lagrange multiplier test statistic is $LM = \mathbf{g}'\mathbf{V}\mathbf{g}$, where $\mathbf{g}$ is the vector of first derivatives of the unrestricted model evaluated at the restricted parameter vector and $\mathbf{V}$ is any of the three estimators of the asymptotic covariance matrix of the maximum likelihood estimator, once again computed using the restricted estimates. Davidson and MacKinnon (1984) find evidence that $E[\mathbf{H}]$ is the best of the three estimators to use, which gives

$$LM = \left[\sum_{i=1}^{n} g_i\mathbf{x}_i\right]'\left[\sum_{i=1}^{n} E[-h_i]\,\mathbf{x}_i\mathbf{x}_i'\right]^{-1}\left[\sum_{i=1}^{n} g_i\mathbf{x}_i\right], \tag{21-29}$$

where $E[-h_i]$ is defined in (21-22) for the logit model and in (21-24) for the probit model.

For the logit model, when the hypothesis is that all the slopes are zero, $LM = nR^2$, where $R^2$ is the uncentered coefficient of determination in the regression of $(y_i - \bar{y})$ on $\mathbf{x}_i$ and $\bar{y}$ is the proportion of 1s in the sample. An alternative formulation based on the BHHH estimator, which we developed in Section 17.5.3, is also convenient. For any of the models (probit, logit, Weibull, etc.), the first derivative vector can be written as

$$\frac{\partial \ln L}{\partial \boldsymbol\beta} = \sum_{i=1}^{n} g_i\mathbf{x}_i = \mathbf{X}'\mathbf{G}\mathbf{i},$$

where $\mathbf{G}\,(n \times n) = \text{diag}[g_1, g_2, \ldots, g_n]$ and $\mathbf{i}$ is an $n \times 1$ column of 1s. The BHHH estimator of the Hessian is $(\mathbf{X}'\mathbf{G}'\mathbf{G}\mathbf{X})$, so the LM statistic based on this estimator is

$$LM = n\left[\frac{1}{n}\,\mathbf{i}'(\mathbf{G}\mathbf{X})(\mathbf{X}'\mathbf{G}'\mathbf{G}\mathbf{X})^{-1}(\mathbf{X}'\mathbf{G}')\mathbf{i}\right] = nR_{\mathbf{i}}^2, \tag{21-30}$$

where $R_{\mathbf{i}}^2$ is the uncentered coefficient of determination in a regression of a column of ones on the first derivatives of the logs of the individual probabilities.

All the statistics listed here are asymptotically equivalent and, under the null hypothesis of the restricted model, have limiting chi-squared distributions with degrees of freedom equal to the number of restrictions being tested. We consider some examples below.

21.4.4 SPECIFICATION TESTS FOR BINARY CHOICE MODELS

In the linear regression model, we considered two important specification problems: the effect of omitted variables and the effect of heteroscedasticity. In the classical model, $\mathbf{y} = \mathbf{X}_1\boldsymbol\beta_1 + \mathbf{X}_2\boldsymbol\beta_2 + \boldsymbol\varepsilon$, when least squares estimates $\mathbf{b}_1$ are computed omitting $\mathbf{X}_2$,

$$E[\mathbf{b}_1] = \boldsymbol\beta_1 + [\mathbf{X}_1'\mathbf{X}_1]^{-1}\mathbf{X}_1'\mathbf{X}_2\boldsymbol\beta_2.$$

Unless $\mathbf{X}_1$ and $\mathbf{X}_2$ are orthogonal or $\boldsymbol\beta_2 = \mathbf{0}$, $\mathbf{b}_1$ is biased. If we ignore heteroscedasticity, then although the least squares estimator is still unbiased and consistent, it is inefficient and the usual estimate of its sampling covariance matrix is inappropriate. Yatchew and Griliches (1984) have examined these same issues in the setting of the probit and logit models. Their general results are far more pessimistic. In the context of a binary choice model, they find the following:

1. If $x_2$ is omitted from a model containing $x_1$ and $x_2$ (i.e., $\beta_2 \ne 0$), then
   $$\text{plim}\ \hat\beta_1 = c_1\beta_1 + c_2\beta_2,$$
   where $c_1$ and $c_2$ are complicated functions of the unknown parameters. The implication is that even if the omitted variable is uncorrelated with the included one, the coefficient on the included variable will be inconsistent.
2. If the disturbances in the underlying regression are heteroscedastic, then the maximum likelihood estimators are inconsistent and the covariance matrix is inappropriate.

The second result is particularly troubling because the probit model is most often used with microeconomic data, which are frequently heteroscedastic.

Any of the three methods of hypothesis testing discussed above can be used to analyze these specification problems. The Lagrange multiplier test has the advantage that it can be carried out using the estimates from the restricted model, which sometimes brings a large saving in computational effort. This situation is especially true for the test for heteroscedasticity.¹³

To reiterate, the Lagrange multiplier statistic is computed as follows. Let the null hypothesis, $H_0$, be a specification of the model, and let $H_1$ be the alternative. For example, $H_0$ might specify that only variables $\mathbf{x}_1$ appear in the model, whereas $H_1$ might specify that $\mathbf{x}_2$ appears in the model as well. The statistic is

$$LM = \mathbf{g}_0'\mathbf{V}_0^{-1}\mathbf{g}_0,$$

where $\mathbf{g}_0$ is the vector of derivatives of the log-likelihood as specified by $H_1$ but evaluated at the maximum likelihood estimator of the parameters assuming that $H_0$ is true, and $\mathbf{V}_0^{-1}$ is any of the three consistent estimators of the asymptotic variance matrix of the maximum likelihood estimator under $H_1$, also computed using the maximum likelihood estimators based on $H_0$. The statistic is asymptotically distributed as chi-squared with degrees of freedom equal to the number of restrictions.

¹³ The results in this section are based on Davidson and MacKinnon (1984) and Engle (1984). A symposium on the subject of specification tests in discrete choice models is Blundell (1987).

21.4.4.a Omitted Variables

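The hypothesis to be tested is

$$\begin{aligned}
H_0&: y^* = \mathbf{x}_1'\boldsymbol\beta_1 + \varepsilon, \\
H_1&: y^* = \mathbf{x}_1'\boldsymbol\beta_1 + \mathbf{x}_2'\boldsymbol\beta_2 + \varepsilon,
\end{aligned} \tag{21-31}$$

so the test is of the null hypothesis that $\boldsymbol\beta_2 = \mathbf{0}$. The Lagrange multiplier test would be carried out as follows:

1. Estimate the model in $H_0$ by maximum likelihood. The restricted coefficient vector is $[\hat{\boldsymbol\beta}_1, \mathbf{0}]$.
2. Let $\mathbf{x}$ be the compound vector, $[\mathbf{x}_1, \mathbf{x}_2]$.

The statistic is then computed according to (21-29) or (21-30). It is noteworthy that in this case, as in many others, the Lagrange multiplier statistic is $n$ times the uncentered coefficient of determination in a regression.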
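21.4.4.b Heteroscedasticity

We use the general formulation analyzed by Harvey (1976),¹⁴

$$\text{Var}[\varepsilon] = [\exp(\mathbf{z}'\boldsymbol\gamma)]^2.^{15}$$

This model can be applied equally to the probit and logit models. We will derive the results specifically for the probit model; the logit model is essentially the same. Thus,

$$y^* = \mathbf{x}'\boldsymbol\beta + \varepsilon, \qquad \text{Var}[\varepsilon \mid \mathbf{x}, \mathbf{z}] = [\exp(\mathbf{z}'\boldsymbol\gamma)]^2. \tag{21-32}$$

The presence of heteroscedasticity makes some care necessary in interpreting the coefficients for a variable $w_k$ that could be in $\mathbf{x}$ or $\mathbf{z}$ or both,

$$\frac{\partial\,\text{Prob}(Y = 1 \mid \mathbf{x}, \mathbf{z})}{\partial w_k} = \phi\left[\frac{\mathbf{x}'\boldsymbol\beta}{\exp(\mathbf{z}'\boldsymbol\gamma)}\right]\frac{\beta_k - (\mathbf{x}'\boldsymbol\beta)\gamma_k}{\exp(\mathbf{z}'\boldsymbol\gamma)}.$$

Only the first (second) term applies if $w_k$ appears only in $\mathbf{x}$ ($\mathbf{z}$). This implies that the simple coefficient may differ radically from the effect that is of interest in the estimated model. This effect is clearly visible in the example below.

The log-likelihood is

$$\ln L = \sum_{i=1}^{n}\left\{ y_i \ln F\left(\frac{\mathbf{x}_i'\boldsymbol\beta}{\exp(\mathbf{z}_i'\boldsymbol\gamma)}\right) + (1 - y_i)\ln\left[1 - F\left(\frac{\mathbf{x}_i'\boldsymbol\beta}{\exp(\mathbf{z}_i'\boldsymbol\gamma)}\right)\right]\right\}. \tag{21-33}$$

¹⁴ See Knapp and Seaks (1992) for an application. Other formulations are suggested by Fisher and Nagin (1981), Hausman and Wise (1978), and Horowitz (1993).

¹⁵ See Section 11.7.1.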

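To be able to estimate all the parameters, $\mathbf{z}$ cannot have a constant term. The derivatives are

$$\begin{aligned}
\frac{\partial \ln L}{\partial \boldsymbol\beta} &= \sum_{i=1}^{n}\left[\frac{f_i(y_i - F_i)}{F_i(1 - F_i)}\right]\exp(-\mathbf{z}_i'\boldsymbol\gamma)\,\mathbf{x}_i, \\
\frac{\partial \ln L}{\partial \boldsymbol\gamma} &= \sum_{i=1}^{n}\left[\frac{f_i(y_i - F_i)}{F_i(1 - F_i)}\right]\exp(-\mathbf{z}_i'\boldsymbol\gamma)\,\mathbf{z}_i(-\mathbf{x}_i'\boldsymbol\beta),
\end{aligned} \tag{21-34}$$

which implies a difficult log-likelihood to maximize. But if the model is estimated assuming that $\boldsymbol\gamma = \mathbf{0}$, then we can easily test for homoscedasticity. Let

$$\mathbf{w}_i = \begin{bmatrix} \mathbf{x}_i \\ (-\mathbf{x}_i'\hat{\boldsymbol\beta})\mathbf{z}_i \end{bmatrix} \tag{21-35}$$

computed at the maximum likelihood estimator, assuming that $\boldsymbol\gamma = \mathbf{0}$. Then (21-29) or (21-30) can be used as usual for the Lagrange multiplier statistic.

Davidson and MacKinnon carried out a Monte Carlo study to examine the true sizes and power functions of these tests. As might be expected, the test for omitted variables is relatively powerful. The test for heteroscedasticity may well pick up some other form of misspecification, however, including perhaps the simple omission of $\mathbf{z}$ from the index function, so its power may be problematic. It is perhaps not surprising that the same problem arose earlier in our test for heteroscedasticity in the linear regression model.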
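The effect of a variable that appears in both the index and the variance function of the heteroscedastic probit model can be checked numerically. The sketch below assumes hypothetical parameter values and a single variable `w` that is both the second element of `x` and the only element of `z`; it compares the analytic expression $\phi[\cdot]\,[\beta_k - (\mathbf{x}'\boldsymbol\beta)\gamma_k]/\exp(\mathbf{z}'\boldsymbol\gamma)$ with a central finite difference of the probability.

```python
import math

def norm_pdf(z):
    """Standard normal density, phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_one(x, z, beta, gamma):
    """Prob(Y = 1 | x, z) = Phi[x'beta / exp(z'gamma)]."""
    xb = sum(a * b for a, b in zip(x, beta))
    zg = sum(a * g for a, g in zip(z, gamma))
    return norm_cdf(xb / math.exp(zg))

def effect_in_both(x, z, beta, gamma, k_x, k_z):
    """Analytic effect of a variable that is x[k_x] and also z[k_z]."""
    xb = sum(a * b for a, b in zip(x, beta))
    zg = sum(a * g for a, g in zip(z, gamma))
    scale = math.exp(zg)
    return norm_pdf(xb / scale) * (beta[k_x] - xb * gamma[k_z]) / scale

# Hypothetical parameter values, for illustration only.
beta = [0.4, 0.7]    # constant and coefficient on w in the index
gamma = [0.3]        # coefficient on w in the variance function
w = 1.2

analytic = effect_in_both([1.0, w], [w], beta, gamma, 1, 0)

# Central finite difference in w, moving it in x and z together.
h = 1e-6
numeric = (prob_one([1.0, w + h], [w + h], beta, gamma)
           - prob_one([1.0, w - h], [w - h], beta, gamma)) / (2.0 * h)

# Effect if the variance term were ignored (the "simple coefficient" view).
naive = norm_pdf(beta[0] + beta[1] * w) * beta[1]
print(analytic, numeric, naive)
```

With these assumed values the full effect is much smaller than the one implied by the index coefficient alone, which illustrates why the simple coefficient can be misleading.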
Example 21.4 Specification Tests in a Labor Force Participation Model

Using the data described in Example 21.1, we fit a probit model for labor force participation based on the specification

Prob[LFP = 1] = F(constant, age, age², family income, education, kids).

For these data, P = 428/753 = 0.568393. The restricted (all slopes equal zero, free constant term) log-likelihood is 325 × ln(325/753) + 428 × ln(428/753) = -514.8732. The unrestricted log-likelihood for the probit model is -490.84784. The chi-squared statistic is, therefore, 48.05072. The critical value from the chi-squared distribution with 5 degrees of freedom is 11.07, so the joint hypothesis that the coefficients on age, age², family income, education, and kids are all zero is rejected.

Consider the hypothesis that the constant term and the coefficients on age, age², family income, and education are the same whether kids equals one or zero, against the alternative that an altogether different equation applies for the two groups of women, those with kids = 1 and those with kids = 0. To test this hypothesis, we would use a counterpart to the Chow test of Section 7.4 and Example 7.6. The restricted model in this instance would be based on the pooled data set of all 753 observations. The log-likelihood for the pooled model, which has a constant term, age, age², family income, and education, is -496.8663. The log-likelihoods for this model based on the 428 observations with kids = 1 and the 325 observations with kids = 0 are -347.87441 and -141.60501, respectively. The log-likelihood for the unrestricted model with separate coefficient vectors is thus the sum, -489.47942. The chi-squared statistic for testing the five restrictions of the pooled model is twice the difference, LR = 2[-489.47942 - (-496.8663)] = 14.7738. The 95 percent critical value from the chi-squared distribution with 5 degrees of freedom is 11.07, so at this significance level, the hypothesis that the constant terms and the coefficients on age, age², family income, and education are the same is rejected. (The 99 percent critical value is 15.09.)
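The arithmetic in this example can be reproduced directly from the reported counts and log-likelihood values. The sketch below uses only numbers stated in the example and tabled chi-squared critical values; the variable names are illustrative.

```python
import math

n, n_ones = 753, 428
P = n_ones / n

# Restricted log-likelihood (21-28): all slopes zero, free constant term.
lnL0 = n * (P * math.log(P) + (1 - P) * math.log(1 - P))

# Test that all five slopes are zero.
lnL_unrestricted = -490.84784
lr_slopes = 2 * (lnL_unrestricted - lnL0)

# Chow-type test: pooled model against separate models by kids status.
lnL_pooled = -496.8663
lnL_kids1, lnL_kids0 = -347.87441, -141.60501
lr_pooling = 2 * ((lnL_kids1 + lnL_kids0) - lnL_pooled)

# Critical values of chi-squared with 5 degrees of freedom.
CHI2_5_95, CHI2_5_99 = 11.07, 15.09
print(lnL0, lr_slopes, lr_pooling)
print(lr_pooling > CHI2_5_95, lr_pooling > CHI2_5_99)
```

The pooling hypothesis is rejected at the 95 percent level but not at the 99 percent level, in line with the two critical values quoted above.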


    TABLE 21.3  Estimated Coefficients

                              Estimate (Std. Er.)    Marg. Effect          Estimate (St. Er.)    Marg. Effect*
    Constant     $\beta_1$    -4.157 (1.402)                               -6.030 (2.498)
    Age          $\beta_2$     0.185 (0.0660)        -0.0079 (0.0027)       0.264 (0.118)        -0.0088 (0.00251)
    Age^2        $\beta_3$    -0.0024 (0.00077)                            -0.0036 (0.0014)
    Income       $\beta_4$     0.0458 (0.0421)        0.0180 (0.0165)       0.424 (0.222)         0.0552 (0.0240)
    Education    $\beta_5$     0.0982 (0.0230)        0.0385 (0.0090)       0.140 (0.0519)        0.0289 (0.00869)
    Kids         $\beta_6$    -0.449 (0.131)         -0.171 (0.0480)       -0.879 (0.303)        -0.167 (0.0779)
    Kids         $\gamma_1$    0.000                                       -0.141 (0.324)
    Income       $\gamma_2$    0.000                                        0.313 (0.123)
    Log L                     -490.8478                                    -487.6356
    Correct Preds.            0s: 106, 1s: 357                             0s: 115, 1s: 358

    *Marginal effect and estimated standard error include both mean and variance effects.

    Table 21.3 presents estimates of the probit model, now with a correction for heteroscedasticity of the form $\text{Var}[\varepsilon_i] = \exp(\gamma_1\,\text{kids} + \gamma_2\,\text{family income})$. The three tests for homoscedasticity give $LR = 2[-487.6356 - (-490.8478)] = 6.424$, $LM = 2.236$ based on the BHHH estimator, and Wald $= 6.533$ (2 restrictions). The 95 percent critical value for two restrictions is 5.99, so the LM statistic conflicts with the other two.
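As a rough check on the comparison above, the chi-squared distribution with two degrees of freedom has a closed-form tail probability, $\exp(-x/2)$, so critical values and p-values can be computed without statistical tables. The sketch below uses the three statistics reported in the text; the function names are illustrative.

```python
import math

def chi2_sf_2df(x):
    """Upper-tail probability of a chi-squared(2) variate, exp(-x/2)."""
    return math.exp(-x / 2.0)

def chi2_crit_2df(level):
    """Critical value c such that P[chi-squared(2) <= c] = level."""
    return -2.0 * math.log(1.0 - level)

# LR is recomputed from the two log likelihoods; LM and Wald are as reported.
lr = 2.0 * (-487.6356 - (-490.8478))
stats = {"LR": lr, "LM": 2.236, "Wald": 6.533}

crit95 = chi2_crit_2df(0.95)
crit99 = chi2_crit_2df(0.99)

for name, value in stats.items():
    print(name, round(value, 3),
          "p-value:", round(chi2_sf_2df(value), 4),
          "reject at 5 percent:", value > crit95)
```

The LR and Wald statistics exceed the 95 percent critical value while the LM statistic does not, which is the conflict noted in the text.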
    21.4.4.c A Specification Test for Nonnested Models: Testing for the Distribution

    Whether the logit or probit form, or some third alternative, is the best specification for a discrete choice model is a perennial question. Since the distributions are not nested within some higher level model, testing for an answer is always problematic. Building on the logic of the PE test discussed in Section 9.4.3, Silva (2001) has suggested a score test which may be useful in this regard. The statistic is intended for a variety of discrete choice models, but is especially convenient for binary choice models which are based on a common single index formulation: the probability model is $\text{Prob}(y_i = 1 \mid x_i) = F(x_i'\beta)$. Let 1 denote Model 1, based on parameter vector $\beta$, and 2 denote Model 2, with parameter vector $\gamma$, and let Model 1 be the null specification while Model 2 is the alternative. A "super-model" which combines two alternatives would have likelihood function

    $$L_\rho = \frac{\left[(1-\alpha)\,L_1(y \mid X, \beta)^{\rho} + \alpha\,L_2(y \mid X, \gamma)^{\rho}\right]^{1/\rho}}{\int_z \left[(1-\alpha)\,L_1(z \mid X, \beta)^{\rho} + \alpha\,L_2(z \mid X, \gamma)^{\rho}\right]^{1/\rho} dz}.$$

    (Note that integration is used generically here, since y is discrete.) The two mixing parameters are $\rho$ and $\alpha$. Silva derives an LM test in this context for the hypothesis $\alpha = 0$ for any particular value of $\rho$. The case when $\rho = 0$ is of particular interest. As he notes, it is the nonlinear counterpart to the Cox test we examined in Section 8.3.4. For related results, see Pesaran and Pesaran (1993), Davidson and MacKinnon (1984, 1993),


    Orme (1994), and Weeks (1996). For binary choice models, Silva suggests the following procedure (as one of three computational strategies): Compute the parameters of the competing models by maximum likelihood and obtain predicted probabilities for $y_i = 1$, $\hat{P}_{im}$, where $i$ denotes the observation and $m = 1$ or 2 for the two models.[15] The individual observations on the density for the null model, $\hat{f}_{i1}$, are also required. The new variable

    $$z_i(0) = \frac{\hat{P}_{i1}(1 - \hat{P}_{i1})}{\hat{f}_{i1}} \ln\left[\frac{\hat{P}_{i1}(1 - \hat{P}_{i2})}{\hat{P}_{i2}(1 - \hat{P}_{i1})}\right]$$

    is then computed. Finally, Model 1 is reestimated with $z_i(0)$ added as an additional independent variable. A test of the hypothesis that its coefficient is zero is equivalent to a test of the null hypothesis that $\alpha = 0$, which favors Model 1. Rejection of the hypothesis favors Model 2. Silva's preferred procedure is the same as this, based on

    $$z_i(1) = \frac{\hat{P}_{i2} - \hat{P}_{i1}}{\hat{f}_{i1}}.$$
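The constructed regressors in this procedure are simple functions of the fitted values from the two models. The following is a minimal sketch of that one step only, assuming the fitted probabilities and the null-model densities have already been obtained by maximum likelihood; the function name and the numerical inputs are hypothetical.

```python
import numpy as np

def silva_test_variables(p1, p2, f1):
    """Constructed variables z_i(0) and z_i(1) for the nonnested test.

    p1, p2 : fitted probabilities of y_i = 1 under Models 1 and 2
    f1     : fitted density of the null model (Model 1) at each observation
    """
    p1, p2, f1 = (np.asarray(a, dtype=float) for a in (p1, p2, f1))
    z0 = p1 * (1.0 - p1) / f1 * np.log(p1 * (1.0 - p2) / (p2 * (1.0 - p1)))
    z1 = (p2 - p1) / f1
    return z0, z1

# Hypothetical fitted values for three observations.
p1 = np.array([0.50, 0.30, 0.80])
p2 = np.array([0.60, 0.30, 0.75])
f1 = np.array([0.40, 0.35, 0.28])
z0, z1 = silva_test_variables(p1, p2, f1)
print(z0, z1)
```

Either variable would then be appended to the regressors of Model 1 and the model reestimated; that second estimation step is not shown here.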

    As suggested by the citations above, tests of this sort have a long history in this literature. Silva's simulation study for the Cox test ($\rho = 0$) and his score test ($\rho = 1$) suggests that the power of the test is quite erratic.
    21.4.5 MEASURING GOODNESS OF FIT

    There have been many fit measures suggested for QR models.[16] At a minimum, one should report the maximized value of the log-likelihood function, $\ln L$. Since the hypothesis that all the slopes in the model are zero is often interesting, the log likelihood computed with only a constant term, $\ln L_0$ [see (21-28)], should also be reported. An analog to the $R^2$ in a conventional regression is McFadden's (1974) likelihood ratio index,

    $$LRI = 1 - \frac{\ln L}{\ln L_0}.$$
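For illustration, the likelihood ratio index can be evaluated with the two log likelihoods reported for the probit model in Example 21.4. This is a sketch; the function name is illustrative.

```python
def likelihood_ratio_index(lnL, lnL0):
    """McFadden's likelihood ratio index, 1 - lnL / lnL0."""
    return 1.0 - lnL / lnL0

# Log likelihoods reported in Example 21.4 (probit, and constant only).
lri_example = likelihood_ratio_index(-490.84784, -514.8732)

# The two limiting cases: no improvement over the constant-only model,
# and a log likelihood of zero.
lri_no_slopes = likelihood_ratio_index(-514.8732, -514.8732)
lri_limit = likelihood_ratio_index(0.0, -514.8732)

print(round(lri_example, 4), lri_no_slopes, lri_limit)
```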

    This measure has an intuitive appeal in that it is bounded by zero and one. If all the slope coefficients are zero, then it equals zero. There is no way to make LRI equal 1, although one can come close. If $F_i$ is always one when $y$ equals one and zero when $y$ equals zero, then $\ln L$ equals zero (the log of one) and LRI equals one. It has been suggested that this finding is indicative of a "perfect fit" and that LRI increases as the fit of the model improves. To a degree, this point is true (see the analysis in Section 21.6.6). Unfortunately, the values between zero and one have no natural interpretation. If $F(x_i'\beta)$ is a proper pdf, then even with many regressors the model cannot fit perfectly unless $x_i'\beta$ goes to $+\infty$ or $-\infty$. As a practical matter, it does happen. But when it does, it indicates a flaw in the model, not a good fit. If the range of one of the independent variables contains a value, say $x^*$, such that the sign of $(x - x^*)$ predicts $y$ perfectly
    15. His conjecture about the computational burden is probably overstated given that modern software offers a variety of binary choice models essentially in push-button fashion.

    16. See, for example, Cragg and Uhler (1970), Amemiya (1981), Maddala (1983), McFadden (1974), Ben-Akiva and Lerman (1985), Kay and Little (1986), Veall and Zimmermann (1992), Zavoina and McKelvey (1975), Efron (1978), and Cramer (1999). A survey of techniques appears in Windmeijer (1995).


    and vice versa, then the model will become a perfect predictor. This result also holds in general if the sign of $x'\beta$ gives a perfect predictor for some vector $\beta$.[17] For example, one might mistakenly include as a regressor a dummy variable that is identical, or nearly so, to the dependent variable. In this case, the maximization procedure will break down precisely because $x'\beta$ is diverging during the iterations. [See McKenzie (1998) for an application and discussion.] Of course, this situation is not at all what we had in mind for a good fit.

    Other fit measures have been suggested. Ben-Akiva and Lerman (1985) and Kay and Little (1986) suggested a fit measure that is keyed to the prediction rule,

    $$R^2_{BL} = \frac{1}{n} \sum_{i=1}^{n} \left[ y_i \hat{F}_i + (1 - y_i)(1 - \hat{F}_i) \right],$$

    which is the average probability of correct prediction by the prediction rule. The difficulty in this computation is that in unbalanced samples, the less frequent outcome will usually be predicted very badly by the standard procedure, and this measure does not pick that point up. Cramer (1999) has suggested an alternative measure that directly measures this failure,

    $$\lambda = (\text{average } \hat{F} \mid y_i = 1) - (\text{average } \hat{F} \mid y_i = 0) = (\text{average } (1 - \hat{F}) \mid y_i = 0) - (\text{average } (1 - \hat{F}) \mid y_i = 1).$$

    Cramer's measure heavily penalizes the incorrect predictions, and because each proportion is taken within the subsample, it is not unduly influenced by the large proportionate size of the group of more frequent outcomes. Some of the other proposed fit measures are Efron's (1978)

    $$R^2_{Ef} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{p}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$

    Veall and Zimmermann's (1992)

    $$R^2_{VZ} = \left( \frac{\delta - 1}{\delta - LRI} \right) LRI, \qquad \delta = \frac{n}{2 \log L_0},$$

    and Zavoina and McKelvey's (1975)

    $$R^2_{MZ} = \frac{\sum_{i=1}^{n} \left( x_i'\hat{\beta} - \bar{x}'\hat{\beta} \right)^2}{n + \sum_{i=1}^{n} \left( x_i'\hat{\beta} - \bar{x}'\hat{\beta} \right)^2}.$$

    17. See McFadden (1984) and Amemiya (1985). If this condition holds, then gradient methods will find that $\beta$.

    The last of these measures corresponds to the regression variation divided by the total variation in the latent index function model, where the disturbance variance is $\sigma^2 = 1$. The values of several of these statistics are given with the model results in Example 21.4 for illustration. A useful summary of the predictive ability of the model is a $2 \times 2$ table of the hits and misses of a prediction rule such as

    $$\hat{y} = 1 \text{ if } \hat{F} > F^* \text{ and 0 otherwise.} \qquad (21\text{-}36)$$
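Several of the fit measures defined above are simple functions of the outcomes and the fitted probabilities. The sketch below computes the Ben-Akiva and Lerman, Cramer, Efron, and Zavoina and McKelvey measures, assuming the fitted probabilities (and, for the last measure, the fitted index values) are already available; the function name and the small data set are hypothetical.

```python
import numpy as np

def fit_measures(y, F, index=None):
    """Fit measures for a binary choice model.

    y     : observed outcomes (0 or 1)
    F     : fitted probabilities of y = 1
    index : optional fitted index values x_i'b, used for R2_MZ
    """
    y = np.asarray(y, dtype=float)
    F = np.asarray(F, dtype=float)
    out = {
        # Average probability of correct prediction.
        "ben_akiva_lerman": np.mean(y * F + (1.0 - y) * (1.0 - F)),
        # Difference in mean fitted probability between ones and zeros.
        "cramer": F[y == 1].mean() - F[y == 0].mean(),
        # One minus residual variation over total variation in y.
        "efron": 1.0 - np.sum((y - F) ** 2) / np.sum((y - y.mean()) ** 2),
    }
    if index is not None:
        xb = np.asarray(index, dtype=float)
        ss = np.sum((xb - xb.mean()) ** 2)
        # Regression variation over total variation, disturbance variance 1.
        out["zavoina_mckelvey"] = ss / (len(y) + ss)
    return out

# Hypothetical outcomes and fitted probabilities.
example = fit_measures([0, 0, 1, 1], [0.2, 0.4, 0.6, 0.8])
print(example)
```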


    The usual threshold value is 0.5, on the basis that we should predict a one if the model says a one is more likely than a zero. It is important not to place too much emphasis on this measure of goodness of fit, however. Consider, for example, the naive predictor

    $$\hat{y} = 1 \text{ if } P > 0.5 \text{ and 0 otherwise,} \qquad (21\text{-}37)$$

    where $P$ is the simple proportion of ones in the sample. This rule will always predict correctly 100P percent of the observations, which means that the naive model does not have zero fit. In fact, if the proportion of ones in the sample is very high, it is possible to construct examples in which the second model will generate more correct predictions than the first! Once again, this flaw is not in the model; it is a flaw in the fit measure.[18] The important element to bear in mind is that the coefficients of the estimated model are not chosen so as to maximize this (or any other) fit measure, as they are in the linear regression model where $b$ maximizes $R^2$. (The maximum score estimator discussed below addresses this issue directly.)

    Another consideration is that 0.5, although the usual choice, may not be a very good value to use for the threshold. If the sample is unbalanced, that is, has many more ones than zeros, or vice versa, then by this prediction rule it might never predict a one (or zero). To consider an example, suppose that in a sample of 10,000 observations, only 1000 have $Y = 1$. We know that the average predicted probability in the sample will be 0.10. As such, it may require an extreme configuration of regressors even to produce an $F$ of 0.2, to say nothing of 0.5. In such a setting, the prediction rule may fail every time to predict when $Y = 1$. The obvious adjustment is to reduce $F^*$. Of course, this adjustment comes at a cost. If we reduce the threshold $F^*$ so as to predict $y = 1$ more often, then we will increase the number of correct classifications of observations that do have $y = 1$, but we will also increase the number of times that we incorrectly classify as ones observations that have $y = 0$.[19] In general, any prediction rule of the form in (21-36) will make two types of errors: It will incorrectly classify zeros as ones and ones as zeros. In practice, these errors need not be symmetric in the costs that result. For example, in a credit scoring model [see Boyes, Hoffman, and Low (1989)], incorrectly classifying an applicant as a bad risk is not the same as incorrectly classifying a bad risk as a good one. Changing $F^*$ will always reduce the probability of one type of error while increasing the probability of the other. There is no correct answer as to the best value to choose. It depends on the setting and on the criterion function upon which the prediction rule depends.

    The likelihood ratio index and Veall and Zimmermann's modification of it are obviously related to the likelihood ratio statistic for testing the hypothesis that the coefficient vector is zero. Efron's and Cramer's measures listed above are oriented more toward the relationship between the fitted probabilities and the actual values. Efron's and Cramer's statistics are usefully tied to the standard prediction rule $\hat{y} = 1[\hat{F} > 0.5]$. The McKelvey and Zavoina measure is an analog to the regression coefficient of determination, based on the underlying regression $y^* = x'\beta + \varepsilon$. Whether these have a close relationship to any type of fit in the familiar sense is a question that needs to be studied. In some cases,
    18. See Amemiya (1981).

    19. The technique of discriminant analysis is used to build a procedure around this consideration. In this setting, we consider not only the number of correct and incorrect classifications, but the cost of each type of misclassification.


    it appears so. But the maximum likelihood estimator, on which all the fit measures are based, is not chosen so as to maximize a fitting criterion based on prediction of $y$ as it is in the classical regression (which maximizes $R^2$). It is chosen to maximize the joint density of the observed dependent variables. It remains an interesting question for research whether fitting $y$ well or obtaining good parameter estimates is a preferable estimation criterion. Evidently, they need not be the same thing.
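The trade-off between the two types of classification error discussed above can be seen by tabulating hits and misses at several thresholds. The sketch below uses a small hypothetical unbalanced sample in which no fitted probability reaches 0.5; the function name and the data are illustrative only.

```python
import numpy as np

def hits_and_misses(y, F, threshold=0.5):
    """2 x 2 table of a prediction rule: rows are actual (0, 1),
    columns are predicted (0, 1), with y_hat = 1 if F > threshold."""
    y = np.asarray(y, dtype=int)
    y_hat = (np.asarray(F, dtype=float) > threshold).astype(int)
    table = np.zeros((2, 2), dtype=int)
    np.add.at(table, (y, y_hat), 1)  # count each (actual, predicted) pair
    return table

# Hypothetical unbalanced sample: 3 ones among 10 observations.
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
F = np.array([0.05, 0.10, 0.10, 0.15, 0.20, 0.25, 0.35, 0.30, 0.40, 0.45])

for threshold in (0.5, 0.3, 0.2):
    print(threshold, hits_and_misses(y, F, threshold).tolist())
```

With the threshold at 0.5, the rule never predicts a one in this sample. Lowering it picks up the ones at the cost of misclassifying some zeros.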
    Example 21.5 Prediction with a Probit Model

    Tunali (1986) estimated a probit model in a study of migration, subsequent remigration, and earnings for a large sample of observations of male members of households in Turkey. Among his results, he reports the summary shown below for a probit model. The estimated model is highly significant, with a likelihood ratio test of the hypothesis that the coefficients (16 of them) are zero based on a chi-squared value of 69 with 16 degrees of freedom.[20] The model predicts 491 of 690, or 71.2 percent, of the observations correctly, although the likelihood ratio index is only 0.083. A naive model, which always predicts that $y = 0$ because $P < 0.5$, predicts 487 of 690, or 70.6 percent, of the observations correctly. This result is hardly suggestive of no fit. The maximum likelihood estimator produces several significant influences on the probability but makes only four more correct predictions than the naive predictor.[21]

                        Predicted
    Actual      D = 0    D = 1    Total
    D = 0        471       16      487
    D = 1        183       20      203
    Total        654       36      690
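The percentages quoted in this example follow directly from the table of actual and predicted outcomes. A short sketch of the arithmetic, with illustrative variable names:

```python
import numpy as np

# Rows: actual D = 0, D = 1; columns: predicted D = 0, D = 1.
table = np.array([[471, 16],
                  [183, 20]])

n = int(table.sum())
correct = int(np.trace(table))                  # hits on the diagonal
naive_correct = int(table.sum(axis=1).max())    # always predict the majority outcome
share_of_ones_hit = table[1, 1] / table[1].sum()

print(correct, n, round(100 * correct / n, 1))
print(naive_correct, round(100 * naive_correct / n, 1))
print(round(100 * share_of_ones_hit, 1))
```

The last figure is the share of the observations with D = 1 that the model predicts correctly, which is roughly 10 percent.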

    21.4.6 ANALYSIS OF PROPORTIONS DATA

    Data for the analysis of binary responses will be in one of two forms. The data we have considered thus far are individual; each observation consists of $[y_i, x_i]$, the actual response of an individual and associated regressor vector. Grouped data usually consist of counts or proportions. Grouped data are obtained by observing the response of $n_i$ individuals, all of whom have the same $x_i$. The observed dependent variable will consist of the proportion $P_i$ of the $n_i$ individuals $ij$ who respond with $y_{ij} = 1$. An observation is thus $[n_i, P_i, x_i]$, $i = 1, \ldots, N$. Election data are typical.[22] In the grouped data setting, it is possible to use regression methods as well as maximum likelihood procedures to analyze the relationship between $P_i$ and $x_i$. The observed $P_i$ is an estimate of the population quantity, $\pi_i = F(x_i'\beta)$. If we treat this problem as a simple one of sampling from a Bernoulli population, then, from basic statistics, we have

    $$P_i = F(x_i'\beta) + \varepsilon_i = \pi_i + \varepsilon_i,$$
    20. This view actually understates slightly the significance of his model, because the preceding predictions are based on a bivariate model. The likelihood ratio test fails to reject the hypothesis that a univariate model applies, however.

    21. It is also noteworthy that nearly all the correct predictions of the maximum likelihood estimator are the zeros. It hits only 10 percent of the ones in the sample.

    22. The earliest work on probit modeling involved applications of grouped data in laboratory experiments. Each observation consisted of $n_i$ subjects receiving dosage $x_i$ of some treatment, such as an insecticide, and a proportion $P_i$ "responding" to the treatment, usually by dying. Finney (1971) and Cox (1970) are useful and early surveys of this literature.


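    where

    $$E[\varepsilon_i] = 0, \qquad \text{Var}[\varepsilon_i] = \frac{\pi_i (1 - \pi_i)}{n_i}. \qquad (21\text{-}38)$$

    This heteroscedastic regression format suggests that the parameters could be estimated by a nonlinear weighted least squares regression. But there is a simpler way to proceed. Since the function $F(x_i'\beta)$ is strictly monotonic, it has an inverse. (See Figure 21.1.) Consider, then, a Taylor series approximation to this function around the point $\varepsilon_i = 0$, that is, around the point $P_i = \pi_i$,

    $$F^{-1}(P_i) = F^{-1}(\pi_i + \varepsilon_i) \approx F^{-1}(\pi_i) + \left[ \frac{dF^{-1}(\pi_i)}{d\pi_i} \right] (P_i - \pi_i).$$

    But $F^{-1}(\pi_i) = x_i'\beta$ and

    $$\frac{dF^{-1}(\pi_i)}{d\pi_i} = \frac{1}{F'(F^{-1}(\pi_i))} = \frac{1}{f(\pi_i)},$$

    so

    $$F^{-1}(P_i) \approx x_i'\beta + \frac{\varepsilon_i}{f(\pi_i)}.$$

    This equation produces a heteroscedastic linear regression,

    $$F^{-1}(P_i) = z_i = x_i'\beta + u_i,$$

    where

    $$E[u_i \mid x_i] = 0 \quad \text{and} \quad \text{Var}[u_i \mid x_i] = \frac{F(\pi_i)[1 - F(\pi_i)]}{n_i [f(\pi_i)]^2}. \qquad (21\text{-}39)$$

    The inverse function for the logistic model is particularly easy to obtain. If

    $$\pi_i = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)},$$

    then

    $$\ln\left( \frac{\pi_i}{1 - \pi_i} \right) = x_i'\beta.$$

    This function is called the logit of $\pi_i$, hence the name "logit" model. For the normal distribution, the inverse function $\Phi^{-1}(\pi_i)$, called the normit of $\pi_i$, must be approximated. The usual approach is a ratio of polynomials.[23] Weighted least squares regression based on (21-39) produces the minimum chi-squared estimator (MCSE) of $\beta$. Since the weights are functions of the unknown parameters, a two-step procedure is called for. As always, simple least squares at the first step produces consistent but inefficient estimates. Then the estimated variances

    $$\hat{w}_i = \frac{\hat{\Phi}_i (1 - \hat{\Phi}_i)}{n_i \hat{\phi}_i^2}$$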

    23. See Abramowitz and Stegun (1971) and Section E.5.2. The function normit + 5 is called the probit of $P_i$. The term dates from the early days of this analysis, when the avoidance of negative numbers was a simplification with considerable payoff.


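    for the probit model or

    $$\hat{w}_i = \frac{1}{n_i \hat{\Lambda}_i (1 - \hat{\Lambda}_i)}$$

    for the logit model, based on the first-step estimates, can be used for weighted least squares.[24] An iteration can then be set up,

    $$\hat{\beta}^{(k+1)} = \left[ \sum_{i=1}^{n} \frac{1}{\hat{w}_i^{(k)}} x_i x_i' \right]^{-1} \left[ \sum_{i=1}^{n} \frac{1}{\hat{w}_i^{(k)}} x_i F^{-1}(P_i) \right],$$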

    where "(k)" indicates the kth iteration and "^" indicates computation of the quantity at the current (kth) estimate of $\beta$. The MCSE has the same asymptotic properties as the maximum likelihood estimator at every step after the first, so, in fact, iteration is not necessary. Although they have the same probability limit, the MCSE is not algebraically the same as the MLE, and in a finite sample, they will differ numerically.

    The log-likelihood function for a binary choice model with grouped data is

    $$\ln L = \sum_{i=1}^{n} n_i \left\{ P_i \ln F(x_i'\beta) + (1 - P_i) \ln[1 - F(x_i'\beta)] \right\}.$$

    The likelihood equation that defines the maximum likelihood estimator is

    $$\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{n} n_i \left[ P_i \frac{f(x_i'\beta)}{F(x_i'\beta)} - (1 - P_i) \frac{f(x_i'\beta)}{1 - F(x_i'\beta)} \right] x_i = 0.$$

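    This equation closely resembles the solution for the individual data case, which makes sense if we view the grouped observation as $n_i$ replications of an individual observation. On the other hand, it is clear on inspection that the solution to this set of equations will not be the same as the generalized (weighted) least squares estimator suggested in the previous paragraph. For convenience, define $F_i = F(x_i'\beta)$, $f_i = f(x_i'\beta)$, and $f_i' = [f'(z) \mid z = x_i'\beta] = [df(z)/dz \mid z = x_i'\beta]$. The Hessian of the log likelihood is

    $$\frac{\partial^2 \ln L}{\partial \beta\, \partial \beta'} = \sum_{i=1}^{n} n_i \left\{ P_i \left[ \frac{f_i'}{F_i} - \left( \frac{f_i}{F_i} \right)^2 \right] - (1 - P_i) \left[ \frac{f_i'}{1 - F_i} + \left( \frac{f_i}{1 - F_i} \right)^2 \right] \right\} x_i x_i'.$$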

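    To evaluate the expectation of the Hessian, we need only insert the expectation of the only stochastic element, $P_i$, which is $E[P_i \mid x_i] = F_i$. Then

    $$E\left[ \frac{\partial^2 \log L}{\partial \beta\, \partial \beta'} \right] = \sum_{i=1}^{n} n_i \left[ f_i' - \frac{f_i^2}{F_i} - f_i' - \frac{f_i^2}{1 - F_i} \right] x_i x_i' = -\sum_{i=1}^{n} \left[ \frac{n_i f_i^2}{F_i (1 - F_i)} \right] x_i x_i'.$$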

    The asymptotic covariance matrix for the maximum likelihood estimator is the negative inverse of this matrix. From (21-39), we see that it is exactly equal to

    $$\text{Asy. Var}[\text{minimum } \chi^2 \text{ estimator}] = [X'\Omega^{-1}X]^{-1}$$

    24. Simply using $p_i$ and $f[F^{-1}(P_i)]$ might seem to be a simple expedient in computing the weights. But this method would be analogous to using $y_i^2$ instead of an estimate of $\sigma_i^2$ in a heteroscedastic regression. Fitted probabilities and, for the probit model, densities should be based on a consistent estimator of the parameters.


    since the diagonal elements of $\Omega^{-1}$ are precisely the values in brackets in the expression for the expected Hessian above. We conclude that although the MCSE and the MLE for this model are numerically different, they have the same asymptotic properties, consistent and asymptotically normal (the MCS estimator by virtue of the results of Chapter 10, the MLE by those in Chapter 17), and with asymptotic covariance matrix as previously given.

    There is a complication in using the MCS estimator. The FGLS estimator breaks down if any of the sample proportions equals one or zero. A number of ad hoc patches have been suggested; the one that seems to be most widely used is to add or subtract a small constant, say 0.001, to or from the observed proportion when it is zero or one. The familiar results in (21-38) also suggest that when the proportion is based on a large population, the variance of the estimator can be exceedingly low. This issue will resurface in surprisingly low standard errors and high t ratios in the weighted regression. Unfortunately, that is a consequence of the model.[25] The same result will emerge in maximum likelihood estimation with grouped data.
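The two estimators compared in this section can be sketched for the logit case, where $F^{-1}(P_i)$ is the log odds, the weights are $1/\hat{w}_i = n_i \hat{\Lambda}_i(1 - \hat{\Lambda}_i)$, and the score reduces to $\sum_i n_i (P_i - \Lambda_i) x_i$. The code below is an illustration under stated assumptions, not a definitive implementation: the function names, the simulated design, and the use of clipping as the "small constant" patch for proportions of zero or one are all choices made here.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def mcse_logit(P, n, X, patch=0.001):
    """Two-step minimum chi-squared (weighted least squares) estimator."""
    P = np.clip(P, patch, 1.0 - patch)           # ad hoc patch for P = 0 or 1
    z = np.log(P / (1.0 - P))                    # logit of the observed proportion
    b = np.linalg.lstsq(X, z, rcond=None)[0]     # step 1: ordinary least squares
    lam = logistic(X @ b)
    inv_w = n * lam * (1.0 - lam)                # 1 / w_i at the first-step estimates
    XtW = X.T * inv_w
    return np.linalg.solve(XtW @ X, XtW @ z)     # step 2: weighted least squares

def mle_logit(P, n, X, tol=1e-10, max_iter=100):
    """Maximum likelihood by scoring, using the expected Hessian."""
    b = np.zeros(X.shape[1])
    for _ in range(max_iter):
        lam = logistic(X @ b)
        score = X.T @ (n * (P - lam))
        info = (X.T * (n * lam * (1.0 - lam))) @ X   # negative expected Hessian
        step = np.linalg.solve(info, score)
        b = b + step
        if np.max(np.abs(step)) < tol:
            break
    lam = logistic(X @ b)
    info = (X.T * (n * lam * (1.0 - lam))) @ X
    return b, np.linalg.inv(info)                # estimate, asymptotic covariance

# Simulated grouped data (hypothetical design and parameter values).
rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 25)
X = np.column_stack([np.ones_like(x), x])
beta_true = np.array([0.5, -1.0])
n = np.full(25, 200)
P = rng.binomial(n, logistic(X @ beta_true)) / n

b_mcse = mcse_logit(P, n, X)
b_mle, cov_mle = mle_logit(P, n, X)
print(b_mcse, b_mle, np.sqrt(np.diag(cov_mle)))
```

In a run of this kind the two sets of estimates should be close but not identical, in line with the point that the estimators share asymptotic properties while differing numerically in a finite sample.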

    21.5 EXTENSIONS OF THE BINARY CHOICE MODEL

    Qualitative response models have been a growth industry in econometrics. The recent literature, particularly in the area of panel data analysis, has produced a number of new techniques.
    21.5.1 RANDOM AND FIXED EFFECTS MODELS FOR PANEL DATA

    The availability of high quality panel data sets on microeconomic behavior has maintained an interest in extending the models of Chapter 13 to binary (and other discrete choice) models. In this section, we will survey a few results from this rapidly growing literature. The structural model for a possibly unbalanced panel of data would be written

    $$y_{it}^* = x_{it}'\beta + \varepsilon_{it}, \quad i = 1, \ldots, n, \; t = 1, \ldots, T_i,$$
    $$y_{it} = 1 \text{ if } y_{it}^* > 0, \text{ and 0 otherwise.}$$

    The second line of this definition is often written $y_{it} = \mathbf{1}(x_{it}'\beta + \varepsilon_{it} > 0)$ to indicate a variable which equals one when the condition in parentheses is true and zero when it is not. Ideally, we would like to specify that $\varepsilon_{it}$ and $\varepsilon_{is}$ are freely correlated within a group, but uncorrelated across groups. But doing so will involve computing
    25. Whether the proportion should, in fact, be considered as a single observation from a distribution of proportions is a question that arises in all these cases. It is unambiguous in the bioassay cases noted earlier. But the issue is less clear with election data, especially since in these cases, the $n_i$ will represent most of, if not all, the potential respondents in location $i$ rather than a random sample of respondents.


    joint probabilities from a $T_i$ variate distribution, which is generally problematic.[26] (We will return to this issue below.) A more promising approach is an effects model,

    $$y_{it}^* = x_{it}'\beta + v_{it} + u_i, \quad i = 1, \ldots, n, \; t = 1, \ldots, T_i,$$
    $$y_{it} = 1 \text{ if } y_{it}^* > 0, \text{ and 0 otherwise,}$$

    where, as before (see Section 13.4), $u_i$ is the unobserved, individual specific heterogeneity. Once again, we distinguish between "random" and "fixed" effects models by the relationship between $u_i$ and $x_{it}$. The assumption that $u_i$ is unrelated to $x_{it}$, so that the conditional distribution $f(u_i \mid x_{it})$ is not dependent on $x_{it}$, produces the random effects model. Note that this places a restriction on the distribution of the heterogeneity. If that distribution is unrestricted, so that $u_i$ and $x_{it}$ may be correlated, then we have what is called the fixed effects model. The distinction does not relate to any intrinsic characteristic of the effect itself.

    As we shall see shortly, this is a modeling framework that is fraught with difficulties and unconventional estimation problems. Among them are: estimation of the random effects model requires very strong assumptions about the heterogeneity; the fixed effects model encounters an incidental parameters problem that renders the maximum likelihood estimator inconsistent. We begin with the random effects specification, then consider fixed effects and some semiparametric approaches that do not require the distinction. We conclude with a brief look at dynamic models of state dependence.[27]
    21.5.1.a Random Effects Models

    A specification which has the same structure as the random effects model of Section 13.4 has been implemented by Butler and Moffitt (1982). We will sketch the derivation to suggest how random effects can be handled in discrete and limited dependent variable models such as this one. Full details on estimation and inference may be found in Butler and Moffitt (1982) and Greene (1995a). We will then examine some extensions of the Butler and Moffitt model. The random effects model specifies

    $$\varepsilon_{it} = v_{it} + u_i,$$

    where $v_{it}$ and $u_i$ are independent random variables with

    $$E[v_{it} \mid X] = 0; \quad \text{Cov}[v_{it}, v_{js} \mid X] = \text{Var}[v_{it} \mid X] = 1 \text{ if } i = j \text{ and } t = s; \text{ 0 otherwise,}$$
    $$E[u_i \mid X] = 0; \quad \text{Cov}[u_i, u_j \mid X] = \text{Var}[u_i \mid X] = \sigma_u^2 \text{ if } i = j; \text{ 0 otherwise,}$$
    $$\text{Cov}[v_{it}, u_j \mid X] = 0 \text{ for all } i, t, j,$$

    26. A "limited information" approach based on the GMM estimation method has been suggested by Avery, Hansen, and Hotz (1983). With recent advances in simulation-based computation of multinormal integrals (see Section E.5.6), some work on such a panel data estimator has appeared in the literature. See, for example, Geweke, Keane, and Runkle (1994, 1997). The GEE estimator of Diggle, Liang, and Zeger (1994) [see also Liang and Zeger (1980) and Stata (2001)] seems to be another possibility. However, in all these cases, it must be remembered that the procedure specifies estimation of a correlation matrix for a $T_i$ vector of unobserved variables based on a dependent variable which takes only two values. We should not be too optimistic about this if $T_i$ is even moderately large.

    27. A survey of some of these results is given by Hsiao (1996). Most of Hsiao (1996) is devoted to the linear regression model. A number of studies specifically focused on discrete choice models and panel data have appeared recently, including Beck, Epstein, Jackman and O'Halloran (2001), Arellano (2001), and Greene (2001).


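    and $X$ indicates all the exogenous data in the sample, $x_{it}$ for all $i$ and $t$.[28] Then,

    $$E[\varepsilon_{it} \mid X] = 0,$$
    $$\text{Var}[\varepsilon_{it} \mid X] = \sigma_v^2 + \sigma_u^2 = 1 + \sigma_u^2,$$

    and

    $$\text{Corr}[\varepsilon_{it}, \varepsilon_{is} \mid X] = \rho = \frac{\sigma_u^2}{1 + \sigma_u^2}.$$

    The new free parameter is $\sigma_u^2 = \rho / (1 - \rho)$. Recall that in the cross-section case, the probability associated with an observation is

    $$P(y_i \mid x_i) = \int_{L_i}^{U_i} f(\varepsilon_i)\, d\varepsilon_i, \quad (L_i, U_i) = (-\infty, -x_i'\beta) \text{ if } y_i = 0 \text{ and } (-x_i'\beta, +\infty) \text{ if } y_i = 1.$$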

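    This simplifies to $\Phi[(2y_i - 1) x_i'\beta]$ for the normal distribution and $\Lambda[(2y_i - 1) x_i'\beta]$ for the logit model. In the fully general case with an unrestricted covariance matrix, the contribution of group $i$ to the likelihood would be the joint probability for all $T_i$ observations;

    $$L_i = P(y_{i1}, \ldots, y_{iT_i} \mid X) = \int_{L_{iT_i}}^{U_{iT_i}} \cdots \int_{L_{i1}}^{U_{i1}} f(\varepsilon_{i1}, \varepsilon_{i2}, \ldots, \varepsilon_{iT_i})\, d\varepsilon_{i1}\, d\varepsilon_{i2} \cdots d\varepsilon_{iT_i}. \qquad (21\text{-}40)$$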

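    The integration of the joint density, as it stands, is impractical in most cases. The special nature of the random effects model allows a simplification, however. We can obtain the joint density of the $v_{it}$'s by integrating $u_i$ out of the joint density of $(\varepsilon_{i1}, \ldots, \varepsilon_{iT_i}, u_i)$, which is

    $$f(\varepsilon_{i1}, \ldots, \varepsilon_{iT_i}, u_i) = f(\varepsilon_{i1}, \ldots, \varepsilon_{iT_i} \mid u_i) f(u_i).$$

    So,

    $$f(\varepsilon_{i1}, \varepsilon_{i2}, \ldots, \varepsilon_{iT_i}) = \int_{-\infty}^{+\infty} f(\varepsilon_{i1}, \varepsilon_{i2}, \ldots, \varepsilon_{iT_i} \mid u_i) f(u_i)\, du_i.$$

    The advantage of this form is that conditioned on $u_i$, the $\varepsilon_{it}$'s are independent, so

    $$f(\varepsilon_{i1}, \varepsilon_{i2}, \ldots, \varepsilon_{iT_i}) = \int_{-\infty}^{+\infty} \prod_{t=1}^{T_i} f(\varepsilon_{it} \mid u_i) f(u_i)\, du_i.$$

    Inserting this result in (21-40) produces

    $$L_i = P[y_{i1}, \ldots, y_{iT_i} \mid X] = \int_{L_{iT_i}}^{U_{iT_i}} \cdots \int_{L_{i1}}^{U_{i1}} \int_{-\infty}^{+\infty} \prod_{t=1}^{T_i} f(\varepsilon_{it} \mid u_i) f(u_i)\, du_i\, d\varepsilon_{i1}\, d\varepsilon_{i2} \cdots d\varepsilon_{iT_i}.$$

    This may not look like much simplification, but in fact, it is. Since the ranges of integration are independent, we may change the order of integration;

    $$L_i = P[y_{i1}, \ldots, y_{iT_i} \mid X] = \int_{-\infty}^{+\infty} \left[ \int_{L_{iT_i}}^{U_{iT_i}} \cdots \int_{L_{i1}}^{U_{i1}} \prod_{t=1}^{T_i} f(\varepsilon_{it} \mid u_i)\, d\varepsilon_{i1}\, d\varepsilon_{i2} \cdots d\varepsilon_{iT_i} \right] f(u_i)\, du_i.$$

    28. See Wooldridge (1999) for discussion of this assumption.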


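    Conditioned on the common $u_i$, the $\varepsilon$'s are independent, so the term in square brackets is just the product of the individual probabilities. We can write this as

    $$L_i = P[y_{i1}, \ldots, y_{iT_i} \mid X] = \int_{-\infty}^{+\infty} \left[ \prod_{t=1}^{T_i} \left( \int_{L_{it}}^{U_{it}} f(\varepsilon_{it} \mid u_i)\, d\varepsilon_{it} \right) \right] f(u_i)\, du_i.$$

    Now, consider the individual densities in the product. Conditioned on $u_i$, these are the now familiar probabilities for the individual observations, computed now at $x_{it}'\beta + u_i$. This produces a general model for random effects for the binary choice model. Collecting all the terms, we have reduced it to

    $$L_i = P[y_{i1}, \ldots, y_{iT_i} \mid X] = \int_{-\infty}^{+\infty} \left[ \prod_{t=1}^{T_i} \text{Prob}(Y_{it} = y_{it} \mid x_{it}'\beta + u_i) \right] f(u_i)\, du_i.$$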

    It remains to specify the distributions, but the important result thus far is that the entire computation requires only one dimensional integration. The inner probabilities may be any of the models we have considered so far, such as probit, logit, Weibull, and so on. The intricate part remaining is to determine how to do the outer integration. Butler and Moffitt's method, assuming that $u_i$ is normally distributed, is fairly straightforward, so we will consider it first. We will then consider some other possibilities. For the probit model, the individual probabilities inside the product would be $\Phi[q_{it}(x_{it}'\beta + u_i)]$, where $\Phi[\cdot]$ is the standard normal CDF and $q_{it} = 2y_{it} - 1$. For the logit model, $\Phi[\cdot]$ would be replaced with the logistic probability, $\Lambda[\cdot]$. For the present, treat the entire function as a function of $u_i$, $g(u_i)$. The integral is, then,

    $$L_i = \int_{-\infty}^{+\infty} \frac{1}{\sigma_u \sqrt{2\pi}} e^{-u_i^2 / (2\sigma_u^2)} g(u_i)\, du_i.$$

    Let $r_i = u_i / (\sigma_u \sqrt{2})$. Then, $u_i = (\sigma_u \sqrt{2}) r_i = \theta r_i$ and $du_i = \theta\, dr_i$. Making the change of variable produces

    $$L_i = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{+\infty} e^{-r_i^2} g(\theta r_i)\, dr_i.$$

    (Several constants cancel out of the fractions.) Returning to our probit (or logit) model, we now have

    $$L_i = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{+\infty} e^{-r_i^2} \left\{ \prod_{t=1}^{T_i} \Phi[q_{it}(x_{it}'\beta + \theta r_i)] \right\} dr_i.$$

    The payoff to all this manipulation is that this likelihood function involves only one-dimensional integrals. The inner integrals are the CDF of the standard normal distribution or the logistic or extreme value distributions, which are simple to obtain. The function is amenable to Gauss-Hermite quadrature for computation. (Gauss-Hermite quadrature is discussed in Section E.5.4.) Assembling all the pieces, we obtain the approximation to the log likelihood;

    $$\ln L_H = \sum_{i=1}^{n} \ln \left\{ \frac{1}{\sqrt{\pi}} \sum_{h=1}^{H} w_h \prod_{t=1}^{T_i} \Phi[q_{it}(x_{it}'\beta + \theta z_h)] \right\}$$


    where $H$ is the number of points for the quadrature, and $w_h$ and $z_h$ are the weights and nodes for the quadrature. Maximizing this function remains a complex problem. But it is made quite feasible by the transformations which reduce the integration to one dimension. This technique for the probit model has been incorporated in most contemporary econometric software and can be easily extended to other models. The first and second derivatives are likewise complex but still computable by quadrature. An estimate of $\sigma_u$ is obtained from the result $\sigma_u = \theta / \sqrt{2}$, and a standard error can be obtained by dividing that for $\hat{\theta}$ by $\sqrt{2}$. The model may be adapted to the logit or any other formulation just by changing the CDF in the preceding equation from $\Phi[\cdot]$ to the logistic CDF, $\Lambda[\cdot]$, or the other appropriate CDF.

    The hypothesis of no cross-period correlation can be tested, in principle, using any of the three classical testing procedures we have discussed to examine the statistical significance of the estimated $\sigma_u$. A number of authors have found the Butler and Moffitt formulation to be a satisfactory compromise between a fully unrestricted model and the cross-sectional variant that ignores the correlation altogether. A recent application that includes both group and time effects is Tauchen, Witte, and Griesinger's (1994) study of arrests and criminal behavior. The Butler and Moffitt approach has been criticized for the restriction of equal correlation across periods. But it does have a compelling virtue that the model can be efficiently estimated even with fairly large $T_i$ using conventional computational methods. [See Greene (1995a, pp. 425-431).]

    A remaining problem with the Butler and Moffitt specification is its assumption of normality. In general, other distributions are problematic because of the difficulty of finding either a closed form for the integral or a satisfactory method of approximating the integral. An alternative approach which allows some flexibility is the method of maximum simulated likelihood (MSL), which was discussed in Section 17.8. The transformed likelihood we derived above is an expectation;

    $$L_i = \int_{-\infty}^{+\infty} \left[ \prod_{t=1}^{T_i} \text{Prob}(Y_{it} = y_{it} \mid x_{it}'\beta + u_i) \right] f(u_i)\, du_i = E_{u_i} \left[ \prod_{t=1}^{T_i} \text{Prob}(Y_{it} = y_{it} \mid x_{it}'\beta + u_i) \right].$$

This expectation can be approximated by simulation rather than quadrature. First, let $\theta$ now denote the scale parameter in the distribution of $u_i$. This would be $\sigma_u$ for a normal distribution, for example, or some other scaling for the logistic or uniform distribution. Then, write the term in the likelihood function as

$$L_i = E_{u_i}\left[ \prod_{t=1}^{T_i} F(y_{it}, x_{it}'\beta + \theta u_i) \right] = E_{u_i}[h(u_i)].$$

The function is smooth, continuous, and continuously differentiable. If this expectation is finite, then the conditions of the law of large numbers should apply, which would mean that for a sample of observations $u_{i1}, \ldots, u_{iR}$,

$$\text{plim}\ \frac{1}{R} \sum_{r=1}^{R} h(u_{ir}) = E_u[h(u_i)].$$
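The law of large numbers argument can be illustrated numerically. The following is a minimal sketch, assuming a probit form for $F$, normally distributed $u_i$ (so the scale parameter is $\sigma_u$), and made-up values of $x_{it}'\beta$; the function names and the trapezoid-rule benchmark are illustrative choices, not part of the text.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: probit CDF and density


def h(u, y, xb, sigma_u):
    """Product over t of Phi[q_t * (x_t'beta + sigma_u * u)], with q_t = 2*y_t - 1."""
    prod = 1.0
    for y_t, xb_t in zip(y, xb):
        q_t = 2 * y_t - 1
        prod *= STD_NORMAL.cdf(q_t * (xb_t + sigma_u * u))
    return prod


def simulated_li(y, xb, sigma_u, n_draws, seed=0):
    """Average of h(u_r) over n_draws standard normal draws."""
    rng = random.Random(seed)
    total = sum(h(rng.gauss(0.0, 1.0), y, xb, sigma_u) for _ in range(n_draws))
    return total / n_draws


def integrated_li(y, xb, sigma_u, lo=-8.0, hi=8.0, n_steps=4000):
    """Trapezoid-rule value of the integral of h(u) * phi(u) du, used as a benchmark."""
    step = (hi - lo) / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        u = lo + k * step
        weight = 0.5 if k in (0, n_steps) else 1.0
        total += weight * h(u, y, xb, sigma_u) * STD_NORMAL.pdf(u)
    return total * step


# Hypothetical group with T_i = 4 observations.
y_i = [1, 0, 1, 1]
xb_i = [0.3, -0.2, 0.5, 0.1]  # illustrative values of x_it'beta
li_sim = simulated_li(y_i, xb_i, sigma_u=0.8, n_draws=20000)
li_int = integrated_li(y_i, xb_i, sigma_u=0.8)
print(li_sim, li_int)
```

With enough draws the simulated average and the numerically integrated value should agree to a couple of decimal places; the gap shrinks at the usual $1/\sqrt{R}$ rate.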


This law of large numbers result suggests, based on the results in Chapter 17, an alternative method of maximizing the log-likelihood for the random effects model. A sample of person-specific draws from the population $u_i$ can be generated with a random number generator. For the Butler and Moffitt model with normally distributed $u_i$, the simulated log-likelihood function is

$$\ln L_{\text{Simulated}} = \sum_{i=1}^{n} \ln \left\{ \frac{1}{R} \sum_{r=1}^{R} \left[ \prod_{t=1}^{T_i} F\left[q_{it}(x_{it}'\beta + \sigma_u u_{ir})\right] \right] \right\}.$$
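The simulated log-likelihood can be written down almost verbatim. The sketch below assumes a probit $F$, a single regressor (so $\beta$ is a scalar), and a tiny hypothetical data set; holding the draws fixed across function evaluations is a common practice assumed here so that the objective stays smooth in the parameters.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # probit CDF; a logistic CDF could be swapped in for a logit


def simulated_loglik(beta, sigma_u, groups, draws):
    """Sum over i of ln{(1/R) sum_r prod_t F[q_it * (x_it*beta + sigma_u*u_ir)]}.

    groups: list of (y_list, x_list) pairs, one per individual (scalar regressor).
    draws:  draws[i][r] = u_ir, generated once and reused at every evaluation.
    """
    total = 0.0
    for (y, x), u_i in zip(groups, draws):
        avg = 0.0
        for u in u_i:
            prod = 1.0
            for y_t, x_t in zip(y, x):
                q_t = 2 * y_t - 1
                prod *= PHI(q_t * (x_t * beta + sigma_u * u))
            avg += prod
        total += math.log(avg / len(u_i))
    return total


# Hypothetical data: 3 individuals, T_i = 3, one regressor.
groups = [([1, 1, 0], [0.5, 1.2, -0.3]),
          ([0, 0, 0], [-1.0, 0.2, -0.4]),
          ([1, 0, 1], [0.9, -0.8, 1.5])]
rng = random.Random(12345)
R = 200
draws = [[rng.gauss(0.0, 1.0) for _ in range(R)] for _ in groups]

ll_re = simulated_loglik(beta=0.7, sigma_u=0.5, groups=groups, draws=draws)
ll_pooled = simulated_loglik(beta=0.7, sigma_u=0.0, groups=groups, draws=draws)
print(ll_re, ll_pooled)
```

Setting `sigma_u=0.0` removes the heterogeneity, so `ll_pooled` collapses to the ordinary pooled probit log-likelihood; that makes a convenient sanity check for an implementation.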



This function is maximized with respect to $\beta$ and $\sigma_u$. Note that in the preceding, as in the quadrature-approximated log-likelihood, the model can be based on a probit, logit, or any other functional form desired. There is an additional degree of flexibility in this approach. The Hermite quadrature approach is essentially limited by its functional form to the normal distribution. But in the simulation approach, $u_{ir}$ can come from some other distribution. For example, it might be believed that the dispersion of the heterogeneity is greater than implied by a normal distribution. The logistic distribution might be preferable. A random sample from the logistic distribution can be created by sampling $(w_{i1}, \ldots, w_{iR})$ from the standard uniform $[0, 1]$ distribution; then $u_{ir} = \ln[w_{ir}/(1 - w_{ir})]$. Other distributions, such as the uniform itself, are also possible.

We have examined two approaches to estimation of a probit model with random effects. GMM estimation is another possibility. Avery, Hansen, and Hotz (1983), Bertschek and Lechner (1998), and Inkmann (2000) examine this approach; the latter two offer some comparison with the quadrature and simulation-based estimators considered here. (Our applications in Examples 16.5, 17.10, and 21.6 use the Bertschek and Lechner data.)

The preceding opens another possibility. The random effects model can be cast as a model with a random constant term:

$$y_{it}^* = \alpha_i + x_{(1),it}'\beta_{(1)} + \varepsilon_{it}, \quad i = 1, \ldots, n,\ t = 1, \ldots, T_i,$$
$$y_{it} = 1 \text{ if } y_{it}^* > 0, \text{ and } 0 \text{ otherwise},$$

where $\alpha_i = \alpha + \sigma_u u_i$. This is simply a reinterpretation of the model we just analyzed. We might, however, now extend this formulation to the full parameter vector. The resulting structure is

$$y_{it}^* = x_{it}'\beta_i + \varepsilon_{it}, \quad i = 1, \ldots, n,\ t = 1, \ldots, T_i,$$
$$y_{it} = 1 \text{ if } y_{it}^* > 0, \text{ and } 0 \text{ otherwise},$$

where $\beta_i = \beta + \Gamma u_i$ and $\Gamma$ is a nonnegative definite diagonal matrix; some of its diagonal elements could be zero for nonrandom parameters. The method of estimation is essentially the same as before. The simulated log-likelihood is now

$$\ln L_{\text{Simulated}} = \sum_{i=1}^{n} \ln \left\{ \frac{1}{R} \sum_{r=1}^{R} \left[ \prod_{t=1}^{T_i} F\left[q_{it}\, x_{it}'(\beta + \Gamma u_{ir})\right] \right] \right\}.$$

The simulation now involves $R$ draws from the multivariate distribution of $u$. Since the draws are uncorrelated ($\Gamma$ is diagonal), this is essentially the same estimation problem as the random effects model considered previously. This model is estimated in Example 17.10. Example 16.5 presents a similar model that assumes that the distribution of $\beta_i$ is discrete rather than continuous.
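The logistic alternative for the heterogeneity described earlier in this section only changes how the draws $u_{ir}$ are generated. As a small sketch (the function name and sample size are illustrative assumptions), inverse-CDF sampling from the standard logistic distribution looks like this:

```python
import math
import random


def logistic_draws(n_draws, seed=0):
    """Inverse-CDF sampling: w ~ U(0, 1), then u = ln[w / (1 - w)]."""
    rng = random.Random(seed)
    draws = []
    while len(draws) < n_draws:
        w = rng.random()
        if 0.0 < w < 1.0:  # skip w == 0, where the log is undefined
            draws.append(math.log(w / (1.0 - w)))
    return draws


u = logistic_draws(50000, seed=42)
mean_u = sum(u) / len(u)
var_u = sum((v - mean_u) ** 2 for v in u) / len(u)
print(mean_u, var_u, math.pi ** 2 / 3)
```

The sample variance should be close to $\pi^2/3 \approx 3.29$, the variance of the standard logistic distribution, which is what gives the logistic its greater dispersion relative to a standard normal.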

21.5.1.b Fixed Effects Models

The fixed effects model is

$$y_{it}^* = \alpha_i d_{it} + x_{it}'\beta + \varepsilon_{it}, \quad i = 1, \ldots, n,\ t = 1, \ldots, T_i,$$
$$y_{it} = 1 \text{ if } y_{it}^* > 0, \text{ and } 0 \text{ otherwise},$$

where $d_{it}$ is a dummy variable which takes the value one for individual $i$ and zero otherwise. For convenience, we have redefined $x_{it}$ to be the nonconstant variables in the model. The parameters to be estimated are the $K$ elements of $\beta$ and the $n$ individual constant terms. Before we consider the several virtues and shortcomings of this model, we consider the practical aspects of estimation of what are possibly a huge number of parameters, $(n + K)$; $n$ is not limited here, and could be in the thousands in a typical application. The log-likelihood function for the fixed effects model is

$$\ln L = \sum_{i=1}^{n} \sum_{t=1}^{T_i} \ln P(y_{it} \mid \alpha_i + x_{it}'\beta),$$

where $P(\cdot)$ is the probability of the observed outcome, for example, $\Phi[q_{it}(\alpha_i + x_{it}'\beta)]$ for the probit model or $\Lambda[q_{it}(\alpha_i + x_{it}'\beta)]$ for the logit model. What follows can be extended to any index function model, but for the present, we'll confine our attention to symmetric distributions such as the normal and logistic, so that the probability can be conveniently written as $\text{Prob}(Y_{it} = y_{it} \mid x_{it}) = P[q_{it}(\alpha_i + x_{it}'\beta)]$. It will be convenient to let $z_{it} = \alpha_i + x_{it}'\beta$, so $\text{Prob}(Y_{it} = y_{it} \mid x_{it}) = P(q_{it} z_{it})$.

In our previous application of this model, in the linear regression case, we found that estimation of the parameters was made possible by a transformation of the data to deviations from group means, which eliminated the person-specific constants from the estimator. (See Section 13.3.2.) Save for the special case discussed below, that will not be possible here, so that if one desires to estimate the parameters of this model, it will be necessary actually to compute the possibly huge number of constant terms at the same time. This has been widely viewed as a practical obstacle to estimation of this model because of the need to invert a potentially large second derivatives matrix, but this is a misconception. [See, e.g., Maddala (1987, p. 317).] The likelihood equations for this model are

$$\frac{\partial \ln L}{\partial \alpha_i} = \sum_{t=1}^{T_i} \frac{q_{it} f(q_{it} z_{it})}{P(q_{it} z_{it})} = \sum_{t=1}^{T_i} g_{it} = g_{ii} = 0$$

and

$$\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{n} \sum_{t=1}^{T_i} \frac{q_{it} f(q_{it} z_{it})}{P(q_{it} z_{it})}\, x_{it} = \sum_{i=1}^{n} \sum_{t=1}^{T_i} g_{it} x_{it} = 0,$$

where $f(\cdot)$ is the density that corresponds to $P(\cdot)$. For our two familiar models, $g_{it} = q_{it}\phi(q_{it} z_{it})/\Phi(q_{it} z_{it})$ for the normal and $q_{it}[1 - \Lambda(q_{it} z_{it})]$ for the logistic. Note that for these distributions, $g_{it}$ is always negative when $y_{it}$ is zero and always positive when $y_{it}$ equals one. The use of $q_{it}$ as in the preceding assumes the distribution is symmetric. For asymmetric distributions such as the Weibull, $g_{it}$ and $h_{it}$ would be more complicated,


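but the central results would be the same. The second derivatives matrix is

$$\frac{\partial^2 \ln L}{\partial \alpha_i^2} = \sum_{t=1}^{T_i} \left[ \frac{f'(q_{it} z_{it})}{P(q_{it} z_{it})} - \left( \frac{f(q_{it} z_{it})}{P(q_{it} z_{it})} \right)^2 \right] = \sum_{t=1}^{T_i} h_{it} = h_{ii} < 0,$$

$$\frac{\partial^2 \ln L}{\partial \beta\, \partial \alpha_i} = \sum_{t=1}^{T_i} h_{it} x_{it},$$

$$\frac{\partial^2 \ln L}{\partial \beta\, \partial \beta'} = \sum_{i=1}^{n} \sum_{t=1}^{T_i} h_{it} x_{it} x_{it}' = H_{\beta\beta'}, \text{ a negative semidefinite matrix.}$$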
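The terms $g_{it}$ and $h_{it}$ are simple to compute for the two symmetric cases. The sketch below is an illustration only: it uses $f'(a) = -a\phi(a)$ for the normal density and, for the logit, the simplification $h_{it} = -\Lambda(q_{it} z_{it})[1 - \Lambda(q_{it} z_{it})]$ that follows from $f = \Lambda(1 - \Lambda)$; the function names are assumptions of the sketch.

```python
import math
from statistics import NormalDist

N01 = NormalDist()


def probit_g_h(y, z):
    """g_it and h_it for the probit model at index value z = alpha_i + x_it'beta."""
    q = 2 * y - 1
    ratio = N01.pdf(q * z) / N01.cdf(q * z)   # f / P
    g = q * ratio
    h = -(q * z) * ratio - ratio ** 2         # f'/P - (f/P)^2, with f'(a) = -a*phi(a)
    return g, h


def logit_g_h(y, z):
    """g_it and h_it for the logit model, where f = Lambda * (1 - Lambda)."""
    q = 2 * y - 1
    lam = 1.0 / (1.0 + math.exp(-q * z))
    g = q * (1.0 - lam)
    h = -lam * (1.0 - lam)                    # (1 - 2*lam)*(1 - lam) - (1 - lam)**2
    return g, h


# Check the sign pattern on a grid of index values from -4 to 4.
grid = [k / 4.0 for k in range(-16, 17)]
sign_ok = all(
    (g > 0) == (y == 1) and h < 0
    for terms in (probit_g_h, logit_g_h)
    for y in (0, 1)
    for z in grid
    for g, h in [terms(y, z)]
)
print(sign_ok)
```

On this grid the check confirms that $g_{it}$ takes the sign of $q_{it}$ and that $h_{it}$ is negative throughout, which is what makes $h_{ii} < 0$ and the Hessian block negative semidefinite.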

Note that the leading $q_{it}$ falls out of the second derivatives since, in each appearance, $q_{it}^2 = 1$. The derivatives of the densities with respect to their arguments are $-(q_{it} z_{it})\phi(q_{it} z_{it})$ for the normal distribution and $[1 - 2\Lambda(q_{it} z_{it})] f(q_{it} z_{it})$ for the logistic. In both cases, $h_{it}$ is negative for all values of $q_{it} z_{it}$. The likelihood equations are a large system, but the solution turns out to be surprisingly straightforward. [See Greene (2001).] By using the formula for the partitioned inverse, we find that the $K \times K$ submatrix of the inverse of the Hessian that corresponds to $\beta$, which would provide the asymptotic covariance matrix for the MLE, is

$$H^{\beta\beta'} = \left\{ \sum_{i=1}^{n} \left[ \sum_{t=1}^{T_i} h_{it} x_{it} x_{it}' - \frac{1}{h_{ii}} \left( \sum_{t=1}^{T_i} h_{it} x_{it} \right) \left( \sum_{t=1}^{T_i} h_{it} x_{it}' \right) \right] \right\}^{-1} = \left\{ \sum_{i=1}^{n} \left[ \sum_{t=1}^{T_i} h_{it} (x_{it} - \bar{x}_i)(x_{it} - \bar{x}_i)' \right] \right\}^{-1}, \quad \text{where } \bar{x}_i = \frac{\sum_{t=1}^{T_i} h_{it} x_{it}}{h_{ii}}.$$

Note the striking similarity to the result we had for the fixed effects model in the linear case. By assembling the Hessian as a partitioned matrix for $\beta$ and the full vector of constant terms, then using (A-66b) and the definitions above to isolate one diagonal element, we find

$$H^{\alpha_i \alpha_i} = \frac{1}{h_{ii}} + \bar{x}_i' H^{\beta\beta'} \bar{x}_i.$$

Once again, the result has the same format as its counterpart in the linear model. In principle, the negatives of these would be the estimators of the asymptotic variances of the maximum likelihood estimators. (Asymptotic properties in this model are problematic, as we consider below.) All of these can be computed quite easily once the parameter estimates are in hand, so that in fact, practical estimation of the model is not really the obstacle. (This must be qualified, however. Looking at the likelihood equation for a constant term, it is clear that if $y_{it}$ is the same in every period, then there is no solution. For example, if $y_{it} = 1$ in every period, then $\partial \ln L/\partial \alpha_i$ must be positive, so it cannot be equated to zero with finite coefficients. Such groups would have to be removed from the sample in order to fit this model.) It is shown in Greene (2001) that, in spite of the potentially large number of parameters in the model, Newton's method can be used with the following iteration,


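which uses only the $K \times K$ matrix computed above and a few $K \times 1$ vectors:

$$\hat{\beta}^{(s+1)} = \hat{\beta}^{(s)} - \left\{ \sum_{i=1}^{n} \left[ \sum_{t=1}^{T_i} h_{it} (x_{it} - \bar{x}_i)(x_{it} - \bar{x}_i)' \right] \right\}^{-1} \left\{ \sum_{i=1}^{n} \left[ \sum_{t=1}^{T_i} g_{it} (x_{it} - \bar{x}_i) \right] \right\} = \hat{\beta}^{(s)} + \Delta_{\beta}^{(s)}$$

and

$$\hat{\alpha}_i^{(s+1)} = \hat{\alpha}_i^{(s)} - \left[ \frac{g_{ii}}{h_{ii}} + \bar{x}_i' \Delta_{\beta}^{(s)} \right].$$ ^29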
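The iteration can be sketched for the logit case with a single regressor, so that the $K \times K$ matrix is a scalar. Everything below (the data generator, the starting values at zero, the stopping rule, and the function names) is an assumption of the sketch rather than a description of any particular software implementation.

```python
import math
import random


def logit_terms(y, x, alpha, beta):
    """Lists of g_it and h_it for one group of a fixed effects logit (scalar x)."""
    g, h = [], []
    for y_t, x_t in zip(y, x):
        q = 2 * y_t - 1
        lam = 1.0 / (1.0 + math.exp(-q * (alpha + x_t * beta)))
        g.append(q * (1.0 - lam))
        h.append(-lam * (1.0 - lam))
    return g, h


def fe_logit_newton(groups, max_iter=100, tol=1e-10):
    """Newton iteration that only ever forms the K x K block (here K = 1)."""
    beta, alphas = 0.0, [0.0] * len(groups)
    for _ in range(max_iter):
        hess, grad, per_group = 0.0, 0.0, []
        for (y, x), a in zip(groups, alphas):
            g, h = logit_terms(y, x, a, beta)
            g_ii, h_ii = sum(g), sum(h)
            xbar = sum(h_t * x_t for h_t, x_t in zip(h, x)) / h_ii
            hess += sum(h_t * (x_t - xbar) ** 2 for h_t, x_t in zip(h, x))
            grad += sum(g_t * (x_t - xbar) for g_t, x_t in zip(g, x))
            per_group.append((g_ii, h_ii, xbar))
        delta_beta = -grad / hess
        steps = [-(g_ii / h_ii + xbar * delta_beta) for g_ii, h_ii, xbar in per_group]
        beta += delta_beta
        alphas = [a + s for a, s in zip(alphas, steps)]
        if max(abs(delta_beta), max(abs(s) for s in steps)) < tol:
            break
    return beta, alphas


def simulate_panel(n, t, beta, seed):
    """Hypothetical data: alpha_i ~ N(0,1), x_it ~ N(0,1), logistic disturbances."""
    rng = random.Random(seed)
    groups = []
    for _ in range(n):
        alpha = rng.gauss(0.0, 1.0)
        x = [rng.gauss(0.0, 1.0) for _ in range(t)]
        y = []
        for x_t in x:
            w = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
            y.append(1 if alpha + beta * x_t + math.log(w / (1.0 - w)) > 0 else 0)
        if 0 < sum(y) < t:  # groups with no variation in y have no finite alpha_i
            groups.append((y, x))
    return groups


groups = simulate_panel(n=200, t=8, beta=1.0, seed=7)
beta_hat, alpha_hat = fe_logit_newton(groups)
print(beta_hat, len(groups))
```

Each pass touches every observation once and stores only three numbers per group, so the work grows linearly with $n$; no $n \times n$ matrix is ever formed. Groups in which $y_{it}$ never changes are dropped before estimation, as the text requires.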

This is a large amount of computation involving many summations, but it is linear in the number of parameters and does not involve any $n \times n$ matrices.

The problems with the fixed effects estimator are statistical, not practical.^30 The estimator relies on $T_i$ increasing for the constant terms to be consistent; in essence, each $\alpha_i$ is estimated with $T_i$ observations. But in this setting, not only is $T_i$ fixed, it is likely to be quite small. As such, the estimators of the constant terms are not consistent (not because they converge to something other than what they are trying to estimate, but because they do not converge at all). The estimator of $\beta$ is a function of the estimators of $\alpha$, which means that the MLE of $\beta$ is not consistent either. This is the incidental parameters problem. [See Neyman and Scott (1948) and Lancaster (2000).] There is, as well, a small sample (small $T_i$) bias in the estimators. How serious this bias is remains a question in the literature. Two pieces of received wisdom are Hsiao's (1986) results for a binary logit model and Heckman and MaCurdy's (1980) results for the probit model. Hsiao found that for $T_i = 2$, the bias in the MLE of $\beta$ is 100 percent, which is extremely pessimistic. Heckman and MaCurdy found in a Monte Carlo study that in samples of $n = 100$ and $T = 8$, the bias appeared to be on the order of 10 percent, which is substantive, but certainly less severe than Hsiao's results suggest. The fixed effects approach does have some appeal in that it does not require an assumption of orthogonality of the independent variables and the heterogeneity. An ongoing pursuit in the literature is concerned with the severity of the tradeoff of this virtue against the incidental parameters problem. Some commentary on this issue appears in Arellano (2001).

Why did the incidental parameters problem arise here and not in the linear regression model? Recall that estimation in the regression model was based on the deviations from group means, not the original data as it is here. The result we exploited there was that although $f(y_{it} \mid X_i)$ is a function of $\alpha_i$, $f(y_{it} \mid X_i, \bar{y}_i)$ is not a function of $\alpha_i$, and we used the latter in estimation of $\beta$. In that setting, $\bar{y}_i$ is a minimal sufficient statistic for $\alpha_i$. Sufficient statistics are available for a few distributions that we will examine, but not for the probit model. They are available for the logit model, as we now examine.
29 Similar results appear in Prentice and Gloeckler (1978) who attribute it to Rao (1973) and Chamberlain (1983).

30 See Vytlacil, Aakvik, and Heckman (2002), Chamberlain (1980, 1984), Newey (1994), Bover and Arellano (1997), and Chen (1998) for some extensions of parametric forms of the binary choice models with fixed effects.


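A fixed effects binary logit model is

$$\text{Prob}(y_{it} = 1 \mid x_{it}) = \frac{e^{\alpha_i + x_{it}'\beta}}{1 + e^{\alpha_i + x_{it}'\beta}}.$$

The unconditional likelihood for the $nT$ independent observations is

$$L = \prod_{i} \prod_{t} (F_{it})^{y_{it}} (1 - F_{it})^{1 - y_{it}}.$$

Chamberlain (1980) [following Rasch (1960) and Anderson (1970)] observed that the conditional likelihood function,

$$L^c = \prod_{i=1}^{n} \text{Prob}\left( Y_{i1} = y_{i1}, Y_{i2} = y_{i2}, \ldots, Y_{iT_i} = y_{iT_i} \,\Big|\, \sum_{t=1}^{T_i} y_{it} \right),$$

is free of the incidental parameters, $\alpha_i$. The joint likelihood for each set of $T_i$ observations conditioned on the number of ones in the set is

$$\text{Prob}\left( Y_{i1} = y_{i1}, Y_{i2} = y_{i2}, \ldots, Y_{iT_i} = y_{iT_i} \,\Big|\, \sum_{t=1}^{T_i} y_{it}, \text{data} \right) = \frac{\exp\left( \sum_{t=1}^{T_i} y_{it} x_{it}'\beta \right)}{\sum_{\Sigma_t d_{it} = S_i} \exp\left( \sum_{t=1}^{T_i} d_{it} x_{it}'\beta \right)}.$$

The function in the denominator is summed over the set of all $\binom{T_i}{S_i}$ different sequences of $T_i$ zeros and ones that have the same sum as $S_i = \sum_{t=1}^{T_i} y_{it}$.^31

Consider the example of $T_i = 2$. The unconditional likelihood is

$$L = \prod_{i} \text{Prob}(Y_{i1} = y_{i1})\, \text{Prob}(Y_{i2} = y_{i2}).$$

For each pair of observations, we have these possibilities:

1. $y_{i1} = 0$ and $y_{i2} = 0$. Prob(0, 0 | sum = 0) = 1.
2. $y_{i1} = 1$ and $y_{i2} = 1$. Prob(1, 1 | sum = 2) = 1.

The $i$th term in $L^c$ for either of these is just one, so they contribute nothing to the conditional likelihood function.^32 When we take logs, these terms (and these observations) will drop out. But suppose that $y_{i1} = 0$ and $y_{i2} = 1$. Then

3. $\text{Prob}(0, 1 \mid \text{sum} = 1) = \dfrac{\text{Prob}(0, 1 \text{ and sum} = 1)}{\text{Prob}(\text{sum} = 1)} = \dfrac{\text{Prob}(0, 1)}{\text{Prob}(0, 1) + \text{Prob}(1, 0)}.$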

31 The enumeration of all these computations stands to be quite a burden; see Arellano (2000, p. 47) or Baltagi (1995, p. 180), who [citing Greene (1993)] suggests that $T_i > 10$ would be excessive. In fact, using a recursion suggested by Krailo and Pike (1984), the computation even with $T_i$ up to 100 is routine.

32 Recall that in the probit model, when we encountered this situation, the individual constant term could not be estimated and the group was removed from the sample. The same effect is at work here.


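Therefore, for this pair of observations, the conditional probability is

$$\frac{\dfrac{1}{1 + e^{\alpha_i + x_{i1}'\beta}} \cdot \dfrac{e^{\alpha_i + x_{i2}'\beta}}{1 + e^{\alpha_i + x_{i2}'\beta}}}{\dfrac{1}{1 + e^{\alpha_i + x_{i1}'\beta}} \cdot \dfrac{e^{\alpha_i + x_{i2}'\beta}}{1 + e^{\alpha_i + x_{i2}'\beta}} + \dfrac{e^{\alpha_i + x_{i1}'\beta}}{1 + e^{\alpha_i + x_{i1}'\beta}} \cdot \dfrac{1}{1 + e^{\alpha_i + x_{i2}'\beta}}} = \frac{e^{x_{i2}'\beta}}{e^{x_{i1}'\beta} + e^{x_{i2}'\beta}}.$$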
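The cancellation of $\alpha_i$ can be checked numerically, and the denominator of the conditional probability need not be enumerated sequence by sequence. The sketch below is illustrative: the values of $x_{it}'\beta$ are made up, and the recursion shown is a standard dynamic-programming scheme for sums of this kind, offered in the spirit of the recursion that footnote 31 attributes to Krailo and Pike (1984) rather than as a transcription of their algorithm.

```python
import math
from itertools import combinations


def denominator_bruteforce(w, s):
    """Sum over all 0/1 sequences d with sum(d) = s of prod_t w_t**d_t, w_t = exp(x_t'beta)."""
    return sum(math.prod(w[t] for t in ones) for ones in combinations(range(len(w)), s))


def denominator_recursive(w, s):
    """Same sum via B(t, k) = B(t-1, k) + w_t * B(t-1, k-1)."""
    b = [1.0] + [0.0] * s
    for w_t in w:
        for k in range(s, 0, -1):  # descend so each w_t enters a term at most once
            b[k] += w_t * b[k - 1]
    return b[s]


def conditional_prob(y, xb):
    """Prob(y_1, ..., y_T | sum_t y_t) for the fixed effects logit; alpha_i never appears."""
    w = [math.exp(v) for v in xb]
    numerator = math.exp(sum(y_t * v for y_t, v in zip(y, xb)))
    return numerator / denominator_recursive(w, sum(y))


def joint_prob(y, xb, alpha):
    """Unconditional probability of the sequence y, which does depend on alpha_i."""
    p = 1.0
    for y_t, v in zip(y, xb):
        lam = 1.0 / (1.0 + math.exp(-(alpha + v)))
        p *= lam if y_t == 1 else 1.0 - lam
    return p


xb_pair = [0.4, -0.3]  # hypothetical x_i1'beta and x_i2'beta
p_cond = conditional_prob([0, 1], xb_pair)
for alpha in (-2.0, 0.0, 1.7):
    p01 = joint_prob([0, 1], xb_pair, alpha)
    p10 = joint_prob([1, 0], xb_pair, alpha)
    print(alpha, p01 / (p01 + p10), p_cond)
```

Whatever value of `alpha` is used, the ratio of unconditional probabilities matches the conditional probability, which is the sense in which conditioning on the sum removes the heterogeneity. The recursion needs on the order of $T_i S_i$ operations instead of $\binom{T_i}{S_i}$ terms.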

By conditioning on the sum of the two observations, we have removed the heterogeneity. Therefore, we can construct the conditional likelihood function as the product of these terms for the pairs of observations for which the two observations are (0, 1). Pairs of observations with one and zero are included analogously. The product of the terms such as the preceding, for those observation sets for which the sum is not zero or $T_i$, constitutes the conditional likelihood. Maximization of the resulting function is straightforward and may be done by conventional methods.

As in the linear regression model, it is of some interest to test whether there is indeed heterogeneity. With homogeneity ($\alpha_i = \alpha$), there is no unusual problem, and the model can be estimated, as usual, as a logit model. It is not possible to test the hypothesis using the likelihood ratio test, however, because the two likelihoods are not comparable. (The conditional likelihood is based on a restricted data set.) None of the usual tests of restrictions can be used because the individual effects are never actually estimated.^33 Hausman's (1978) specification test is a natural one to use here, however. Under the null hypothesis of homogeneity, both Chamberlain's conditional maximum likelihood estimator (CMLE) and the usual maximum likelihood estimator are consistent, but Chamberlain's is inefficient. (It fails to use the information that $\alpha_i = \alpha$, and it may not use all the data.) Under the alternative hypothesis, the unconditional maximum likelihood estimator is inconsistent,^34 whereas Chamberlain's estimator is consistent and efficient. The Hausman test can be based on the chi-squared statistic

$$\chi^2 = (\hat{\beta}_{CML} - \hat{\beta}_{ML})' \left[ \text{Var}[CML] - \text{Var}[ML] \right]^{-1} (\hat{\beta}_{CML} - \hat{\beta}_{ML}).$$

The estimated covariance matrices are those computed for the two maximum likelihood estimators. For the unconditional maximum likelihood estimator, the row and column corresponding to the constant term are dropped. A large value will cast doubt on the hypothesis of homogeneity. (There are $K$ degrees of freedom for the test.) It is possible that the covariance matrix for the maximum likelihood estimator will be larger than that for the conditional maximum likelihood estimator. If so, then the difference matrix in brackets is assumed to be a zero matrix, and the chi-squared statistic is therefore zero.
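The statistic itself is a single quadratic form. The sketch below uses made-up estimates and covariance matrices for $K = 2$ and a small hand-written linear solver so that it depends only on the standard library; it does not implement the convention of setting the statistic to zero when the difference matrix fails to be positive definite.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def hausman(b_cml, v_cml, b_ml, v_ml):
    """chi2 = d' [Var(CML) - Var(ML)]^{-1} d, with d = b_cml - b_ml."""
    k = len(b_cml)
    d = [b_cml[j] - b_ml[j] for j in range(k)]
    diff = [[v_cml[r][c] - v_ml[r][c] for c in range(k)] for r in range(k)]
    return sum(d_j * s_j for d_j, s_j in zip(d, solve(diff, d)))


# Hypothetical slope estimates and covariance matrices (constant term already dropped).
b_cml, v_cml = [1.0, -0.5], [[0.05, 0.01], [0.01, 0.04]]
b_ml, v_ml = [0.8, -0.4], [[0.03, 0.005], [0.005, 0.02]]
stat = hausman(b_cml, v_cml, b_ml, v_ml)
print(stat)
```

The result is compared with a chi-squared critical value with $K$ degrees of freedom; a value above it casts doubt on homogeneity.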

33 This produces a difficulty for this estimator that is shared by the semiparametric estimators discussed in the next section. Since the fixed effects are not estimated, it is not possible to compute probabilities or marginal effects with these estimated coefficients, and it is a bit ambiguous what one can do with the results of the computations. The brute force estimator that actually computes the individual effects might be preferable.

34 Hsiao (1996) derives the result explicitly for some particular cases.


Example 21.6 Individual Effects in a Binary Choice Model

To illustrate the fixed and random effects estimators, we continue the analyses of Examples 16.5 and 17.10.^35 The binary dependent variable is

$y_{it}$ = 1 if firm $i$ realized a product innovation in year $t$ and 0 if not.

The sample consists of 1,270 German firms observed for 5 years, 1984-1988. Independent variables in the model that we formulated were

$x_{it1}$ = constant,
$x_{it2}$ = log of sales,
$x_{it3}$ = relative size = ratio of employment in business unit to employment in the industry,
$x_{it4}$ = ratio of industry imports to (industry sales + imports),
$x_{it5}$ = ratio of industry foreign direct investment to (industry sales + imports),
$x_{it6}$ = productivity = ratio of industry value added to industry employment.

Latent class and random parameters models were fit to these data in Examples 16.5 and 17.10. (For this example, we have dropped the two sector dummy variables because they are constant across periods, which precludes estimation of the fixed effects models.) Table 21.4 presents estimates of the probit and logit models with individual effects. The differences across the models are quite large. Note, for example, that the coefficients on the sales and FDI variables, both of which are highly significant in the base case, change sign in the fixed effects model. (The random effects logit model is estimated by appending a normally distributed individual effect to the model and using the Butler and Moffitt method described earlier.) The evidence of heterogeneity in the data is quite substantial. The simple likelihood ratio tests of either panel data form against the base case lead to rejection of the restricted model. (The fixed effects logit model cannot be used for this test because it is based on the conditional log-likelihood, whereas the other two forms are based on unconditional likelihoods. It was not possible to fit the logit model with the full set of fixed effects. The relative size variable has some, but not enough, within-group variation, and the model became unstable after only a few iterations.) The Hausman statistic based on the logit estimates equals 19.59. The 95 percent critical value from the chi-squared distribution with 5 degrees of freedom is 11.07, so based on the logit estimates, we would reject the homogeneity restriction. In this setting, unlike in the linear model (see Section 13.4.4), neither the probit nor the logit model provides a means of testing for whether the random or fixed effects model is preferred.
21.5.2 SEMIPARAMETRIC ANALYSIS

In his survey of qualitative response models, Amemiya (1981) reports the following widely cited approximations for the linear probability (LP) model. Over the range of probabilities of 30 to 70 percent,

$$\hat{\beta}_{LP} \approx 0.4\, \beta_{probit} \text{ for the slopes},$$
$$\hat{\beta}_{LP} \approx 0.25\, \beta_{logit} \text{ for the slopes.}$$ ^36
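The two scale factors are, approximately, the probit and logit densities evaluated where the probability equals one half, since a marginal effect is the density times the coefficient. A quick arithmetic check, offered as an illustration (the variable names are ours):

```python
from statistics import NormalDist

probit_density_at_half = NormalDist().pdf(0.0)   # phi(0), the probit density at P = 0.5
logit_density_at_half = 0.5 * (1.0 - 0.5)        # Lambda(0) * [1 - Lambda(0)]
implied_logit_over_probit = probit_density_at_half / logit_density_at_half

print(round(probit_density_at_half, 4))
print(logit_density_at_half)
print(round(implied_logit_over_probit, 3))
```

The ratio of the two densities is about 1.6, so the two approximations together imply that logit slopes are roughly 1.6 times probit slopes near the middle of the probability range.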
35 The data are from Bertschek and Lechner (1998). Description of the data appears in Example 16.5 and in the original paper.

36 An additional 0.5 is added for the constant term in both models.


TABLE 21.4  Estimated Panel Data Models (Standard Errors in Parentheses; Marginal Effects in Brackets)

                          Probit                                  Logit
            Base        Random      Fixed          Base        Random      Fixed
Constant    2.35        3.51                       3.83        0.751
            (0.214)     (0.502)                    (0.351)     (0.611)
lnSales     0.243       0.353       0.650          0.408       0.429       0.863
            (0.194)     (0.448)     (0.355)        (0.0323)    (0.547)     (0.530)
            [0.094]     [0.088]     [0.255]        [0.097]     [0.103]
RelSize     1.17        1.59        0.278          2.16        1.36        0.340
            (0.141)     (0.241)     (0.734)        (0.272)     (0.296)     (1.06)
            [0.450]     [0.398]     [0.110]        [0.517]     [0.328]
Imports     0.909       1.40        3.50           1.49        0.858       4.69
            (0.143)     (0.343)     (2.92)         (0.232)     (0.418)     (4.34)
            [0.350]     [0.351]     [1.38]         [0.356]     [0.207]
FDI         3.39        4.55        8.13           5.75        1.98        10.44
            (0.394)     (0.828)     (3.38)         (0.705)     (1.01)      (5.01)
            [1.31]      [1.14]      [3.20]         [1.37]      [0.477]
Prod        4.71        5.62        5.30           9.33        1.76        6.64
            (0.553)     (0.753)     (4.03)         (1.13)      (0.927)     (5.93)
            [1.82]      [1.41]      [2.09]         [2.29]      [0.424]
ρ                       0.582                                  0.252
                        (0.019)                                (0.081)
Ln L        -4134.86    -3546.01    -2086.26       -4128.98    -3545.84    -1388.51

Aside from confirming our intuition that least squares approximates the nonlinear model and providing a quick comparison for the three models involved, the practical usefulness of the formula is somewhat limited. Still, it is a striking result.^37

A series of studies has focused on reasons why the least squares estimates should be proportional to the probit and logit estimates. A related question concerns the problems associated with assuming that a probit model applies when, in fact, a logit model is appropriate or vice versa.^38 The approximation would seem to suggest that with this type of misspecification, we would once again obtain a scaled version of the correct coefficient vector. (Amemiya also reports the widely observed relationship $\hat{\beta}_{logit} \approx 1.6\, \hat{\beta}_{probit}$, which follows from the results above.)

Greene (1983), building on Goldberger (1981), finds that if the probit model is correctly specified and if the regressors are themselves joint normally distributed, then the probability limit of the least squares estimator is a multiple of the true coefficient

37 This result does not imply that it is useful to report 2.5 times the linear probability estimates with the probit estimates for comparability. The linear probability estimates are already in the form of marginal effects, whereas the probit coefficients must be scaled downward. If the sample proportion happens to be close to 0.5, then the right scale factor will be roughly $\phi[\Phi^{-1}(0.5)] = 0.3989$. But the density falls rapidly as $P$ moves away from 0.5.

38 See Ruud (1986) and Gourieroux et al. (1987).


vector.^39 Greene's result is useful only for the same purpose as Amemiya's quick correction of OLS. Multivariate normality is obviously inconsistent with most applications. For example, nearly all applications include at least one dummy variable. Ruud (1982) and Cheung and Goldberger (1984), however, have shown that much weaker conditions than joint normality will produce the same proportionality result. For a probit model, Cheung and Goldberger require only that $E[x \mid y^*]$ be linear in $y^*$. Several authors have built on these observations to pursue the issue of what circumstances will lead to proportionality results such as these. Ruud (1986) and Stoker (1986) have extended them to a very wide class of models that goes well beyond those of Cheung and Goldberger. Curiously enough, Stoker's results rule out dummy variables, but it is those for which the proportionality result seems to be most robust.^40

21.5.3 THE MAXIMUM SCORE ESTIMATOR (MSCORE)

In Section 21.4.5, we discussed the issue of prediction rules for the probit and logit models. In contrast to the linear regression model, estimation of these binary choice models is not based on a fitting rule, such as the sum of squared residuals, which is related to the fit of the model to the data. The maximum score estimator is based on a fitting rule,

$$\text{Maximize}_{\beta}\ S_{n\alpha}(\beta) = \frac{1}{n} \sum_{i=1}^{n} [z_i - (1 - 2\alpha)]\, \text{sgn}(x_i'\beta).$$ ^41

The parameter $\alpha$ is a preset quantile, and $z_i = 2y_i - 1$. (So $z = -1$ if $y = 0$.) If $\alpha$ is set to $\frac{1}{2}$, then the maximum score estimator chooses the $\beta$ to maximize the number of times that the prediction has the same sign as $z$. This result matches our prediction rule in (21-36) with $F^* = 0.5$. So for $\alpha = 0.5$, maximum score attempts to maximize the number of correct predictions. Since the sign of $x'\beta$ is the same for all positive multiples of $\beta$, the estimator is computed subject to the constraint that $\beta'\beta = 1$.

Since there is no log-likelihood function underlying the fitting criterion, there is no information matrix to provide a method of obtaining standard errors for the estimates. Bootstrapping can be used to provide at least some idea of the sampling variability of the estimator. (See Section E.4.) The method proceeds as follows. After the set of coefficients $b_n$ is computed, $R$ randomly drawn samples of $m$ observations are drawn from the original data set with replacement. The bootstrap sample size $m$ may be less than or equal to $n$, the sample size. With each such sample, the maximum score estimator is recomputed, giving $b_m(r)$. Then the mean-squared deviation matrix

$$\text{MSD}(b) = \frac{1}{R} \sum_{r=1}^{R} [b_m(r) - b_n][b_m(r) - b_n]'$$

39 The scale factor is estimable with the sample data, so under these assumptions, a method of moments estimator is available.

40 See Greene (1983).

41 See Manski (1975, 1985, 1986) and Manski and Thompson (1986). For extensions of this model, see Horowitz (1992), Charlier, Melenberg, and van Soest (1995), Kyriazidou (1997), and Lee (1996).


TABLE 21.5  Maximum Score Estimator

                     Maximum Score                      Probit
                Estimate    Mean Square Dev.      Estimate    Standard Error
Constant  β1    0.9317      0.1066                7.4522      2.5420
GPA       β2    0.3582      0.2152                1.6260      0.6939
TUCE      β3    0.01513     0.02800               0.05173     0.08389
PSI       β4    0.05902     0.2749                1.4264      0.5950

                Fitted                            Fitted
Actual          0      1                          0      1
0               21     0                          18     3
1               4      7                          3      8

is computed. The authors of the technique emphasize that this matrix is not a covariance matrix.^42
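The score criterion and the bootstrap mean-squared deviation matrix can be sketched for the simplest case of a constant and one regressor, where the constraint $\beta'\beta = 1$ puts $\beta$ on the unit circle. The grid search over angles, the simulated data, and all function names below are assumptions made for illustration; they are not the algorithm used by any of the implementations cited in the text.

```python
import math
import random


def sgn(v):
    return (v > 0) - (v < 0)


def score(beta, data, alpha=0.5):
    """S(beta) = (1/n) sum_i [z_i - (1 - 2*alpha)] * sgn(x_i'beta), with z_i = 2*y_i - 1."""
    total = 0.0
    for y, x in data:
        index = sum(b * x_k for b, x_k in zip(beta, x))
        total += ((2 * y - 1) - (1 - 2 * alpha)) * sgn(index)
    return total / len(data)


def max_score(data, alpha=0.5, n_grid=360):
    """Grid search over the unit circle, which imposes beta'beta = 1 for two coefficients."""
    best_beta, best_val = None, -math.inf
    for k in range(n_grid):
        angle = 2 * math.pi * k / n_grid
        beta = (math.cos(angle), math.sin(angle))
        val = score(beta, data, alpha)
        if val > best_val:
            best_beta, best_val = beta, val
    return best_beta, best_val


def bootstrap_msd(data, b_n, n_reps, m, seed=0, n_grid=360):
    """MSD(b) = (1/R) sum_r [b_m(r) - b_n][b_m(r) - b_n]' from resamples of size m."""
    rng = random.Random(seed)
    k = len(b_n)
    msd = [[0.0] * k for _ in range(k)]
    for _ in range(n_reps):
        sample = [rng.choice(data) for _ in range(m)]  # draw with replacement
        b_r = max_score(sample, n_grid=n_grid)[0]
        dev = [b_r[j] - b_n[j] for j in range(k)]
        for r in range(k):
            for c in range(k):
                msd[r][c] += dev[r] * dev[c] / n_reps
    return msd


# Hypothetical data: y = 1 if -0.5 + x1 + noise > 0; x = (constant, x1).
rng = random.Random(3)
data = []
for _ in range(60):
    x1 = rng.gauss(0.0, 1.0)
    y = 1 if -0.5 + x1 + rng.gauss(0.0, 1.0) > 0 else 0
    data.append((y, (1.0, x1)))

b_n, s_n = max_score(data)
msd = bootstrap_msd(data, b_n, n_reps=30, m=len(data), seed=11)
print(b_n, s_n, msd)
```

With $\alpha = 0.5$ the score equals (correct predictions minus incorrect predictions) divided by $n$, so maximizing it maximizes the number of correct predictions. Because the criterion is a step function, many values of $\beta$ attain the maximum, which is one reason the bootstrap deviations are best read descriptively.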
Example 21.7 The Maximum Score Estimator

Table 21.5 presents maximum score estimates for Spector and Mazzeo's GRADE model using $\alpha = 0.5$. Note that they are quite far removed from the probit estimates. (The estimates are extremely sensitive to the choice of $\alpha$.) Of course, there is no meaningful comparison of the coefficients, since the maximum score estimates are not the slopes of a conditional mean function. The prediction performance of the model is also quite sensitive to $\alpha$, but that is to be expected.^43 As expected, the maximum score estimator performs better than the probit estimator. The score is precisely the number of correct predictions in the 2 × 2 table, so the best that the probit model could possibly do is obtain the "maximum score." In this example, it does not quite attain that maximum. [The literature awaits a comparison of the prediction performance of the probit/logit (parametric) approaches and this semiparametric model.] The relevant scores for the two estimators are also given in the table.

Semiparametric approaches such as this one have the virtue that they do not make a possibly erroneous assumption about the underlying distribution. On the other hand, as seen in the example, there is no guarantee that the estimator will outperform the fully parametric estimator. One additional practical consideration is that semiparametric estimators such as this one are very computation intensive. At present, the maximum score estimator is not usable for more than roughly 15 coefficients and perhaps 1,500 to 2,000 observations.^44 A third shortcoming of the approach is, unfortunately, inherent in
42 Note that we are not yet agreed that $b_n$ even converges to a meaningful vector, since no underlying probability distribution as such has been assumed. Once it is agreed that there is an underlying regression function at work, then a meaningful set of asymptotic results, including consistency, can be developed. Manski and Thompson (1986) and Kim and Pollard (1990) present a number of results. Even so, it has been shown that the bootstrap MSD matrix is useful for little more than descriptive purposes. Horowitz's (1993) smoothed maximum score estimator replaces the discontinuous $\text{sgn}(x_i'\beta)$ in the MSCORE criterion with a continuous weighting function of $x_i'\beta/h$, where $h$ is a bandwidth proportional to $n^{-1/5}$. He argues that this estimator is an improvement over Manski's MSCORE estimator. ("Its asymptotic distribution is very complicated and not useful for making inferences in applications." Later in the same paragraph he argues, "There has been no theoretical investigation of the properties of the bootstrap in maximum score estimation.")

43 The criterion function for choosing $b$ is not continuous, and it has more than one optimum. M. E. Bissey reported finding that the score function varies significantly between the local optima as well. [Personal correspondence to the author, University of York (1995).]

44 Communication from C. Manski to the author. The maximum score estimator has been implemented by Manski and Thompson (1986) and Greene (1995a).


its design. The parametric assumptions of the probit or logit produce a large amount of information about the relationship between the response variable and the covariates. In the final analysis, the marginal effects discussed earlier might well have been the primary objective of the study. That information is lost here.

21.5.4 SEMIPARAMETRIC ESTIMATION

The fully parametric probit and logit models remain by far the mainstays of empirical research on binary choice. Fully nonparametric discrete choice models are fairly exotic and have made only limited inroads in the literature, and much of that literature is theoretical [e.g., Matzkin (1993)]. The primary obstacle to application is their paucity of interpretable results. (See Example 21.9.) Of course, one could argue on this basis that the firm results produced by the fully parametric models are merely fragile artifacts of the detailed specification, not genuine reflections of some underlying truth. [In this connection, see Manski (1995).] But that orthodox view raises the question of what motivates the study to begin with and what one hopes to learn by embarking upon it. The intent of model building to approximate reality so as to draw useful conclusions is hardly limited to the analysis of binary choices. Semiparametric estimators represent a middle ground between these extreme views.^45 The single index model of Klein and Spady (1993) has been used in several applications, including Gerfin (1996), Horowitz (1993), and Fernandez and Rodriguez-Poo (1997).^46

The single index formulation departs from a linear "regression" formulation,

$$E[y_i \mid x_i] = E[y_i \mid x_i'\beta].$$

Then

$$\text{Prob}(y_i = 1 \mid x_i) = F(x_i'\beta \mid x_i) = G(x_i'\beta),$$

where $G$ is an unknown continuous distribution function whose range is [0, 1]. The function $G$ is not specified a priori; it is estimated along with the parameters. (Since $G$ as well as $\beta$ is to be estimated, a constant term is not identified; essentially, $G$ provides the location for the index that would otherwise be provided by a constant.) The criterion function for estimation, in which subscripts $n$ denote estimators of their unsubscripted counterparts, is

$$\ln L_n = \frac{1}{n} \sum_{i=1}^{n} \left\{ y_i \ln G_n(x_i'\beta_n) + (1 - y_i) \ln[1 - G_n(x_i'\beta_n)] \right\}.$$

The estimator of the probability function, $G_n$, is computed at each iteration using a nonparametric kernel estimator of the density of $x'\beta_n$; we did this calculation in Section 16.4. For the Klein and Spady estimator, the nonparametric regression
45 Recent proposals for semiparametric estimators in addition to the one developed here include Lewbel (1997, 2000), Lewbel and Honore (2001), and Altonji and Matzkin (2001). In spite of nearly 10 years of development, this is a nascent literature. The theoretical development tends to focus on root-n consistent coefficient estimation in models which provide no means of computation of probabilities or marginal effects.

46 A symposium on the subject is Hardle and Manski (1993).


estimator is

$$G_n(z_i) = \frac{\bar{y}\, g_n(z_i \mid y_i = 1)}{\bar{y}\, g_n(z_i \mid y_i = 1) + (1 - \bar{y})\, g_n(z_i \mid y_i = 0)},$$

where $g_n(z_i \mid y_i)$ is the kernel estimate of the density of $z_i = \beta_n' x_i$. This result is

$$g_n(z_i \mid y_i = 1) = \frac{1}{n \bar{y} h_n} \sum_{j=1}^{n} y_j K\left[ \frac{z_i - \beta_n' x_j}{h_n} \right];$$

$g_n(z_i \mid y_i = 0)$ is obtained by replacing $\bar{y}$ with $1 - \bar{y}$ in the leading scalar and $y_j$ with $1 - y_j$ in the summation. As before, $h_n$ is the bandwidth. There is no firm theory for choosing the kernel function or the bandwidth. Both Horowitz and Gerfin used the standard normal density. Two different methods for choosing the bandwidth are suggested by them.^47 Klein and Spady provide theoretical background for computing asymptotic standard errors.
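For a given trial value of $\beta_n$, the probability estimator and the criterion function are straightforward to evaluate. The sketch below takes the index values $z_j = \beta_n' x_j$ as given, uses the standard normal kernel, and works with a handful of made-up observations; it omits the trimming and other refinements a production estimator would need, and the function names are assumptions.

```python
import math


def normal_kernel(v):
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)


def g_n_density(z_i, y_value, z, y, h_n):
    """Kernel density estimate of the index among observations with y_j = y_value."""
    n = len(y)
    share = sum(1 for y_j in y if y_j == y_value) / n  # ybar, or 1 - ybar
    kernel_sum = sum(normal_kernel((z_i - z_j) / h_n)
                     for z_j, y_j in zip(z, y) if y_j == y_value)
    return kernel_sum / (n * share * h_n)


def G_n_probability(z_i, z, y, h_n):
    """ybar*g_n(z|y=1) / [ybar*g_n(z|y=1) + (1 - ybar)*g_n(z|y=0)]."""
    ybar = sum(y) / len(y)
    num = ybar * g_n_density(z_i, 1, z, y, h_n)
    return num / (num + (1.0 - ybar) * g_n_density(z_i, 0, z, y, h_n))


def criterion(z, y, h_n):
    """ln L_n = (1/n) sum_i {y_i ln G_n(z_i) + (1 - y_i) ln[1 - G_n(z_i)]}."""
    total = 0.0
    for z_i, y_i in zip(z, y):
        g = G_n_probability(z_i, z, y, h_n)
        total += y_i * math.log(g) + (1 - y_i) * math.log(1.0 - g)
    return total / len(y)


# Hypothetical index values z_j = beta_n'x_j and outcomes.
z_vals = [-1.2, -0.7, -0.3, 0.0, 0.4, 0.9, 1.5, 2.0]
y_vals = [0, 0, 1, 0, 1, 1, 0, 1]
probs = [G_n_probability(z_i, z_vals, y_vals, h_n=0.5) for z_i in z_vals]
ln_l = criterion(z_vals, y_vals, h_n=0.5)
print(probs, ln_l)
```

Algebraically the leading scalars cancel, so $G_n(z_i)$ reduces to a kernel-weighted average of the $y_j$, $\sum_j y_j K_{ij} / \sum_j K_{ij}$, which is why the text calls it a nonparametric regression estimator. In estimation, `criterion` would be maximized over $\beta_n$, with the index values recomputed at each trial value.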
    Example 21.8 A Comparison of Binary Choice Estimators

    Gerfin (1996) did an extensive analysis of several binary choice estimators: the probit model, Klein and Spady's single index model, and Horowitz's smoothed maximum score estimator. (A fourth "seminonparametric" estimator was also examined, but in the interest of brevity, we confine our attention to the three more widely used procedures.) The several models were all fit to two data sets on labor force participation of married women, one from Switzerland and one from Germany. Variables included in the equation were (our notation) x1 = a constant, x2 = age, x3 = age², x4 = education, x5 = number of young children, x6 = number of older children, x7 = log of yearly nonlabor income, and x8 = a dummy variable for permanent foreign resident (Swiss data only). Coefficient estimates for the models are not directly comparable. We suggested in Example 21.3 that they could be made comparable by transforming them to marginal effects. Neither MSCORE nor the single index model, however, produces a marginal effect (which does suggest a question of interpretation). The author obtained comparability by dividing all coefficients by the absolute value of the coefficient on x7. The set of normalized coefficients estimated for the Swiss data appears in Table 21.6, with estimated standard errors (from Gerfin's Table III) shown in parentheses.

    Given the very large differences in the models, the agreement of the estimates is impressive. [A similar comparison of the same estimators with comparable concordance may be found in Horowitz (1993, p. 56).] In every case, the standard error of the probit estimator is smaller than that of the others. It is tempting to conclude that it is a more efficient estimator, but that is true only if the normal distribution assumed for the model is correct. In any event, the smaller standard error is the payoff to the sharper specification of the distribution. This payoff could be viewed in much the same way that parametric restrictions in the classical regression make the asymptotic covariance matrix of the restricted least squares estimator smaller than its unrestricted counterpart, even if the restrictions are incorrect.

    Gerfin then produced plots of F(z) for z in the range of the sample values of b'x. Once again, the functions are surprisingly close. In the German data, however, the Klein-Spady estimator is nonmonotonic over a sizeable range, which would cause some difficult problems of interpretation. The maximum score estimator does not produce an estimate of the probability, so it is excluded from this comparison. Another comparison is based on the predictions of the observed response. Two approaches are tried: first, counting the number of cases in which the predicted probability exceeds 0.5 (b'x > 0 for MSCORE), and second, summing the sample values of F(b'x). (Once again, MSCORE is excluded.) By the second approach,
    47. The function $G_n(z)$ involves an enormous amount of computation, on the order of $n^2$, in principle. As Gerfin (1996) observes, however, computation of the kernel estimator can be cast as a Fourier transform, for which the fast Fourier transform reduces the amount of computation to the order of $n \log_2 n$. This value is only slightly larger than linear in n. See Press et al. (1986) and Gerfin (1996).


    TABLE 21.6  Estimated Parameters for Semiparametric Models
    (standard errors in parentheses)

                      x1      x2      x3      x4      x5      x6      x7      x8      h
    Probit           5.62    3.11    0.44    0.03    1.07    0.22    1.00    1.07
                    (1.35)  (0.77)  (0.10)  (0.03)  (0.26)  (0.09)          (0.29)
    Single index             2.98    0.44    0.02    1.32    0.25    1.00    1.06    0.40
                            (0.90)  (0.12)  (0.03)  (0.33)  (0.11)          (0.32)
    MSCORE           5.83    2.84    0.40    0.03    0.80    0.16    1.00    0.91    0.70
                    (1.78)  (0.98)  (0.13)  (0.05)  (0.43)  (0.20)          (0.57)

    the estimators are almost indistinguishable, but the results for the first differ widely. Of 401 ones (out of 873 observations), the counts of predicted ones are 389 for probit, 382 for Klein-Spady, and 355 for MSCORE. (The results do not indicate how many of these counts are correct predictions.)
    21.5.5 A KERNEL ESTIMATOR FOR A NONPARAMETRIC REGRESSION FUNCTION

    As noted, one unsatisfactory aspect of semiparametric formulations such as MSCORE is that the amount of information that the procedure provides about the population is limited; this aspect is, after all, the purpose of dispensing with the firm (parametric) assumptions of the probit and logit models. Thus, in the preceding example, there is little that one can say about the population that generated the data based on the MSCORE estimates in the table. The estimates do allow predictions of the response variable. But there is little information about any relationship between the response and the independent variables based on the estimation results. Even the mean-squared deviation matrix is suspect as an estimator of the asymptotic covariance matrix of the MSCORE coefficients.

    The authors of the technique have proposed a secondary analysis of the results. Let
    \[ F_\beta(z_i) = E[y_i \mid x_i'\beta = z_i] \]
    denote a smooth regression function for the response variable. Based on a parameter vector $\beta$, the authors propose to estimate the regression by the method of kernels as follows. For the n observations in the sample and for the given $\beta$ (e.g., $b_n$ from MSCORE), let
    \[ z_i = x_i'\beta, \qquad s = \left[ \frac{1}{n} \sum_{i=1}^{n} (z_i - \bar{z})^2 \right]^{1/2}. \]
    For a particular value $z^*$, we compute a set of n weights using the kernel function,
    \[ w_i(z^*) = K\left[ \frac{z^* - z_i}{\lambda s} \right], \]
    where
    \[ K(r_i) = P(r_i)[1 - P(r_i)] \]
    and
    \[ P(r_i) = [1 + \exp(-c r_i)]^{-1}. \]
    The constant $c = (\pi/\sqrt{3})^{-1} \approx 0.55133$ is used to standardize the logistic distribution that is used for the kernel function. (See Section 16.4.1.) The parameter $\lambda$ is the smoothing (bandwidth) parameter. Large values will flatten the estimated function through $\bar{y}$, whereas values close to zero will allow greater variation in the function but might cause it to be unstable. There is no good theory for the choice, but some suggestions have been made based on descriptive statistics. [See Wong (1983) and Manski (1986).] Finally, the function value is estimated with
    \[ F(z^*) \approx \frac{\sum_{i=1}^{n} w_i(z^*)\, y_i}{\sum_{i=1}^{n} w_i(z^*)}. \]

    Example 21.9 Nonparametric Regression

    Figure 21.3 shows a plot of two estimates of the regression function for E[GRADE | z]. The coefficients are the MSCORE estimates given in Table 21.5. The plot is produced by computing fitted values for 100 equally spaced points in the range of $x'b_n$, which for these data and coefficients is [-0.66229, 0.05505]. The function is estimated with two values of the smoothing parameter, 1.0 and 0.3. As expected, the function based on $\lambda = 1.0$ is much flatter than that based on $\lambda = 0.3$. Clearly, the results of the analysis are crucially dependent on the value assumed.
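    The kernel regression just described is simple to compute. The sketch below follows the formulas for $s$, $w_i(z^*)$, $K$, and $P$ as stated, with the constant $c = (\pi/\sqrt{3})^{-1}$. The index values and responses are invented for illustration (they are not the GRADE data), and the function names are hypothetical.

```python
import math

C = math.sqrt(3.0) / math.pi  # c = (pi / sqrt(3))^{-1}, about 0.55133


def P(r):
    """Logistic cdf with the scaling constant c from the text."""
    return 1.0 / (1.0 + math.exp(-C * r))


def K(r):
    """Kernel K(r) = P(r)[1 - P(r)]."""
    p = P(r)
    return p * (1.0 - p)


def kernel_regression(z_star, z, y, lam):
    """Estimate F(z*) as a kernel-weighted average of the responses y."""
    n = len(z)
    z_bar = sum(z) / n
    s = math.sqrt(sum((z_i - z_bar) ** 2 for z_i in z) / n)
    w = [K((z_star - z_i) / (lam * s)) for z_i in z]
    return sum(w_i * y_i for w_i, y_i in zip(w, y)) / sum(w)


# Made-up index values z_i = x_i'b and binary responses, for illustration only.
z = [-0.6, -0.5, -0.4, -0.3, -0.2, -0.1, 0.0]
y = [0, 0, 0, 1, 0, 1, 1]
grid = [-0.6 + 0.6 * k / 99 for k in range(100)]  # 100 equally spaced points
smooth = [kernel_regression(g, z, y, lam=1.0) for g in grid]
rough = [kernel_regression(g, z, y, lam=0.3) for g in grid]
```

    With these data the fitted values for the larger smoothing parameter vary over a much narrower band than those for the smaller one, which is the flattening toward $\bar{y}$ described above.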

    The nonparametric estimator displays a relationship between $x'\beta$ and $E[y_i]$. At first blush, this relationship might suggest that we could deduce the marginal effects, but unfortunately, that is not the case. The coefficients in this setting are not meaningful, so all we can deduce is an estimate of the density, f(z), by using first differences of the estimated regression function. It might seem, therefore, that the analysis has produced

    FIGURE 21.3  Nonparametric Regression. [Plot of the estimated regression function F against the index x'b for smoothing parameters 0.3 and 1.]


    relatively little payoff for the effort. But that should come as no surprise if we reconsider the assumptions we have made to reach this point. The only assumptions made thus far are that for a given vector of covariates $x_i$ and coefficient vector $\beta$ (that is, any $\beta$), there exists a smooth function $F(x'\beta) = E[y_i \mid z_i]$. We have also assumed, at least implicitly, that the coefficients carry some information about the covariation of $x'\beta$ and the response variable. The technique will approximate any such function [see Manski (1986)].

    There is a large and burgeoning literature on kernel estimation and nonparametric estimation in econometrics. [A recent application is Melenberg and van Soest (1996).] As this simple example suggests, with the radically different forms of the specified model, the information that is culled from the data changes radically as well. The general principle now made evident is that the fewer assumptions one makes about the population, the less precise the information that can be deduced by statistical techniques. That tradeoff is inherent in the methodology.
    21.5.6 DYNAMIC BINARY CHOICE MODELS

    A random or fixed effects model which explicitly allows for lagged effects would be
    \[ y_{it} = \mathbf{1}(x_{it}'\beta + \alpha_i + \gamma y_{i,t-1} + \varepsilon_{it} > 0). \]
    Lagged effects, or persistence, in a binary choice setting can arise from three sources: serial correlation in $\varepsilon_{it}$, the heterogeneity $\alpha_i$, or true state dependence through the term $\gamma y_{i,t-1}$. Chiappori (1998) [and see Arellano (2001)] suggests an application to the French automobile insurance market in which the incentives built into the pricing system are such that having an accident in one period should lower the probability of having one in the next (state dependence), but some drivers remain more likely to have accidents than others in every period, which would reflect the heterogeneity instead. State dependence is likely to be particularly important in the typical panel, which has only a few observations for each individual. Heckman (1981a) examined this issue at length. Among his findings were that the somewhat muted small sample bias in fixed effects models with T = 8 was made much worse when there was state dependence. A related problem is that with a relatively short panel, the initial conditions, $y_{i0}$, have a crucial impact on the entire path of outcomes. Modeling dynamic effects and initial conditions in binary choice models is more complex than in the linear model, and by comparison there are relatively fewer firm results in the applied literature.

    Much of the contemporary literature has focused on methods of avoiding the strong parametric assumptions of the probit and logit models. Manski (1987) and Honoré and Kyriazidou (2000) show that Manski's (1986) maximum score estimator can be applied to the differences of unequal pairs of observations in a two-period panel with fixed effects. However, the limitations of the maximum score estimator noted earlier have motivated research on other approaches. Extensions of lagged effects to a parametric model are Chamberlain (1985), Jones and Landwehr (1988), and Magnac (1997), who added state dependence to Chamberlain's fixed effects logit estimator. Unfortunately, once the identification issues are settled, the model is only operational if there are no other exogenous variables in it, which limits its usefulness for practical application. Lewbel (2000) has extended his fixed effects estimator to dynamic models as well. In this framework, the narrow assumptions about the independent variables somewhat limit its practical applicability. Honoré and Kyriazidou (2000) have combined the logic of the conditional logit model and Manski's maximum score estimator. They specify
    \[ \text{Prob}(y_{i0} = 1 \mid x_i, \alpha_i) = p_0(x_i, \alpha_i), \quad \text{where } x_i = (x_{i1}, x_{i2}, \ldots, x_{iT}), \]
    \[ \text{Prob}(y_{it} = 1 \mid x_i, \alpha_i, y_{i0}, y_{i1}, \ldots, y_{i,t-1}) = F(x_{it}'\beta + \alpha_i + \gamma y_{i,t-1}), \quad t = 1, \ldots, T. \]
    The analysis assumes a single regressor and focuses on the case of T = 3. The resulting estimator resembles Chamberlain's but relies on observations for which $x_{it} = x_{i,t-1}$, which rules out direct time effects as well as, for practical purposes, any continuous variable. The restriction to a single regressor limits the generality of the technique as well. The need for observations with equal values of $x_{it}$ is a considerable restriction, and the authors propose a kernel density estimator for the difference, $x_{it} - x_{i,t-1}$, instead, which does relax that restriction a bit. The end result is an estimator which converges (they conjecture) but to a nonnormal distribution and at a rate slower than $n^{-1/3}$.

    Semiparametric estimators for dynamic models at this point in the development are still primarily of theoretical interest. Models that extend the parametric formulations to include state dependence have a much longer history, including Heckman (1978, 1981a, 1981b), Heckman and MaCurdy (1980), Jakubson (1988), Keane (1993), and Beck et al. (2001), to name a few. [48] In general, even without heterogeneity, dynamic models ultimately involve modeling the joint outcome $(y_{i0}, \ldots, y_{iT})$, which necessitates some treatment involving multivariate integration. Example 21.10 describes a recent application.
    Example 21.10 An Intertemporal Labor Force Participation Equation

    Hyslop (1999) presents a model of the labor force participation of married women. The focus of the study is the high degree of persistence in the participation decision. Data used in the study were the years 1979-1985 of the Panel Study of Income Dynamics. A sample of 1,812 continuously married couples was studied. Exogenous variables which appeared in the model were measures of permanent and transitory income and fertility captured in yearly counts of the number of children from 0-2, 3-5, and 6-17 years old. Hyslop's formulation, in general terms, is

    (initial condition) \[ y_{i0} = \mathbf{1}(x_{i0}'\beta_0 + v_{i0} > 0), \]
    (dynamic model) \[ y_{it} = \mathbf{1}(x_{it}'\beta + \gamma y_{i,t-1} + \alpha_i + v_{it} > 0), \]
    (heterogeneity correlated with participation) \[ \alpha_i = z_i'\delta + \eta_i, \]
    (stochastic specification)
    \[ \eta_i \mid X_i \sim N[0, \sigma_\eta^2], \]
    \[ v_{i0} \mid X_i \sim N[0, \sigma_0^2], \]
    \[ w_{it} \mid X_i \sim N[0, \sigma_w^2], \]
    \[ v_{it} = \rho v_{i,t-1} + w_{it}, \qquad \sigma_\eta^2 + \sigma_w^2 = 1, \]
    \[ \text{Corr}[v_{i0}, v_{it}] = \rho_t, \quad t = 1, \ldots, T - 1. \]
    48. Beck et al. (2001) is a bit different from the others mentioned in that in their study of "state failure," they observe a large sample of countries (147) observed over a fairly large number of years, 40. As such, they are able to formulate their models in a way that makes the asymptotics with respect to T appropriate. They can analyze the data essentially in a time-series framework. Sepanski (2000) is another application which combines state dependence and the random coefficient specification of Akin, Guilkey, and Sickles (1979).


    The presence of the autocorrelation and state dependence in the model invalidates the simple maximum likelihood procedures we have examined earlier. The appropriate likelihood function is constructed by formulating the probabilities as
    \[ \text{Prob}(y_{i0}, y_{i1}, \ldots) = \text{Prob}(y_{i0}) \times \text{Prob}(y_{i1} \mid y_{i0}) \times \cdots \times \text{Prob}(y_{iT} \mid y_{i,T-1}). \]
    This still involves a T = 7 order normal integration, which is approximated in the study using a simulator similar to the GHK simulator discussed in E.4.2e. Among Hyslop's results are a comparison of the model fit by the simulator for the multivariate normal probabilities with the same model fit using the maximum simulated likelihood technique described in Section 17.8.

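    21.6 BIVARIATE AND MULTIVARIATE PROBIT MODELS

    In Chapter 14, we analyzed a number of different multiple-equation extensions of the classical and generalized regression model. A natural extension of the probit model would be to allow more than one equation, with correlated disturbances, in the same spirit as the seemingly unrelated regressions model. The general specification for a two-equation model would be
    \[
    \begin{aligned}
    y_1^* &= x_1'\beta_1 + \varepsilon_1, \qquad y_1 = 1 \text{ if } y_1^* > 0,\ 0 \text{ otherwise}, \\
    y_2^* &= x_2'\beta_2 + \varepsilon_2, \qquad y_2 = 1 \text{ if } y_2^* > 0,\ 0 \text{ otherwise}, \\
    & E[\varepsilon_1 \mid x_1, x_2] = E[\varepsilon_2 \mid x_1, x_2] = 0, \\
    & \text{Var}[\varepsilon_1 \mid x_1, x_2] = \text{Var}[\varepsilon_2 \mid x_1, x_2] = 1, \\
    & \text{Cov}[\varepsilon_1, \varepsilon_2 \mid x_1, x_2] = \rho.
    \end{aligned}
    \tag{21-41}
    \]

    21.6.1 MAXIMUM LIKELIHOOD ESTIMATION

    The bivariate normal cdf is
    \[ \text{Prob}(X_1 < x_1, X_2 < x_2) = \int_{-\infty}^{x_2} \int_{-\infty}^{x_1} \phi_2(z_1, z_2, \rho)\, dz_1\, dz_2, \]
    which we denote $\Phi_2(x_1, x_2, \rho)$. The density is
    \[ \phi_2(x_1, x_2, \rho) = \frac{e^{-(1/2)(x_1^2 + x_2^2 - 2\rho x_1 x_2)/(1 - \rho^2)}}{2\pi (1 - \rho^2)^{1/2}}. \quad [49] \]
    To construct the log-likelihood, let $q_{i1} = 2y_{i1} - 1$ and $q_{i2} = 2y_{i2} - 1$. Thus, $q_{ij} = 1$ if $y_{ij} = 1$ and $-1$ if $y_{ij} = 0$ for j = 1 and 2. Now let
    \[ z_{ij} = x_{ij}'\beta_j \quad \text{and} \quad w_{ij} = q_{ij} z_{ij}, \quad j = 1, 2, \]
    and
    \[ \rho_{i^*} = q_{i1} q_{i2} \rho. \]
    Note the notational convention. The subscript 2 is used to indicate the bivariate normal distribution in the density $\phi_2$ and cdf $\Phi_2$. In all other cases, the subscript 2 indicates the variables in the second equation above. As before, $\phi(\cdot)$ and $\Phi(\cdot)$ without subscripts denote the univariate standard normal density and cdf.

    The probabilities that enter the likelihood function are
    \[ \text{Prob}(Y_1 = y_{i1}, Y_2 = y_{i2} \mid x_1, x_2) = \Phi_2(w_{i1}, w_{i2}, \rho_{i^*}), \]
    which accounts for all the necessary sign changes needed to compute probabilities for ys equal to zero and one. Thus,
    \[ \log L = \sum_{i=1}^{n} \ln \Phi_2(w_{i1}, w_{i2}, \rho_{i^*}). \quad [50] \]
    The derivatives of the log-likelihood then reduce to
    \[
    \begin{aligned}
    \frac{\partial \ln L}{\partial \beta_j} &= \sum_{i=1}^{n} \left( \frac{q_{ij}\, g_{ij}}{\Phi_2} \right) x_{ij}, \quad j = 1, 2, \\
    \frac{\partial \ln L}{\partial \rho} &= \sum_{i=1}^{n} \frac{q_{i1} q_{i2}\, \phi_2}{\Phi_2},
    \end{aligned}
    \tag{21-42}
    \]
    where
    \[ g_{i1} = \phi(w_{i1})\, \Phi\left[ \frac{w_{i2} - \rho_{i^*} w_{i1}}{\sqrt{1 - \rho_{i^*}^2}} \right] \tag{21-43} \]
    and the subscripts 1 and 2 in $g_{i1}$ are reversed to obtain $g_{i2}$. Before considering the Hessian, it is useful to note what becomes of the preceding if $\rho = 0$. For $\partial \ln L / \partial \beta_1$, if $\rho = \rho_{i^*} = 0$, then $g_{i1}$ reduces to $\phi(w_{i1})\Phi(w_{i2})$, $\phi_2$ is $\phi(w_{i1})\phi(w_{i2})$, and $\Phi_2$ is $\Phi(w_{i1})\Phi(w_{i2})$. Inserting these results in (21-42) with $q_{i1}$ and $q_{i2}$ produces (21-21). Since both functions in $\partial \ln L / \partial \rho$ factor into the product of the univariate functions, $\partial \ln L / \partial \rho$ reduces to $\sum_{i=1}^{n} \lambda_{i1} \lambda_{i2}$, where $\lambda_{ij}$, j = 1, 2, is defined in (21-21). (This result will reappear in the LM statistic below.)

    The maximum likelihood estimates are obtained by simultaneously setting the three derivatives to zero. The second derivatives are relatively straightforward but tedious. Some simplifications are useful. Let
    \[
    \begin{aligned}
    \delta_i &= \frac{1}{\sqrt{1 - \rho_{i^*}^2}}, \\
    v_{i1} &= \delta_i (w_{i2} - \rho_{i^*} w_{i1}), \quad \text{so } g_{i1} = \phi(w_{i1}) \Phi(v_{i1}), \\
    v_{i2} &= \delta_i (w_{i1} - \rho_{i^*} w_{i2}), \quad \text{so } g_{i2} = \phi(w_{i2}) \Phi(v_{i2}).
    \end{aligned}
    \]
    By multiplying it out, you can show that
    \[ \delta_i\, \phi(w_{i1})\, \phi(v_{i1}) = \delta_i\, \phi(w_{i2})\, \phi(v_{i2}) = \phi_2. \]

    49. See Section B.9.

    50. To avoid further ambiguity, and for convenience, the observation subscript will be omitted from $\Phi_2 = \Phi_2(w_{i1}, w_{i2}, \rho_{i^*})$ and from $\phi_2 = \phi_2(w_{i1}, w_{i2}, \rho_{i^*})$.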


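    Then
    \[
    \begin{aligned}
    \frac{\partial^2 \log L}{\partial \beta_1 \partial \beta_1'} &= \sum_{i=1}^{n} x_{i1} x_{i1}' \left[ \frac{-w_{i1}\, g_{i1}}{\Phi_2} - \frac{\rho_{i^*} \phi_2}{\Phi_2} - \frac{g_{i1}^2}{\Phi_2^2} \right], \\
    \frac{\partial^2 \log L}{\partial \beta_1 \partial \beta_2'} &= \sum_{i=1}^{n} q_{i1} q_{i2}\, x_{i1} x_{i2}' \left[ \frac{\phi_2}{\Phi_2} - \frac{g_{i1}\, g_{i2}}{\Phi_2^2} \right], \\
    \frac{\partial^2 \log L}{\partial \beta_1 \partial \rho} &= \sum_{i=1}^{n} q_{i2}\, x_{i1} \frac{\phi_2}{\Phi_2} \left[ \rho_{i^*} \delta_i v_{i1} - w_{i1} - \frac{g_{i1}}{\Phi_2} \right], \\
    \frac{\partial^2 \log L}{\partial \rho^2} &= \sum_{i=1}^{n} \frac{\phi_2}{\Phi_2} \left[ \delta_i^2 \rho_{i^*} \left( 1 - w_i' R_i^{-1} w_i \right) + \delta_i^2 w_{i1} w_{i2} - \frac{\phi_2}{\Phi_2} \right],
    \end{aligned}
    \]
    where $w_i' R_i^{-1} w_i = \delta_i^2 (w_{i1}^2 + w_{i2}^2 - 2\rho_{i^*} w_{i1} w_{i2})$. (For $\beta_2$, change the subscripts in $\partial^2 \ln L / \partial \beta_1 \partial \beta_1'$ and $\partial^2 \ln L / \partial \beta_1 \partial \rho$ accordingly.) The complexity of the second derivatives for this model makes it an excellent candidate for the Berndt et al. estimator of the variance matrix of the maximum likelihood estimator.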
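    The log-likelihood and its derivative with respect to the correlation can be checked numerically. The sketch below is illustrative only. It assumes the index values $x_{ij}'\beta_j$ have already been computed, and it evaluates the bivariate normal cdf with a device that is not described in the text: the identity $\Phi_2(a, b, \rho) = \Phi(a)\Phi(b) + \int_0^\rho \phi_2(a, b, t)\, dt$, integrated by Simpson's rule, which is adequate for correlations not too close to plus or minus one. The function names and data are hypothetical.

```python
import math


def phi(x):
    """Univariate standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Univariate standard normal cdf."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def phi2(a, b, r):
    """Bivariate standard normal density with correlation r, |r| < 1."""
    d = 1.0 - r * r
    return math.exp(-0.5 * (a * a + b * b - 2.0 * r * a * b) / d) / (
        2.0 * math.pi * math.sqrt(d))


def Phi2(a, b, r, steps=200):
    """Bivariate normal cdf: Phi(a)Phi(b) plus Simpson integral of phi2 over [0, r]."""
    h = r / steps  # steps must be even; h is negative when r is negative
    total = phi2(a, b, 0.0) + phi2(a, b, r)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * phi2(a, b, k * h)
    return Phi(a) * Phi(b) + total * h / 3.0


def loglik_and_score_rho(index1, index2, y1, y2, rho):
    """Return log L and d log L / d rho for given index values x'beta."""
    ll, score = 0.0, 0.0
    for z1, z2, a, b in zip(index1, index2, y1, y2):
        q1, q2 = 2 * a - 1, 2 * b - 1
        w1, w2, r = q1 * z1, q2 * z2, q1 * q2 * rho
        p = Phi2(w1, w2, r)
        ll += math.log(p)
        score += q1 * q2 * phi2(w1, w2, r) / p
    return ll, score


# Made-up index values and outcomes, for illustration only.
index1 = [0.3, -0.4, 1.1, 0.0]
index2 = [-0.2, 0.5, 0.7, -1.0]
y1 = [1, 0, 1, 0]
y2 = [0, 0, 1, 1]
ll, score = loglik_and_score_rho(index1, index2, y1, y2, rho=0.4)
```

    A useful sanity check on any such implementation is that the four cell probabilities for one observation, obtained by running through the sign changes $q_{i1}, q_{i2} = \pm 1$, sum to one.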
    21.6.2 TESTING FOR ZERO CORRELATION

    The Lagrange multiplier statistic is a convenient device for testing for the absence of correlation in this model. Under the null hypothesis that $\rho$ equals zero, the model consists of independent probit equations, which can be estimated separately. Moreover, in the multivariate model, all the bivariate (or multivariate) densities and probabilities factor into the products of the marginals if the correlations are zero, which makes construction of the test statistic a simple matter of manipulating the results of the independent probits. The Lagrange multiplier statistic for testing $H_0\colon \rho = 0$ in a bivariate probit model is [51]
    \[ LM = \frac{\left[ \sum_{i=1}^{n} q_{i1} q_{i2} \dfrac{\phi(w_{i1}) \phi(w_{i2})}{\Phi(w_{i1}) \Phi(w_{i2})} \right]^2}{\sum_{i=1}^{n} \dfrac{[\phi(w_{i1}) \phi(w_{i2})]^2}{\Phi(w_{i1}) \Phi(-w_{i1}) \Phi(w_{i2}) \Phi(-w_{i2})}}. \]
    As usual, the advantage of the LM statistic is that it obviates computing the bivariate probit model. But the full unrestricted model is now fairly common in commercial software, so that advantage is minor. The likelihood ratio or Wald test can often be used with equal ease.
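    Because the LM statistic needs only the univariate normal density and cdf evaluated at the fitted indexes of the two separate probits, it is easy to compute. A minimal sketch follows, assuming the values $w_{ij} = q_{ij} x_{ij}'b_j$ have already been formed from the single-equation estimates. The function name is hypothetical.

```python
import math


def phi(x):
    """Univariate standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Univariate standard normal cdf."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def lm_statistic(w1, w2, q1, q2):
    """LM statistic for H0: rho = 0, from the results of two independent probits.

    w1, w2 hold w_ij = q_ij * x_ij'b_j; q1, q2 hold q_ij = 2*y_ij - 1.
    """
    numerator, denominator = 0.0, 0.0
    for a, b, qa, qb in zip(w1, w2, q1, q2):
        dens = phi(a) * phi(b)
        numerator += qa * qb * dens / (Phi(a) * Phi(b))
        denominator += dens ** 2 / (Phi(a) * Phi(-a) * Phi(b) * Phi(-b))
    return numerator ** 2 / denominator


# Two hand-checkable cases with all indexes equal to zero.
lm_one = lm_statistic([0.0], [0.0], [1], [1])
lm_cancel = lm_statistic([0.0, 0.0], [0.0, 0.0], [1, 1], [1, -1])
```

    The statistic would be compared with the chi-squared critical value for one degree of freedom, 3.84 at the 5 percent level.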
    21.6.3 MARGINAL EFFECTS

    There are several "marginal effects" one might want to evaluate in a bivariate probit model. [52] For convenience in evaluating them, we will define a vector $x = x_1 \cup x_2$ and let $x_1'\beta_1 = x'\gamma_1$. Thus, $\gamma_1$ contains all the nonzero elements of $\beta_1$ and possibly some zeros in the positions of variables in x that appear only in the other equation; $\gamma_2$ is defined likewise. The bivariate probability is
    \[ \text{Prob}[y_1 = 1, y_2 = 1 \mid x] = \Phi_2[x'\gamma_1, x'\gamma_2, \rho]. \]
    Signs are changed appropriately if the probability of the zero outcome is desired in either case. (See 21-41.) The marginal effects of changes in x on this probability are given by
    \[ \frac{\partial \Phi_2}{\partial x} = g_1 \gamma_1 + g_2 \gamma_2, \]
    where $g_1$ and $g_2$ are defined in (21-43). The familiar univariate cases will arise if $\rho = 0$, and effects specific to one equation or the other will be produced by zeros in the corresponding position in one or the other parameter vector. There are also some conditional mean functions to consider. The unconditional mean functions are given by the univariate probabilities:
    \[ E[y_j \mid x] = \Phi(x'\gamma_j), \quad j = 1, 2, \]
    so the analysis of (21-9) and (21-10) applies. One pair of conditional mean functions that might be of interest are
    \[ E[y_1 \mid y_2 = 1, x] = \text{Prob}[y_1 = 1 \mid y_2 = 1, x] = \frac{\text{Prob}[y_1 = 1, y_2 = 1 \mid x]}{\text{Prob}[y_2 = 1 \mid x]} = \frac{\Phi_2(x'\gamma_1, x'\gamma_2, \rho)}{\Phi(x'\gamma_2)} \]
    and similarly for $E[y_2 \mid y_1 = 1, x]$. The marginal effects for this function are given by
    \[ \frac{\partial E[y_1 \mid y_2 = 1, x]}{\partial x} = \frac{1}{\Phi(x'\gamma_2)} \left[ g_1 \gamma_1 + \left( g_2 - \Phi_2 \frac{\phi(x'\gamma_2)}{\Phi(x'\gamma_2)} \right) \gamma_2 \right]. \]
    Finally, one might construct the nonlinear conditional mean function
    \[ E[y_1 \mid y_2, x] = \frac{\Phi_2[x'\gamma_1, (2y_2 - 1)x'\gamma_2, (2y_2 - 1)\rho]}{\Phi[(2y_2 - 1)x'\gamma_2]}. \]
    The derivatives of this function are the same as those above, with sign changes in several places if $y_2 = 0$ is the argument.

    51. This is derived in Kiefer (1982).

    52. See Greene (1996b).
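    The expression for the marginal effects of the conditional mean $E[y_1 \mid y_2 = 1, x]$ can be verified against numerical derivatives. The sketch below is illustrative. The coefficient vectors and data point are invented, the function names are hypothetical, and the bivariate normal cdf is computed with an assumed device not taken from the text, namely $\Phi_2(a, b, \rho) = \Phi(a)\Phi(b) + \int_0^\rho \phi_2(a, b, t)\, dt$ evaluated by Simpson's rule.

```python
import math


def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def phi2(a, b, r):
    d = 1.0 - r * r
    return math.exp(-0.5 * (a * a + b * b - 2.0 * r * a * b) / d) / (
        2.0 * math.pi * math.sqrt(d))


def Phi2(a, b, r, steps=200):
    """Bivariate normal cdf via Simpson integration of phi2 over the correlation."""
    h = r / steps
    total = phi2(a, b, 0.0) + phi2(a, b, r)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * phi2(a, b, k * h)
    return Phi(a) * Phi(b) + total * h / 3.0


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def cond_mean(x, gamma1, gamma2, rho):
    """E[y1 | y2 = 1, x] = Phi2(x'g1, x'g2, rho) / Phi(x'g2)."""
    a, b = dot(x, gamma1), dot(x, gamma2)
    return Phi2(a, b, rho) / Phi(b)


def cond_mean_effects(x, gamma1, gamma2, rho):
    """Analytic derivative of E[y1 | y2 = 1, x] with respect to x."""
    a, b = dot(x, gamma1), dot(x, gamma2)
    d = math.sqrt(1.0 - rho * rho)
    g1 = phi(a) * Phi((b - rho * a) / d)
    g2 = phi(b) * Phi((a - rho * b) / d)
    adj = g2 - Phi2(a, b, rho) * phi(b) / Phi(b)
    return [(g1 * c1 + adj * c2) / Phi(b) for c1, c2 in zip(gamma1, gamma2)]


# Made-up values; the zero in gamma2 marks a variable that appears only in equation 1.
x = [1.0, 0.5, -0.3]
gamma1 = [0.2, 0.8, -0.5]
gamma2 = [-0.1, 0.0, 0.6]
effects = cond_mean_effects(x, gamma1, gamma2, rho=0.3)
```

    Note that a variable appearing only in the first equation still has a nonzero effect on the conditional mean, while its effect works entirely through $g_1 \gamma_1$.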
    21.6.4 SAMPLE SELECTION

    There are situations in which the observed variables in the bivariate probit model are censored in one way or another. For example, in an evaluation of credit scoring models, Boyes, Hoffman, and Low (1989) analyzed data generated by the following rule:
    \[
    \begin{aligned}
    y_1 &= 1 \text{ if individual } i \text{ defaults on a loan, 0 otherwise}, \\
    y_2 &= 1 \text{ if the individual is granted a loan, 0 otherwise}.
    \end{aligned}
    \]
    Greene (1992) applied the same model to $y_1$ = default on credit card loans, in which $y_2$ denotes whether an application for the card was accepted or not. For a given individual, $y_1$ is not observed unless $y_2$ equals one. Thus, there are three types of observations in the sample, with unconditional probabilities: [53]
    \[
    \begin{aligned}
    y_2 = 0&\colon & \text{Prob}(y_2 = 0 \mid x_1, x_2) &= 1 - \Phi(x_2'\beta_2), \\
    y_1 = 0, y_2 = 1&\colon & \text{Prob}(y_1 = 0, y_2 = 1 \mid x_1, x_2) &= \Phi_2[-x_1'\beta_1, x_2'\beta_2, -\rho], \\
    y_1 = 1, y_2 = 1&\colon & \text{Prob}(y_1 = 1, y_2 = 1 \mid x_1, x_2) &= \Phi_2[x_1'\beta_1, x_2'\beta_2, \rho].
    \end{aligned}
    \]
    The log-likelihood function is based on these probabilities. [54]
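    A short sketch shows how the three kinds of observations enter the log-likelihood. It is illustrative only: the index values are invented, the function names are hypothetical, an unobserved $y_1$ is represented by `None`, and the bivariate normal cdf is computed by an assumed device not taken from the text ($\Phi(a)\Phi(b)$ plus a Simpson's rule integral of the bivariate density over the correlation).

```python
import math


def Phi(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def phi2(a, b, r):
    d = 1.0 - r * r
    return math.exp(-0.5 * (a * a + b * b - 2.0 * r * a * b) / d) / (
        2.0 * math.pi * math.sqrt(d))


def Phi2(a, b, r, steps=200):
    """Bivariate normal cdf via Simpson integration of phi2 over the correlation."""
    h = r / steps
    total = phi2(a, b, 0.0) + phi2(a, b, r)
    for k in range(1, steps):
        total += (4.0 if k % 2 else 2.0) * phi2(a, b, k * h)
    return Phi(a) * Phi(b) + total * h / 3.0


def cell_probability(z1, z2, y1, y2, rho):
    """Probability of one observation; y1 is ignored (unobserved) when y2 == 0."""
    if y2 == 0:
        return 1.0 - Phi(z2)
    if y1 == 0:
        return Phi2(-z1, z2, -rho)
    return Phi2(z1, z2, rho)


def selection_loglik(index1, index2, y1, y2, rho):
    """Log-likelihood for the bivariate probit model with sample selection."""
    return sum(math.log(cell_probability(z1, z2, a, b, rho))
               for z1, z2, a, b in zip(index1, index2, y1, y2))


# Made-up index values x1'beta1 and x2'beta2; y1 is None where y2 == 0.
index1 = [0.4, -0.6, 0.1, 0.9]
index2 = [0.8, 0.2, -0.5, 1.2]
y1 = [1, 0, None, 0]
y2 = [1, 1, 0, 1]
ll = selection_loglik(index1, index2, y1, y2, rho=-0.3)
```

    For any fixed pair of index values, the three probabilities sum to one, which confirms that the three observation types exhaust the sample space.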
    21.6.5 A MULTIVARIATE PROBIT MODEL

    In principle, a multivariate model would extend (21-41) to more than two outcome variables just by adding equations. The practical obstacle to such an extension is primarily the evaluation of higher-order multivariate normal integrals. Some progress has been made on using quadrature for trivariate integration, but existing results are not sufficient to allow accurate and efficient evaluation for more than two variables in a sample of even moderate size. An altogether different approach has been used in recent applications. Lerman and Manski (1981) suggested that one might approximate multivariate normal probabilities by random sampling. For example, to approximate $\text{Prob}(y_1 > 1, y_2 < 3, y_3 < -1 \mid x_1, x_2, \rho_{12}, \rho_{13}, \rho_{23})$, we would simply draw random observations from this trivariate normal distribution (see Section E.5.6) and count the number of observations that satisfy the inequality. To obtain an accurate estimate of the probability, quite a large number of draws is required. Also, the substantive possibility of getting zero such draws in a finite number of draws is problematic. Nonetheless, the logic of the Lerman-Manski approach is sound. As discussed in Section E.5.6, recent developments have produced methods of producing quite accurate estimates of multivariate normal integrals based on this principle. The evaluation of multivariate normal integrals is generally a much less formidable obstacle to the estimation of models based on the multivariate normal distribution. [55]

    McFadden (1989) pointed out that for purposes of maximum likelihood estimation, accurate evaluation of probabilities is not necessarily the problem that needs to be solved. One can view the computation of the log-likelihood and its derivatives as a problem of estimating a mean. That is, in (21-41) and (21-42), the same problem arises if we divide by n. The idea is that even though the individual terms in the average might be in error, if the error has mean zero, then it will average out in the summation. The important insight, then, is that if we can obtain probability estimates that only err randomly both positively and negatively, then it may be possible to obtain an estimate of the log-likelihood and its derivatives that is reasonably close to the one that would result from actually computing the integral. From a practical standpoint, it does not take inordinately large numbers of random draws to achieve this result, which with the progress that has been made on Monte Carlo integration, has made feasible multivariate models that previously were intractable.

    The multivariate probit model in another form presents a useful extension of the probit model to panel data. The structural equation for the model would be
    \[ y_{it}^* = x_{it}'\beta + \varepsilon_{it}, \qquad y_{it} = 1 \text{ if } y_{it}^* > 0,\ 0 \text{ otherwise}, \quad i = 1, \ldots, n;\ t = 1, \ldots, T. \]
    The Butler and Moffitt approach for this model has proved useful in numerous applications. But the underlying assumption that $\text{Cov}[\varepsilon_{it}, \varepsilon_{is}] = \rho$ is a substantive restriction. By treating this structure as a multivariate probit model with a restriction that the coefficient vector be the same in every period, one can obtain a model with free correlations across periods. Hyslop (1999) and Greene (2002) are two applications.

    53. The model was first proposed by Wynand and van Praag (1981).

    54. Extensions of the bivariate probit model to other types of censoring are discussed in Poirier (1980) and Abowd and Farber (1982).

    55. Papers that propose improved methods of simulating probabilities include Pakes and Pollard (1989) and especially Börsch-Supan and Hajivassiliou (1990), Geweke (1989), and Keane (1994). A symposium in the November 1994 issue of Review of Economics and Statistics presents discussion of numerous issues in specification and estimation of models based on simulation of probabilities. Applications that employ simulation techniques for evaluation of multivariate normal integrals are now fairly numerous. See, for example, Hyslop (1999) (Example 21.10), who applies the technique to a panel data application with T = 7.
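    The frequency simulator suggested by Lerman and Manski amounts to drawing from the multivariate normal distribution and counting. The sketch below illustrates the idea for standardized variables with a given correlation matrix. It is not the refined simulator used in recent applications. The correlation values, the number of draws, the seed, and the function names are all assumptions made for the illustration.

```python
import math
import random


def cholesky(S):
    """Lower-triangular Cholesky factor of a positive definite matrix S."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(S[i][i] - s)
            else:
                L[i][j] = (S[i][j] - s) / L[j][j]
    return L


def frequency_simulator(lower, upper, corr, draws=20000, seed=12345):
    """Share of multivariate normal draws falling inside the box (lower, upper)."""
    rng = random.Random(seed)
    L = cholesky(corr)
    n = len(corr)
    hits = 0
    for _ in range(draws):
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        v = [sum(L[i][k] * e[k] for k in range(i + 1)) for i in range(n)]
        if all(lo < v_i < up for lo, v_i, up in zip(lower, v, upper)):
            hits += 1
    return hits / draws


inf = float("inf")
# Prob(v1 > 1, v2 < 3, v3 < -1) with hypothetical correlations 0.5, 0.3, 0.2.
corr3 = [[1.0, 0.5, 0.3], [0.5, 1.0, 0.2], [0.3, 0.2, 1.0]]
p3 = frequency_simulator([1.0, -inf, -inf], [inf, 3.0, -1.0], corr3)
# Bivariate check: Prob(v1 < 0, v2 < 0) equals 1/4 + arcsin(rho)/(2 pi).
p2 = frequency_simulator([-inf, -inf], [0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]])
```

    The estimate is a step function of the parameters and can equal zero for small probabilities, which are the practical drawbacks noted above.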
    21.6.6 APPLICATION: GENDER ECONOMICS COURSES IN LIBERAL ARTS COLLEGES

    Burnett (1997) proposed the following bivariate probit model for the presence of a gender economics course in the curriculum of a liberal arts college:
    \[ \text{Prob}[y_1 = 1, y_2 = 1 \mid x_1, x_2] = \Phi_2(x_1'\beta_1 + \gamma y_2, x_2'\beta_2, \rho). \]
    The dependent variables in the model are
    y1 = presence of a gender economics course,
    y2 = presence of a women's studies program on the campus.
    The independent variables in the model are
    z1 = constant term;
    z2 = academic reputation of the college, coded 1 (best), 2, ... to 141;
    z3 = size of the full-time economics faculty, a count;
    z4 = percentage of the economics faculty that are women, proportion (0 to 1);
    z5 = religious affiliation of the college, 0 = no, 1 = yes;
    z6 = percentage of the college faculty that are women, proportion (0 to 1);
    z7-z10 = regional dummy variables, south, midwest, northeast, west.
    The regressor vectors are
    x1 = (z1, z2, z3, z4, z5), x2 = (z2, z6, z5, z7-z10).

    Burnett's model illustrates a number of interesting aspects of the bivariate probit model. Note that this model is qualitatively different from the bivariate probit model in (21-41); the second dependent variable, $y_2$, appears on the right-hand side of the first equation. This model is a recursive, simultaneous-equations model. Surprisingly, the endogenous nature of one of the variables on the right-hand side of the first equation can be ignored in formulating the log-likelihood. [The model appears in Maddala (1983, p. 123).] We can establish this fact with the following (admittedly trivial) argument: The term that enters the log-likelihood is $P(y_1 = 1, y_2 = 1) = P(y_1 = 1 \mid y_2 = 1) P(y_2 = 1)$. Given the model as stated, the marginal probability for $y_2$ is just $\Phi(x_2'\beta_2)$, whereas the conditional probability is $\Phi_2(\cdot)/\Phi(x_2'\beta_2)$. The product returns the probability we had earlier. The other three terms in the log-likelihood are derived similarly, which produces (Maddala's results with some sign changes):
    \[
    \begin{aligned}
    P_{11} &= \Phi_2(x_1'\beta_1 + \gamma y_2, x_2'\beta_2, \rho), & P_{10} &= \Phi_2(x_1'\beta_1, -x_2'\beta_2, -\rho), \\
    P_{01} &= \Phi_2[-(x_1'\beta_1 + \gamma y_2), x_2'\beta_2, -\rho], & P_{00} &= \Phi_2(-x_1'\beta_1, -x_2'\beta_2, \rho).
    \end{aligned}
    \]
    These terms are exactly those of (21-41) that we obtain just by carrying $y_2$ in the first equation with no special attention to its endogenous nature. We can ignore the simultaneity in this model and we cannot in the linear regression model because, in this instance, we are maximizing the log-likelihood, whereas in the linear regression case, we are manipulating certain sample moments that do not converge to the necessary population parameters in the presence of simultaneity. Note that the same result is at work in Section 15.6.2, where the FIML estimator of the simultaneous equations model is obtained with the endogenous variables on the right-hand sides of the equations, but not by using ordinary least squares.

    The marginal effects in this model are fairly involved, and as before, we can consider several different types. Consider, for example, z2, academic reputation. There is a direct effect produced by its presence in the first equation, but there is also an indirect effect. Academic reputation enters the women's studies equation and, therefore, influences the probability that $y_2$ equals one. Since $y_2$ appears in the first equation, this effect is transmitted back to $y_1$. The total effect of academic reputation and, likewise, religious affiliation is the sum of these two parts. Consider first the gender economics variable, $y_1$. The conditional mean is
    \[
    \begin{aligned}
    E[y_1 \mid x_1, x_2] &= \text{Prob}[y_2 = 1]\, E[y_1 \mid y_2 = 1, x_1, x_2] + \text{Prob}[y_2 = 0]\, E[y_1 \mid y_2 = 0, x_1, x_2] \\
    &= \Phi_2(x_1'\beta_1 + \gamma y_2, x_2'\beta_2, \rho) + \Phi_2(x_1'\beta_1, -x_2'\beta_2, -\rho).
    \end{aligned}
    \]

    Derivatives can be computed using our earlier results. We are also interested in the effect of religious affiliation. Since this variable is binary, simply differentiating the conditional mean function may not produce an accurate result. Instead, we would compute the conditional mean function with this variable set to one and then zero, and take the difference. Finally, what is the effect of the presence of a women's studies program on the probability that the college will offer a gender economics course? To compute this effect, we would compute $\text{Prob}[y_1 = 1 \mid y_2 = 1, x_1, x_2] - \text{Prob}[y_1 = 1 \mid y_2 = 0, x_1, x_2]$. In all cases, standard errors for the estimated marginal effects can be computed using the delta method.

    Maximum likelihood estimates of the parameters of Burnett's model were computed by Greene (1998) using her sample of 132 liberal arts colleges; 31 of the schools offer gender economics, 58 have women's studies, and 29 have both. The estimated parameters are given in Table 21.7. Both bivariate probit and the single-equation estimates are given. The estimate of $\rho$ is only 0.1359, with a standard error of 1.2539. The Wald statistic for the test of the hypothesis that $\rho$ equals zero is $(0.1359/1.2539)^2 = 0.011753$. For a single restriction, the critical value from the chi-squared table is 3.84, so the hypothesis cannot be rejected. The likelihood ratio statistic for the same hypothesis is


    TABLE 21.7  Estimates of a Recursive Simultaneous Bivariate Probit Model (Estimated Standard Errors in Parentheses)

                        Single Equation               Bivariate Probit
    Variable         Coefficient  Standard Error   Coefficient  Standard Error

    Gender Economics Equation
    Constant           1.4176      (0.8069)          1.1911      (2.2155)
    AcRep              0.01143     (0.004081)        0.01233     (0.007937)
    WomStud            1.1095      (0.5674)          0.8835      (2.2603)
    EconFac            0.06730     (0.06874)         0.06769     (0.06952)
    PctWecon           2.5391      (0.9869)          2.5636      (1.0144)
    Relig              0.3482      (0.4984)          0.3741      (0.5265)

    Women's Studies Equation
    AcRep              0.01957     (0.005524)        0.01939     (0.005704)
    PctWfac            1.9429      (0.8435)          1.8914      (0.8714)
    Relig              0.4494      (0.3331)          0.4584      (0.3403)
    South              1.3597      (0.6594)          1.3471      (0.6897)
    West               2.3386      (0.8104)          2.3376      (0.8611)
    North              1.8867      (0.8204)          1.9009      (0.8495)
    Midwest            1.8248      (0.8723)          1.8070      (0.8952)

    rho                0.0000      (0.0000)          0.1359      (1.2539)
    Log L            -85.6458                      -85.6317

$2[-85.6317 - (-85.6458)] = 0.0282$, which leads to the same conclusion. The Lagrange multiplier statistic is 0.003807, which is consistent. This result might seem counterintuitive, given the setting. Surely "gender economics" and "women's studies" are highly correlated, but this finding does not contradict that proposition. The correlation coefficient measures the correlation between the disturbances in the equations, the omitted factors. That is, $\rho$ measures (roughly) the correlation between the outcomes after the influence of the included factors is accounted for. Thus, the value 0.13 measures the effect after the influence of women's studies is already accounted for. As discussed in the next paragraph, the proposition turns out to be right. The single most important determinant (at least within this model) of whether a gender economics course will be offered is indeed whether the college offers a women's studies program.

Table 21.8 presents the estimates of the marginal effects and some descriptive statistics for the data. The calculations were simplified slightly by using the restricted model with $\rho = 0$. Computations of the marginal effects still require the decomposition above, but they are simplified slightly by the result that if $\rho$ equals zero, then the bivariate probabilities factor into the products of the marginals. Numerically, the strongest effect appears to be exerted by the representation of women on the economics faculty (PctWecon); its coefficient of 0.4491 is by far the largest. This variable, however, cannot change by a full unit because it is a proportion. An increase of 1 percent in the presence of women on the economics faculty raises the probability by only 0.004, which is comparable in scale to the effect of academic reputation. The effect of women on the faculty as a whole (PctWfac) is likewise fairly small, only 0.0013 per 1 percent change. As might have been expected, the single most important influence is the presence of a women's studies program, which increases the likelihood of a gender economics course by a full 0.1863. Of course, the raw data would have anticipated this result; of the 31 schools that offer a gender economics course, 29 also


TABLE 21.8  Marginal Effects in Gender Economics Model

Variable     Direct      Indirect    Total      (Std. Error)   (Type of Variable, Mean)

Gender Economics Equation
AcRep        0.002022    0.001453    0.003476   (0.00126)      (Continuous, 119.242)
PctWecon     0.4491                  0.4491     (0.1568)       (Continuous, 0.24787)
EconFac      0.01190                 0.01190    (0.01292)      (Continuous, 6.74242)
Relig        0.07049     0.03227     0.1028     (0.1055)       (Binary, 0.57576)
WomStud      0.1863                  0.1863     (0.0868)       (Endogenous, 0.43939)
PctWfac                  0.13951     0.13951    (0.08916)      (Continuous, 0.35772)

Women's Studies Equation
AcRep        0.00754                 0.00754    (0.002187)     (Continuous, 119.242)
PctWfac      0.13789                 0.13789    (0.01002)      (Continuous, 0.35772)
Relig        0.13265                 0.13266    (0.18803)      (Binary, 0.57576)

have a women's studies program and only two do not. Note, finally, that the effect of religious affiliation (whatever it is) is mostly direct.

Before closing this application, we can use this opportunity to examine the fit measures listed in Section 21.4.5. We computed the various fit measures using seven different specifications of the gender economics equation:

1. Single-equation probit estimates, z1, z2, z3, z4, z5, y2
2. Bivariate probit model estimates, z1, z2, z3, z4, z5, y2
3. Single-equation probit estimates, z1, z2, z3, z4, z5
4. Single-equation probit estimates, z1, z3, z5, y2
5. Single-equation probit estimates, z1, z3, z5
6. Single-equation probit estimates, z1, z5
7. Single-equation probit estimates, z1 (constant only)

The specifications are in descending "quality" because we removed the most statistically significant variables from the model at each step. The values are listed in Table 21.9. The matrix below each column is the table of "hits" and "misses" of the prediction rule $\hat{y} = 1$ if $\hat{P} > 0.5$, 0 otherwise. Note that by construction, model 7 must predict all ones or all zeros. The column is the actual count and the row is the prediction. Thus, for model 1, 92 of 101 zeros were predicted correctly, whereas five of 31 ones were predicted incorrectly. As one would hope, the fit measures decline as the more significant

TABLE 21.9  Binary Choice Fit Measures

Measure               1         2         3         4         5         6         7

LRI                 0.573     0.535     0.495     0.407     0.279     0.206     0.000
$R^2_{BL}$          0.844     0.844     0.823     0.797     0.754     0.718     0.641
$\lambda$           0.565     0.560     0.526     0.444     0.319     0.216     0.000
$R^2_{EF}$          0.561     0.558     0.530     0.475     0.343     0.216     0.000
$R^2_{VZ}$          0.708     0.707     0.672     0.589     0.447     0.352     0.000
$R^2_{MZ}$          0.687     0.679     0.628     0.567     0.545     0.329     0.000

Predictions (each cell: actual 0, actual 1)
Predicted 0         92   5    93   5    92   8    94   8    98  16   101  31   101  31
Predicted 1          9  26     8  26     9  23     7  23     3  15     0   0     0   0


variables are removed from the model. The Ben-Akiva measure has an obvious flaw in that with only a constant term, the model still obtains a "fit" of 0.641. From the prediction matrices, it is clear that the explanatory power of the model, such as it is, comes from its ability to predict the ones correctly. The poorer the model, the greater the number of correct predictions of $y = 0$. But as this number rises, the number of incorrect predictions rises and the number of correct predictions of $y = 1$ declines. All the fit measures appear to react to this feature to some degree. The Efron and Cramer measures, which are nearly identical, and McFadden's LRI appear to be most sensitive to this, with the remaining two only slightly less consistent.
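The value 0.641 reported for the constant-only model can be reproduced from the outcome counts alone. The sketch below assumes that the Ben-Akiva and Lerman measure is the average predicted probability of the outcome that actually occurred, and that a constant-only model assigns every college the sample share 31/132; the variable names are illustrative only.

```python
import numpy as np

# Outcome counts from the application: 132 colleges, 31 offer the course.
n_ones, n_zeros = 31, 101
y = np.concatenate([np.ones(n_ones), np.zeros(n_zeros)])

# A constant-only binary model fits every observation with the sample share.
p_hat = np.full(y.size, y.mean())

# Assumed form of the Ben-Akiva and Lerman measure: the average predicted
# probability assigned to the outcome that actually occurred.
r2_bl = np.mean(y * p_hat + (1.0 - y) * (1.0 - p_hat))

# Hits and misses under the rule: predict 1 if p_hat > 0.5, else 0.
y_pred = (p_hat > 0.5).astype(int)
table = np.zeros((2, 2), dtype=int)      # rows: actual, columns: predicted
for actual, predicted in zip(y.astype(int), y_pred):
    table[actual, predicted] += 1

print(round(r2_bl, 3))
print(table)
```

Because the sample share is below 0.5, the rule predicts zero for every college, which is the pattern shown for model 7 in Table 21.9.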

21.7 LOGIT MODELS FOR MULTIPLE CHOICES

Some studies of multiple-choice settings include the following:

1. Hensher (1986), McFadden (1974), and many others have analyzed the travel mode of urban commuters.
2. Schmidt and Strauss (1975a,b) and Boskin (1974) have analyzed occupational choice among multiple alternatives.
3. Terza (1985) has studied the assignment of bond ratings to corporate bonds as a choice among multiple alternatives.

These are all distinct from the multivariate probit model we examined earlier. In that setting, there were several decisions, each between two alternatives. Here there is a single decision among two or more alternatives. We will examine two broad types of choice sets, ordered and unordered. The choice among means of getting to work (by car, bus, train, or bicycle) is clearly unordered. A bond rating is, by design, a ranking; that is its purpose. As we shall see, quite different techniques are used for the two types of models. Models for unordered choice sets are considered in this section. A model for ordered choices is described in Section 21.8.

Unordered-choice models can be motivated by a random utility model. For the $i$th consumer faced with $J$ choices, suppose that the utility of choice $j$ is
$$U_{ij} = \beta' z_{ij} + \varepsilon_{ij}.$$
If the consumer makes choice $j$ in particular, then we assume that $U_{ij}$ is the maximum among the $J$ utilities. Hence, the statistical model is driven by the probability that choice $j$ is made, which is
$$\text{Prob}(U_{ij} > U_{ik}) \quad \text{for all other } k \neq j.$$

The model is made operational by a particular choice of distribution for the disturbances. As before, two models have been considered, logit and probit. Because of the need to evaluate multiple integrals of the normal distribution, the probit model has found rather limited use in this setting. The logit model, in contrast, has been widely used in many fields, including economics, market research, and transportation engineering. Let $Y_i$ be a random variable that indicates the choice made. McFadden (1973) has shown that if (and only if) the $J$ disturbances are independent and identically distributed with


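type I extreme value (Gumbel) distribution,
$$F(\varepsilon_{ij}) = \exp(-e^{-\varepsilon_{ij}}),$$
then
$$\text{Prob}(Y_i = j) = \frac{e^{\beta' z_{ij}}}{\sum_{j=1}^{J} e^{\beta' z_{ij}}}, \tag{21-44}$$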

which leads to what is called the conditional logit model.$^{56}$ Utility depends on $z_{ij}$, which includes aspects specific to the individual as well as to the choices. It is useful to distinguish them. Let $z_{ij} = [x_{ij}, w_i]$. Then $x_{ij}$ varies across the choices and possibly across the individuals as well. The components of $x_{ij}$ are typically called the attributes of the choices. But $w_i$ contains the characteristics of the individual and is, therefore, the same for all choices. If we incorporate this fact in the model, then (21-44) becomes
$$\text{Prob}(Y_i = j) = \frac{e^{\beta' x_{ij} + \alpha' w_i}}{\sum_{j=1}^{J} e^{\beta' x_{ij} + \alpha' w_i}} = \frac{e^{\beta' x_{ij}}\, e^{\alpha' w_i}}{\left[\sum_{j=1}^{J} e^{\beta' x_{ij}}\right] e^{\alpha' w_i}}.$$
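Both results can be checked numerically. The following is a minimal sketch with made-up utilities: it verifies that adding a term common to all choices (playing the role of $\alpha' w_i$) leaves the logit probabilities unchanged, and it simulates utility-maximizing choices under i.i.d. type I extreme value disturbances to compare the choice frequencies with (21-44). The utilities, the shift 2.7, and the number of draws are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(12345)

# Hypothetical systematic utilities beta'x_ij for one consumer, J = 4 choices.
v = np.array([0.5, 0.0, -0.3, 1.0])

def logit_probs(v):
    """Choice probabilities of the form e^{v_j} / sum_j e^{v_j}."""
    e = np.exp(v - v.max())              # subtracting the max avoids overflow
    return e / e.sum()

p_logit = logit_probs(v)

# Adding an individual-specific term (the same for every choice) leaves
# the probabilities unchanged: it cancels from the ratio.
p_shifted = logit_probs(v + 2.7)

# Simulation check: draw i.i.d. type I extreme value disturbances and
# record which alternative has the largest utility in each replication.
n_draws = 200_000
eps = rng.gumbel(loc=0.0, scale=1.0, size=(n_draws, v.size))
choices = np.argmax(v + eps, axis=1)
p_sim = np.bincount(choices, minlength=v.size) / n_draws
max_gap = np.max(np.abs(p_sim - p_logit))

print(p_logit.round(3), p_sim.round(3))
```

The simulated frequencies should differ from the logit probabilities only by sampling error, which shrinks as the number of draws grows.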



Terms that do not vary across alternatives, that is, those specific to the individual, fall out of the probability. Evidently, if the model is to allow individual specific effects, then it must be modified. One method is to create a set of dummy variables for the choices and multiply each of them by the common $w$. We then allow the coefficient to vary across the choices instead of the characteristics. Analogously to the linear model, a complete set of interaction terms creates a singularity, so one of them must be dropped. For example, a model of a shopping center choice by individuals might specify that the choice depends on attributes of the shopping centers such as number of stores and distance from the central business district, both of which are the same for all individuals, and income, which varies across individuals. Suppose that there were three choices. The three regressor vectors would be as follows:

Choice 1    Stores    Distance    Income    0
Choice 2    Stores    Distance    0         Income
Choice 3    Stores    Distance    0         0
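The three regressor vectors can be assembled mechanically. The sketch below is one way to do it, with a hypothetical helper `choice_regressors` and made-up attribute values; income is interacted with dummy variables for all but the last choice, which is the dropped interaction.

```python
import numpy as np

def choice_regressors(stores, distance, income):
    """Stack the regressor vectors for one individual facing J centers.

    stores, distance: length-J arrays of center attributes.
    income: scalar characteristic of the individual.
    Income is interacted with dummies for the first J - 1 centers; the
    last center's interaction is dropped to avoid a singularity.
    """
    stores = np.asarray(stores, dtype=float)
    distance = np.asarray(distance, dtype=float)
    J = stores.size
    dummies = np.eye(J)[:, : J - 1]          # J x (J - 1) choice dummies
    return np.column_stack([stores, distance, income * dummies])

# Hypothetical attribute values for three centers and one individual.
X = choice_regressors(stores=[40, 25, 60], distance=[3.0, 1.5, 8.0], income=52.0)
print(X)
```

Each row of `X` corresponds to one choice, so the last two columns reproduce the Income and 0 pattern in the layout above.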

The data sets typically analyzed by economists do not contain mixtures of individual- and choice-specific attributes. Such data would be far too costly to gather for most purposes. When they do, the preceding framework can be used. For the present, it is useful to examine the two types of data separately and consider aspects of the model that are specific to the two types of applications.
21.7.1 THE MULTINOMIAL LOGIT MODEL

To set up the model that applies when data are individual specific, it will help to consider an example. Schmidt and Strauss (1975a,b) estimated a model of occupational choice based on a sample of 1000 observations drawn from the Public Use Sample for three years, 1960, 1967, and 1970. For each sample, the data for each individual in the sample consist of the following:

1. Occupation: 0 = menial, 1 = blue collar, 2 = craft, 3 = white collar, 4 = professional.
2. Regressors: constant, education, experience, race, sex.

The model for occupational choice is
$$\text{Prob}(Y_i = j) = \frac{e^{\beta_j' x_i}}{\sum_{k=0}^{4} e^{\beta_k' x_i}}, \quad j = 0, 1, \ldots, 4. \tag{21-45}$$

$^{56}$ It is occasionally labeled the multinomial logit model, but this wording conflicts with the usual name for the model discussed in the next section, which differs slightly. Although the distinction turns out to be purely artificial, we will maintain it for the present.

The binomial logit of Sections 21.3 and 21.4 is conveniently produced as the special case of $J = 1$. The model in (21-45) is a multinomial logit model.$^{57}$ The estimated equations provide a set of probabilities for the $J + 1$ choices for a decision maker with characteristics $x_i$. Before proceeding, we must remove an indeterminacy in the model. If we define $\beta_j^* = \beta_j + q$ for any vector $q$, then recomputing the probabilities defined below using $\beta_j^*$ instead of $\beta_j$ produces the identical set of probabilities because all the terms involving $q$ drop out. A convenient normalization that solves the problem is $\beta_0 = 0$. This arises because the probabilities sum to one, so only $J$ parameter vectors are needed to determine the $J + 1$ probabilities. Therefore, the probabilities are
$$\text{Prob}(Y_i = j \mid x_i) = \frac{e^{\beta_j' x_i}}{1 + \sum_{k=1}^{J} e^{\beta_k' x_i}} \quad \text{for } j = 0, 1, \ldots, J, \quad \beta_0 = 0. \tag{21-46}$$
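The indeterminacy and its removal can be illustrated with a few lines of code. This is a sketch on made-up data; the function name `mnl_probs`, the dimensions, and the vector `q` are assumptions chosen only for the illustration.

```python
import numpy as np

def mnl_probs(X, B):
    """Multinomial logit probabilities.

    X: (n, K) individual characteristics.
    B: (J + 1, K) coefficient vectors, one row per outcome 0..J.
    Returns an (n, J + 1) array whose rows sum to one.
    """
    V = X @ B.T
    V = V - V.max(axis=1, keepdims=True)     # guard against overflow
    expV = np.exp(V)
    return expV / expV.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])   # made-up data
B = rng.normal(size=(4, 3))                                  # J + 1 = 4 outcomes

# Adding the same vector q to every beta_j leaves the probabilities unchanged.
q = np.array([0.7, -1.2, 0.4])
P = mnl_probs(X, B)
P_shifted = mnl_probs(X, B + q)

# Choosing q = -beta_0 imposes the normalization beta_0 = 0.
B_norm = B - B[0]
P_norm = mnl_probs(X, B_norm)
```

With `B_norm`, the first row is zero and the denominator takes the form $1 + \sum_{k=1}^{J} e^{\beta_k' x_i}$ that appears in (21-46).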

The form of the binomial model examined in Section 21.4 results if $J = 1$. The model implies that we can compute $J$ log-odds ratios
$$\ln\left[\frac{P_{ij}}{P_{ik}}\right] = x_i'(\beta_j - \beta_k) = x_i'\beta_j \quad \text{if } k = 0.$$

From the point of view of estimation, it is useful that the odds ratio, $P_j/P_k$, does not depend on the other choices, which follows from the independence of disturbances in the original model. From a behavioral viewpoint, this fact is not very attractive. We shall return to this problem in Section 21.7.3.

The log-likelihood can be derived by defining, for each individual, $d_{ij} = 1$ if alternative $j$ is chosen by individual $i$, and 0 if not, for the $J + 1$ possible outcomes. Then, for each $i$, one and only one of the $d_{ij}$'s is 1. The log-likelihood is a generalization of that for the binomial probit or logit model:
$$\ln L = \sum_{i=1}^{n} \sum_{j=0}^{J} d_{ij} \ln \text{Prob}(Y_i = j).$$
The derivatives have the characteristically simple form
$$\frac{\partial \ln L}{\partial \beta_j} = \sum_{i} (d_{ij} - P_{ij})\, x_i \quad \text{for } j = 1, \ldots, J.$$

$^{57}$ Nerlove and Press (1973).
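As a check on the form of the derivatives, the following sketch evaluates the log-likelihood and the analytic gradient on simulated data and compares the gradient with a central finite difference. The data, dimensions, and function name are assumptions made for the illustration, not part of the original example.

```python
import numpy as np

def mnl_loglik_and_grad(theta, X, y, J):
    """Log likelihood and gradient for the multinomial logit model.

    theta: stacked (beta_1, ..., beta_J), length J * K; beta_0 is fixed at 0.
    X: (n, K) regressors; y: (n,) integer outcomes in 0..J.
    """
    n, K = X.shape
    B = np.vstack([np.zeros(K), theta.reshape(J, K)])
    V = X @ B.T
    V = V - V.max(axis=1, keepdims=True)
    P = np.exp(V)
    P /= P.sum(axis=1, keepdims=True)
    D = np.eye(J + 1)[y]                      # d_ij indicators
    loglik = np.sum(D * np.log(P))
    grad = ((D - P)[:, 1:].T @ X).ravel()     # sum_i (d_ij - P_ij) x_i, j = 1..J
    return loglik, grad

# Made-up data: n individuals, K regressors, outcomes 0..J.
rng = np.random.default_rng(1)
n, K, J = 200, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = rng.integers(0, J + 1, size=n)
theta = rng.normal(scale=0.5, size=J * K)

ll, g = mnl_loglik_and_grad(theta, X, y, J)

# Central finite differences, one parameter at a time.
h = 1e-6
g_num = np.array([
    (mnl_loglik_and_grad(theta + h * e, X, y, J)[0]
     - mnl_loglik_and_grad(theta - h * e, X, y, J)[0]) / (2 * h)
    for e in np.eye(J * K)
])
```

Agreement between `g` and `g_num` confirms that only the residuals $d_{ij} - P_{ij}$ and the regressors enter the first-order conditions.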


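The exact second derivatives matrix has $J^2$ $K \times K$ blocks,$^{58}$
$$\frac{\partial^2 \ln L}{\partial \beta_j\, \partial \beta_l'} = -\sum_{i=1}^{n} P_{ij}[\mathbf{1}(j = l) - P_{il}]\, x_i x_i',$$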

where $\mathbf{1}(j = l)$ equals 1 if $j$ equals $l$ and 0 if not. Since the Hessian does not involve $d_{ij}$, these are the expected values, and Newton's method is equivalent to the method of scoring. It is worth noting that the number of parameters in this model proliferates with the number of choices, which is unfortunate because the typical cross section sometimes involves a fairly large number of regressors.

The coefficients in this model are difficult to interpret. It is tempting to associate $\beta_j$ with the $j$th outcome, but that would be misleading. By differentiating (21-46), we find that the marginal effects of the characteristics on the probabilities are
$$\delta_j = \frac{\partial P_j}{\partial x_i} = P_j\left[\beta_j - \sum_{k=0}^{J} P_k \beta_k\right] = P_j[\beta_j - \bar{\beta}]. \tag{21-47}$$
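A small numerical sketch of (21-47) follows. The coefficient values and the point of evaluation are hypothetical; they are chosen so that one marginal effect has the opposite sign from its own coefficient.

```python
import numpy as np

def mnl_marginal_effects(x, B):
    """delta_j = P_j (beta_j - beta_bar) for one characteristics vector x.

    B: (J + 1, K) with row 0 equal to zero under the normalization.
    Returns (P, delta) with delta of shape (J + 1, K).
    """
    v = B @ x
    p = np.exp(v - v.max())
    p /= p.sum()
    beta_bar = p @ B                          # probability-weighted average
    return p, p[:, None] * (B - beta_bar)

# Hypothetical coefficients for three outcomes and two regressors.
B = np.array([[0.0, 0.0],
              [0.5, 1.0],
              [1.0, 3.0]])
x = np.array([1.0, 0.2])
p, delta = mnl_marginal_effects(x, B)

print(delta.round(4))
```

Here the coefficient on the second regressor for outcome 1 is positive, yet the corresponding marginal effect is negative, because that coefficient lies below the probability-weighted average $\bar{\beta}$. The effects also sum to zero across outcomes, as they must when the probabilities sum to one.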

Therefore, every subvector of $\beta$ enters every marginal effect, both through the probabilities and through the weighted average that appears in $\delta_j$. These values can be computed from the parameter estimates. Although the usual focus is on the coefficient estimates, equation (21-47) suggests that there is at least some potential for confusion. Note, for example, that for any particular $x_k$, $\partial P_j/\partial x_k$ need not have the same sign as $\beta_{jk}$. Standard errors can be estimated using the delta method. (See Section 5.2.4.) For purposes of the computation, let $\beta = [0, \beta_1', \beta_2', \ldots, \beta_J']'$. We include the fixed 0 vector for outcome 0 because although $\beta_0 = 0$, $\delta_0 = -P_0\bar{\beta}$, which is not 0. Note as well that Asy. Cov$[\hat{\beta}_0, \hat{\beta}_j] = 0$ for $j = 0, \ldots, J$. Then
$$\text{Asy. Var}[\hat{\delta}_j] = \sum_{l=0}^{J} \sum_{m=0}^{J} \left(\frac{\partial \delta_j}{\partial \beta_l'}\right) \text{Asy. Cov}[\hat{\beta}_l, \hat{\beta}_m] \left(\frac{\partial \delta_j'}{\partial \beta_m}\right),$$
$$\frac{\partial \delta_j}{\partial \beta_l'} = [\mathbf{1}(j = l) - P_l][P_j I + \delta_j x'] - P_j[\delta_l x'].$$

Finding adequate fit measures in this setting presents the same difficulties as in the binomial models. As before, it is useful to report the log-likelihood. If the model contains no covariates and no constant term, then the log-likelihood will be
$$\ln L_c = \sum_{j=0}^{J} n_j \ln\left(\frac{1}{J + 1}\right),$$

where $n_j$ is the number of individuals who choose outcome $j$. If the regressor vector includes only a constant term, then the restricted log-likelihood is
$$\ln L_0 = \sum_{j=0}^{J} n_j \ln\left(\frac{n_j}{n}\right) = \sum_{j=0}^{J} n_j \ln p_j,$$
where $p_j$ is the sample proportion of observations that make choice $j$. If desired, the likelihood ratio index can also be reported. A useful table will give a listing of hits and misses of the prediction rule "predict $Y_i = j$ if $\hat{P}_j$ is the maximum of the predicted probabilities."$^{59}$

$^{58}$ If the data were in the form of proportions, such as market shares, then the appropriate log-likelihood and derivatives are $\sum_i \sum_j n_i p_{ij} \ln P_{ij}$ and $\sum_i n_i (p_{ij} - P_{ij})\, x_i$, respectively. The terms in the Hessian are multiplied by $n_i$.
21.7.2 THE CONDITIONAL LOGIT MODEL

When the data consist of choice-specific attributes instead of individual-specific characteristics, the appropriate model is
$$\text{Prob}(Y_i = j \mid z_{i1}, z_{i2}, \ldots, z_{iJ}) = \frac{e^{\beta' z_{ij}}}{\sum_{j=1}^{J} e^{\beta' z_{ij}}}. \tag{21-48}$$

Here, in accordance with the convention in the literature, we let $j = 1, 2, \ldots, J$ for a total of $J$ alternatives. The model is otherwise essentially the same as the multinomial logit. Even more care will be required in interpreting the parameters, however. Once again, an example will help to focus ideas.

In this model, the coefficients are not directly tied to the marginal effects. The marginal effects for continuous variables can be obtained by differentiating (21-48) with respect to $x$ to obtain
$$\frac{\partial P_j}{\partial x_k} = [P_j(\mathbf{1}(j = k) - P_k)]\beta, \quad k = 1, \ldots, J.$$

(To avoid cluttering the notation, we have dropped the observation subscript.) It is clear that through its presence in $P_j$ and $P_k$, every attribute set $x_j$ affects all the probabilities. Hensher suggests that one might prefer to report elasticities of the probabilities. The effect of attribute $m$ of choice $k$ on $P_j$ would be
$$\frac{\partial \log P_j}{\partial \log x_{km}} = x_{km}[\mathbf{1}(j = k) - P_k]\beta_m.$$
Since there is no ambiguity about the scale of the probability itself, whether one should report the derivatives or the elasticities is largely a matter of taste. Some of Hensher's elasticity estimates are given in Table 21.16 later on in this chapter.

Estimation of the conditional logit model is simplest by Newton's method or the method of scoring. The log-likelihood is the same as for the multinomial logit model. Once again, we define $d_{ij} = 1$ if $Y_i = j$ and 0 otherwise. Then
$$\log L = \sum_{i=1}^{n} \sum_{j=1}^{J} d_{ij} \log \text{Prob}(Y_i = j).$$

Market share and frequency data are common in this setting. If the data are in this form, then the only change needed is, once again, to define $d_{ij}$ as the proportion or frequency.

$^{59}$ Unfortunately, it is common for this rule to predict all observations with the same value in an unbalanced sample or a model with little explanatory power.
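The derivative and elasticity expressions for the conditional logit model can be evaluated directly. The sketch below uses hypothetical cost and time attributes for three alternatives and arbitrary coefficients; the function names are assumptions made for the illustration.

```python
import numpy as np

def clogit_probs(X, beta):
    """Conditional logit probabilities for one individual.

    X: (J, K) attributes of the J alternatives; beta: (K,) common coefficients.
    """
    v = X @ beta
    p = np.exp(v - v.max())
    return p / p.sum()

def clogit_effects(X, beta, m):
    """Derivatives and elasticities of P_j with respect to attribute m of choice k.

    Returns two (J, J) arrays indexed [j, k].
    """
    p = clogit_probs(X, beta)
    J = p.size
    # dP_j / dx_km = P_j [1(j = k) - P_k] beta_m
    deriv = p[:, None] * (np.eye(J) - p[None, :]) * beta[m]
    # dlog P_j / dlog x_km = x_km [1(j = k) - P_k] beta_m
    elast = X[:, m][None, :] * (np.eye(J) - p[None, :]) * beta[m]
    return deriv, elast

# Hypothetical attributes (cost, time) for three alternatives.
X = np.array([[10.0, 30.0],
              [ 6.0, 45.0],
              [ 4.0, 60.0]])
beta = np.array([-0.10, -0.03])
deriv, elast = clogit_effects(X, beta, m=0)

print(deriv.round(4))
print(elast.round(4))
```

Each column of `deriv` sums to zero: a change in one alternative's attribute reallocates probability among the alternatives without changing the total.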


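Because of the simple form of $L$, the gradient and Hessian have particularly convenient forms: Let $\bar{x}_i = \sum_{j=1}^{J} P_{ij} x_{ij}$. Then,
$$\frac{\partial \log L}{\partial \beta} = \sum_{i=1}^{n} \sum_{j=1}^{J} d_{ij}(x_{ij} - \bar{x}_i),$$
$$\frac{\partial^2 \log L}{\partial \beta\, \partial \beta'} = -\sum_{i=1}^{n} \sum_{j=1}^{J} P_{ij}(x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)'.$$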
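These expressions translate directly into a Newton iteration. The following is a minimal sketch on simulated data, assuming the gradient and Hessian given above; the sample size, true coefficients, starting value, and convergence tolerance are arbitrary choices for the illustration.

```python
import numpy as np

def clogit_newton(X, y, tol=1e-10, max_iter=50):
    """Newton's method for the conditional logit model.

    X: (n, J, K) attributes x_ij; y: (n,) index of the chosen alternative.
    Returns the estimate of beta and the negative inverse Hessian from the
    final iteration.
    """
    n, J, K = X.shape
    D = np.eye(J)[y]                                   # d_ij
    beta = np.zeros(K)
    for _ in range(max_iter):
        V = X @ beta                                   # (n, J)
        P = np.exp(V - V.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        xbar = np.einsum("ij,ijk->ik", P, X)           # sum_j P_ij x_ij
        dev = X - xbar[:, None, :]                     # x_ij - xbar_i
        grad = np.einsum("ij,ijk->k", D, dev)
        hess = -np.einsum("ij,ijk,ijl->kl", P, dev, dev)
        step = np.linalg.solve(hess, grad)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta, np.linalg.inv(-hess)

# Simulated choices from a random utility model with Gumbel disturbances.
rng = np.random.default_rng(7)
n, J, K = 2000, 4, 2
X = rng.normal(size=(n, J, K))
beta_true = np.array([1.0, -0.5])
U = X @ beta_true + rng.gumbel(size=(n, J))
y = np.argmax(U, axis=1)

beta_hat, V_hat = clogit_newton(X, y)
print(beta_hat.round(3), np.sqrt(np.diag(V_hat)).round(3))
```

Because the Hessian does not involve $d_{ij}$, the same loop is also the method of scoring, and the negative inverse Hessian serves as the estimated asymptotic covariance matrix.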

The usual problems of fit measures appear here. The log-likelihood ratio and tabulation of actual versus predicted choices will be useful. There are two possible constrained log-likelihoods. Since the model cannot contain a constant term, the constraint $\beta = 0$ renders all probabilities equal to $1/J$. The constrained log-likelihood for this constraint is then $L_c = -n \ln J$. Of course, it is unlikely that this hypothesis would fail to be rejected. Alternatively, we could fit the model with only the $J - 1$ choice-specific constants, which makes the constrained log-likelihood the same as in the multinomial logit model, $\ln L_0^* = \sum_j n_j \ln p_j$, where, as before, $n_j$ is the number of individuals who choose alternative $j$.
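Both constrained log-likelihoods depend only on the choice counts. As an illustration, the sketch below evaluates them for the counts of travelers choosing air, train, bus, and car that appear later in Table 21.10 (58, 63, 30, and 59 of 210); the variable names are illustrative.

```python
import numpy as np

# Number choosing air, train, bus, car (Table 21.10).
n_j = np.array([58, 63, 30, 59])
n, J = n_j.sum(), n_j.size

# beta = 0: every probability equals 1/J.
lnL_c = -n * np.log(J)

# Choice-specific constants only: fitted probabilities equal sample shares.
p_j = n_j / n
lnL_0 = np.sum(n_j * np.log(p_j))

print(round(lnL_c, 2), round(lnL_0, 2))
```

The constants-only value is necessarily at least as large as the equal-probabilities value, since the sample shares maximize $\sum_j n_j \ln p_j$.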
21.7.3 THE INDEPENDENCE FROM IRRELEVANT ALTERNATIVES

We noted earlier that the odds ratios in the multinomial logit or conditional logit models are independent of the other alternatives. This property is convenient as regards estimation, but it is not a particularly appealing restriction to place on consumer behavior. The property of the logit model whereby $P_j/P_k$ is independent of the remaining probabilities is called the independence from irrelevant alternatives (IIA).

The independence assumption follows from the initial assumption that the disturbances are independent and homoscedastic. Later we will discuss several models that have been developed to relax this assumption. Before doing so, we consider a test that has been developed for testing the validity of the assumption. Hausman and McFadden (1984) suggest that if a subset of the choice set truly is irrelevant, omitting it from the model altogether will not change parameter estimates systematically. Exclusion of these choices will be inefficient but will not lead to inconsistency. But if the remaining odds ratios are not truly independent from these alternatives, then the parameter estimates obtained when these choices are included will be inconsistent. This observation is the usual basis for Hausman's specification test. The statistic is
$$\chi^2 = (\hat{\beta}_s - \hat{\beta}_f)'[\hat{V}_s - \hat{V}_f]^{-1}(\hat{\beta}_s - \hat{\beta}_f),$$
where $s$ indicates the estimators based on the restricted subset, $f$ indicates the estimator based on the full set of choices, and $\hat{V}_s$ and $\hat{V}_f$ are the respective estimates of the asymptotic covariance matrices. The statistic has a limiting chi-squared distribution with $K$ degrees of freedom.$^{60}$

$^{60}$ McFadden (1987) shows how this hypothesis can also be tested using a Lagrange multiplier test.
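The IIA property and the form of the test statistic can both be illustrated briefly. In the sketch below, the utilities, the coefficient estimates, and the covariance matrices are all made up for the purpose of showing the calculation; they are not results from any data set, and `hausman_iia` is a hypothetical helper name.

```python
import numpy as np

def hausman_iia(b_s, b_f, V_s, V_f):
    """Hausman-McFadden statistic (b_s - b_f)' [V_s - V_f]^{-1} (b_s - b_f)."""
    d = np.asarray(b_s) - np.asarray(b_f)
    return float(d @ np.linalg.solve(np.asarray(V_s) - np.asarray(V_f), d))

# IIA in the probabilities themselves: dropping an alternative leaves the
# odds between the remaining alternatives unchanged.
v = np.array([0.2, 1.0, -0.4, 0.6])          # hypothetical utilities beta'x_j
p_full = np.exp(v) / np.exp(v).sum()
p_sub = np.exp(v[:3]) / np.exp(v[:3]).sum()  # fourth alternative removed
odds_full = p_full[0] / p_full[1]
odds_sub = p_sub[0] / p_sub[1]

# Made-up estimates, only to show the call; not results from any data set.
b_f = np.array([-0.015, -0.095])
b_s = np.array([-0.020, -0.080])
V_f = np.array([[4.0e-5, 0.0], [0.0, 1.0e-4]])
V_s = np.array([[9.0e-5, 0.0], [0.0, 3.0e-4]])
chi2 = hausman_iia(b_s, b_f, V_s, V_f)

print(odds_full, odds_sub, chi2)
```

The statistic would be compared with the chi-squared critical value with $K$ degrees of freedom. In finite samples the difference $\hat{V}_s - \hat{V}_f$ need not be positive definite, in which case the statistic computed this way is not reliable.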

21.7.4 NESTED LOGIT MODELS

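If the independence from irrelevant alternatives test fails, then an alternative to the multinomial logit model will be needed. A natural alternative is a multivariate probit model:
$$U_j = \beta' x_j + \varepsilon_j, \quad j = 1, \ldots, J, \quad [\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_J] \sim N[0, \Sigma].$$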

We had considered this model earlier but found that as a general model of consumer choice, its failings were the practical difficulty of computing the multinormal integral and estimation of an unrestricted correlation matrix. Hausman and Wise (1978) point out that for a model of consumer choice, the probit model may not be as impractical as it might seem. First, for $J$ choices, the comparisons implicit in $U_j > U_k$ for $k \neq j$ involve the $J - 1$ differences, $\varepsilon_j - \varepsilon_k$. Thus, starting with a $J$-dimensional problem, we need only consider derivatives of $(J - 1)$-order probabilities. Therefore, to come to a concrete example, a model with four choices requires only the evaluation of bivariate normal integrals, which, albeit still complicated to estimate, is well within the received technology. For larger models, however, other specifications have proved more useful.

One way to relax the homoscedasticity assumption in the conditional logit model that also provides an intuitively appealing structure is to group the alternatives into subgroups that allow the variance to differ across the groups while maintaining the IIA assumption within the groups. This specification defines a nested logit model. To fix ideas, it is useful to think of this specification as a two- (or more) level choice problem (although, once again, the model arises as a modification of the stochastic specification in the original conditional logit model, not as a model of behavior). Suppose, then, that the $J$ alternatives can be divided into $L$ subgroups such that the choice set can be written
$$[c_1, \ldots, c_J] = (c_{1|1}, \ldots, c_{J_1|1}), \ldots, (c_{1|L}, \ldots, c_{J_L|L}).$$
Logically, we may think of the choice process as that of choosing among the $L$ choice sets and then making the specific choice within the chosen set. This method produces a tree structure, which for two branches and, say, five choices might look as follows:

Choice
    Branch 1:  $c_{1|1}$, $c_{2|1}$
    Branch 2:  $c_{1|2}$, $c_{2|2}$, $c_{3|2}$

Suppose as well that the data consist of observations on the attributes of the choices $x_{j|l}$ and attributes of the choice sets $z_l$. To derive the mathematical form of the model, we begin with the unconditional probability
$$\text{Prob}[\text{twig}_j, \text{branch}_l] = P_{jl} = \frac{e^{\beta' x_{j|l} + \gamma' z_l}}{\sum_{l=1}^{L} \sum_{j=1}^{J_l} e^{\beta' x_{j|l} + \gamma' z_l}}.$$




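Now write this probability as
$$P_{jl} = P_{j|l} P_l = \left(\frac{e^{\beta' x_{j|l}}}{\sum_{j=1}^{J_l} e^{\beta' x_{j|l}}}\right) \left(\frac{e^{\gamma' z_l}}{\sum_{l=1}^{L} e^{\gamma' z_l}}\right) \frac{\left(\sum_{j=1}^{J_l} e^{\beta' x_{j|l}}\right)\left(\sum_{l=1}^{L} e^{\gamma' z_l}\right)}{\sum_{l=1}^{L} \sum_{j=1}^{J_l} e^{\beta' x_{j|l} + \gamma' z_l}}.$$
Define the inclusive value for the $l$th branch as
$$I_l = \ln \sum_{j=1}^{J_l} e^{\beta' x_{j|l}}.$$
Then, after canceling terms and using this result, we find
$$P_{j|l} = \frac{e^{\beta' x_{j|l}}}{\sum_{j=1}^{J_l} e^{\beta' x_{j|l}}} \quad \text{and} \quad P_l = \frac{e^{\gamma' z_l + \tau_l I_l}}{\sum_{l=1}^{L} e^{\gamma' z_l + \tau_l I_l}},$$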



where the new parameters $\tau_l$ must equal 1 to produce the original model. Therefore, we use the restriction $\tau_l = 1$ to recover the conditional logit model, and the preceding equation just writes this model in another form. The nested logit model arises if this restriction is relaxed. The inclusive value coefficients, unrestricted in this fashion, allow the model to incorporate some degree of heteroscedasticity. Within each branch, the IIA restriction continues to hold. The equal variance of the disturbances within the $j$th branch is now$^{61}$
$$\sigma_j^2 = \frac{\pi^2}{6\tau_j}.$$
With $\tau_j = 1$, this reverts to the basic result for the multinomial logit model.

As usual, the coefficients in the model are not directly interpretable. The derivatives that describe covariation of the attributes and probabilities are
$$\frac{\partial \ln \text{Prob}[\text{choice}_c, \text{branch}_b]}{\partial x(k) \text{ in choice } C \text{ and branch } B} = \{\mathbf{1}(b = B)[\mathbf{1}(c = C) - P_{C|B}] + \tau_B[\mathbf{1}(b = B) - P_B]P_{C|B}\}\beta_k.$$
The nested logit model has been extended to three and higher levels. The complexity of the model increases geometrically with the number of levels. But the model has been found to be extremely flexible and is widely used for modeling consumer choice and in the marketing and transportation literatures, to name a few.

There are two ways to estimate the parameters of the nested logit model. A limited information, two-step maximum likelihood approach can be done as follows:

1. Estimate $\beta$ by treating the choice within branches as a simple conditional logit model.
2. Compute the inclusive values for all the branches in the model. Estimate $\gamma$ and the $\tau$ parameters by treating the choice among branches as a conditional logit model with attributes $z_l$ and $I_l$.

$^{61}$ See Hensher, Louviere, and Swaite (2000).
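The branch and twig probabilities defined in this section can be computed from the inclusive values in a few lines. The sketch below uses made-up utility indexes for a tree with two branches containing two and three choices; the function name and the values of $\tau_l$ are assumptions made for the illustration.

```python
import numpy as np

def nested_logit_probs(v_twig, v_branch, tau):
    """Joint probabilities P_jl = P_{j|l} * P_l for a two-level nested logit.

    v_twig: list of arrays; v_twig[l][j] = beta'x_{j|l}.
    v_branch: (L,) array of gamma'z_l.
    tau: (L,) inclusive-value coefficients.
    Returns a list of arrays with the same layout as v_twig.
    """
    incl = np.array([np.log(np.exp(v).sum()) for v in v_twig])   # I_l
    w = np.exp(v_branch + tau * incl)
    p_branch = w / w.sum()                                       # P_l
    return [p_branch[l] * np.exp(v) / np.exp(v).sum()            # P_{j|l} P_l
            for l, v in enumerate(v_twig)]

# Two branches with two and three twigs (made-up values).
v_twig = [np.array([0.3, -0.2]), np.array([0.1, 0.8, -0.5])]
v_branch = np.array([0.4, 0.0])

p_nested = np.concatenate(nested_logit_probs(v_twig, v_branch, np.array([0.6, 0.9])))
p_tau1 = np.concatenate(nested_logit_probs(v_twig, v_branch, np.ones(2)))

# With tau_l = 1 the model collapses to the conditional logit over all five choices.
v_all = np.concatenate([v + g for v, g in zip(v_twig, v_branch)])
p_clogit = np.exp(v_all) / np.exp(v_all).sum()

print(p_nested.round(4))
print(p_tau1.round(4))
```

Relaxing $\tau_l = 1$ changes the allocation of probability across branches but not the odds between two choices in the same branch, which is the sense in which IIA continues to hold within branches.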


Since this approach is a two-step estimator, the estimate of the asymptotic covariance matrix of the estimates at the second step must be corrected. [See Section 4.6, McFadden (1984), and Greene (1995a, Chapter 25).] For full information maximum likelihood (FIML) estimation of the model, the log-likelihood is
$$\ln L = \sum_{i=1}^{n} \ln[\text{Prob}(\text{twig} \mid \text{branch}) \times \text{Prob}(\text{branch})]_i.$$

The information matrix is not block diagonal in $\beta$ and $(\gamma, \tau)$, so FIML estimation will be more efficient than two-step estimation.

To specify the nested logit model, it is necessary to partition the choice set into branches. Sometimes there will be a natural partition, such as in the example given by Maddala (1983) when the choice of residence is made first by community, then by dwelling type within the community. In other instances, however, the partitioning of the choice set is ad hoc and leads to the troubling possibility that the results might be dependent on the branches so defined. (Many studies in this literature present several sets of results based on different specifications of the tree structure.) There is no well-defined testing procedure for discriminating among tree structures, which is a problematic aspect of the model.
21.7.5 A HETEROSCEDASTIC LOGIT MODEL

Bhat (1995) and Allenby and Ginter (1995) have developed an extension of the conditional logit model that works around the difficulty of specifying the tree for a nested model. Their model is based on the same random utility structure as before,
$$U_{ij} = \beta' x_{ij} + \varepsilon_{ij}.$$
The logit model arises from the assumption that $\varepsilon_{ij}$ has a homoscedastic extreme value (HEV) distribution with common variance $\pi^2/6$. The authors' proposed model simply relaxes the assumption of equal variances. Since the comparisons are all pairwise, one of the variances is set to 1.0; the same comparisons of utilities will result if all equations are multiplied by the same constant, so the indeterminacy is removed by setting one of the variances to one. The model that remains, then, is exactly as before, with the additional assumption that Var$[\varepsilon_{ij}] = \sigma_j$, with $\sigma_J = 1.0$.
21.7.6 MULTINOMIAL MODELS BASED ON THE NORMAL DISTRIBUTION

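A natural alternative model that relaxes the independence restrictions built into the multinomial logit (MNL) model is the multinomial probit (MNP) model. The structural equations of the MNP model are
$$U_j = \beta' x_j + \varepsilon_j, \quad j = 1, \ldots, J, \quad [\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_J] \sim N[0, \Sigma].$$

The term in the log-likelihood that corresponds to the choice of alternative $q$ is
$$\text{Prob}[\text{choice } q] = \text{Prob}[U_q > U_j,\ j = 1, \ldots, J,\ j \neq q].$$
The probability for this occurrence is
$$\text{Prob}[\text{choice } q] = \text{Prob}[\varepsilon_1 - \varepsilon_q < (x_q - x_1)'\beta, \ldots, \varepsilon_J - \varepsilon_q < (x_q - x_J)'\beta]$$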


for the $J - 1$ other choices, which is a cumulative probability from a $(J - 1)$-variate normal distribution. As in the HEV model, since we are only making comparisons, one of the variances in this $(J - 1)$-variate structure, that is, one of the diagonal elements in the reduced $\Sigma$, must be normalized to 1.0. Since only comparisons are ever observable in this model, for identification, $J - 1$ of the covariances must also be normalized, to zero. The MNP model allows an unrestricted $(J - 1) \times (J - 1)$ correlation structure and $J - 2$ free standard deviations for the disturbances in the model. (Thus, a two-choice model returns to the univariate probit model of Section 21.2.) For more than two choices, this specification is far more general than the MNL model, which assumes that $\Sigma = I$. (The scaling is absorbed in the coefficient vector in the MNL model.)

The main obstacle to implementation of the MNP model has been the difficulty in computing the multivariate normal probabilities for any dimensionality higher than 2. Recent results on accurate simulation of multinormal integrals, however, have made estimation of the MNP model feasible. (See Section E.5.6 and a symposium in the November 1994 issue of the Review of Economics and Statistics.) Yet some practical problems remain. Computation is exceedingly time consuming. It is also necessary to ensure that $\Sigma$ remain a positive definite matrix. One way often suggested is to construct the Cholesky decomposition of $\Sigma$, $LL'$, where $L$ is a lower triangular matrix, and estimate the elements of $L$. Maintaining the normalizations and zero restrictions will still be cumbersome, however. An alternative is to estimate the correlations, $R$, and a diagonal matrix of standard deviations, $S = \text{diag}(\sigma_1, \ldots, \sigma_{J-2}, 1, 1)$, separately. The normalizations, $R_{jj} = 1$, and exclusions, $R_{Jl} = 0$, are simple to impose, and $\Sigma$ is just $SRS$. $R$ is otherwise restricted only in that $-1 < R_{jl} < +1$. The resulting matrix must be positive definite. Identification appears to be a serious problem with the MNP model. Although the unrestricted MNP model is fully identified in principle, convergence to satisfactory results in applications with more than three choices appears to require many additional restrictions on the standard deviations and correlations, such as zero restrictions or equality restrictions in the case of the standard deviations.
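The point that bounding each correlation between $-1$ and $+1$ does not by itself guarantee a positive definite $\Sigma$ can be seen in a small example. The sketch below builds $\Sigma = SRS$ for a hypothetical four-choice case with the normalizations described above and uses an attempted Cholesky factorization as the check; all numerical values and helper names are assumptions.

```python
import numpy as np

def build_sigma(sd, R):
    """Sigma = S R S from standard deviations and a correlation matrix."""
    S = np.diag(sd)
    return S @ R @ S

def is_positive_definite(A):
    """Check positive definiteness by attempting a Cholesky factorization."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

# Hypothetical J = 4 example: S = diag(sigma_1, sigma_2, 1, 1), R_Jl = 0.
sd = np.array([1.5, 0.8, 1.0, 1.0])
R_ok = np.array([[ 1.0, 0.3, -0.2, 0.0],
                 [ 0.3, 1.0,  0.5, 0.0],
                 [-0.2, 0.5,  1.0, 0.0],
                 [ 0.0, 0.0,  0.0, 1.0]])
# Every entry lies inside (-1, 1), yet this is not a valid correlation matrix.
R_bad = np.array([[ 1.0, 0.9, -0.9, 0.0],
                  [ 0.9, 1.0,  0.9, 0.0],
                  [-0.9, 0.9,  1.0, 0.0],
                  [ 0.0, 0.0,  0.0, 1.0]])

sigma = build_sigma(sd, R_ok)
L = np.linalg.cholesky(sigma)          # lower triangular, sigma = L L'

print(is_positive_definite(sigma), is_positive_definite(build_sigma(sd, R_bad)))
```

Estimating the elements of $L$ directly guarantees a positive semidefinite $LL'$ by construction, which is the appeal of the Cholesky parameterization; the $SRS$ form makes the normalizations easier to impose but requires a check of this kind.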
21.7.7 A RANDOM PARAMETERS MODEL

Another variant of the multinomial logit model is the random parameters logit (RPL) model (also called the "mixed logit model"). [See Revelt and Train (1996); Bhat (1996); Berry, Levinsohn, and Pakes (1995); and Jain, Vilcassim, and Chintagunta (1994).] Train's formulation of the RPL model (which encompasses the others) is a modification of the MNL model. The model is a random coefficients formulation. The change to the basic MNL model is the parameter specification in the distribution of the parameters across individuals, $i$:
$$\beta_{ik} = \beta_k + z_i'\delta_k + \sigma_k u_{ik},$$
where $u_{ik}$ is normally distributed with correlation matrix $R$, $\sigma_k$ is the standard deviation of the distribution, $\beta_k + z_i'\delta_k$ is the mean of the distribution, and $z_i$ is a vector of person-specific characteristics (such as age and income) that do not vary across choices. This formulation contains all the earlier models. For example, if $\delta_k = 0$ for all the coefficients and $\sigma_k = 0$ for all the coefficients except for choice-specific constants, then the original MNL model with a normal-logistic mixture for the random part of the MNL model arises (hence the name).


The authors propose estimation of the model by simulating the log-likelihood function rather than direct integration to compute the probabilities, which would be infeasible because the mixture distribution composed of the original $\varepsilon_{ij}$ and the random part of the coefficient is unknown. For any individual,
$$\text{Prob}[\text{choice } q \mid u_i] = \text{MNL probability} \mid \beta_i(u_i),$$
with all restrictions imposed on the coefficients. The appropriate probability is
$$E_u[\text{Prob}(\text{choice } q \mid u)] = \int_{u_1, \ldots, u_k} \text{Prob}[\text{choice } q \mid u]\, f(u)\, du,$$
which can be estimated by simulation, using
$$\text{Est. } E_u[\text{Prob}(\text{choice } q \mid u)] = \frac{1}{R} \sum_{r=1}^{R} \text{Prob}[\text{choice } q \mid \hat{\beta}_i(e_{ir})],$$
where $e_{ir}$ is the $r$th of $R$ draws for observation $i$. (There are $nkR$ draws in total. The draws for observation $i$ must be the same from one computation to the next, which can be accomplished by assigning to each individual their own seed for the random number generator and restarting it each time the probability is to be computed.) By this method, the log-likelihood and its derivatives with respect to $(\beta_k, \delta_k, \sigma_k)$, $k = 1, \ldots, K$, and $R$ are simulated to find the values that maximize the simulated log-likelihood. This is precisely the approach we used in Example 17.10.

The RPL model enjoys a considerable advantage not available in any of the other forms suggested. In a panel data setting, one can formulate a random effects model simply by making the variation in the coefficients time invariant. Thus, the model is changed to
$$U_{ijt} = \beta_{it}' x_{ijt} + \varepsilon_{ijt}, \quad i = 1, \ldots, n,\ j = 1, \ldots, J,\ t = 1, \ldots, T,$$
$$\beta_{it,k} = \beta_k + z_{it}'\delta_k + \sigma_k u_{ik}.$$
The time variation in the coefficients is provided by the choice-invariant variables, which may change through time. Habit persistence is carried by the time-invariant random effect, $u_{ik}$. If only the constant terms vary and they are assumed to be uncorrelated, then this is logically equivalent to the familiar random effects model. But much greater generality can be achieved by allowing the other coefficients to vary randomly across individuals and by allowing correlation of these effects.$^{62}$
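The simulation step for a single individual can be sketched as follows. This is a simplified illustration, not the authors' implementation: it omits the $z_i'\delta_k$ term, treats the random components as uncorrelated, and uses made-up attributes and parameter values. The function names and the per-individual `seed` argument are assumptions.

```python
import numpy as np

def mnl_prob(X, beta, q):
    """Logit probability of choice q given attributes X (J, K) and beta (K,)."""
    v = X @ beta
    p = np.exp(v - v.max())
    return p[q] / p.sum()

def simulated_prob(X, q, beta, sigma, seed, R=500):
    """Average the logit probability over R draws of beta_i = beta + sigma * u.

    The generator is restarted from the individual's own seed on every call,
    so repeated evaluations reuse the same draws.
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(size=(R, beta.size))
    return np.mean([mnl_prob(X, beta + sigma * u_r, q) for u_r in u])

# Hypothetical attributes for three choices and two coefficients.
X = np.array([[1.0, 0.5], [0.0, 1.2], [0.0, -0.3]])
beta = np.array([0.4, -0.8])
sigma = np.array([0.0, 0.5])          # only the second coefficient is random

p_a = simulated_prob(X, 0, beta, sigma, seed=101)
p_b = simulated_prob(X, 0, beta, sigma, seed=101)    # same seed, same draws
p_fixed = simulated_prob(X, 0, beta, np.zeros(2), seed=101)

print(p_a, p_b, p_fixed)
```

Reusing the same draws at every evaluation keeps the simulated log-likelihood a smooth function of the parameters, which is what allows a gradient-based optimizer to be applied to it. With all standard deviations set to zero, the simulated probability reduces to the ordinary logit probability.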
    21.7.8 APPLICATION: CONDITIONAL LOGIT MODEL FOR TRAVEL MODE CHOICE

    Hensher and Greene [Greene (1995a)] report estimates of a model of travel mode choice for travel between Sydney and Melbourne, Australia. The data set contains 210 observations on choice among four travel modes: air, train, bus, and car. (See Appendix Table F21.2.) The attributes used for their example were: choice-specific constants; two choice-specific continuous measures, GC, a measure of the generalized cost of the travel that is equal to the sum of in-vehicle cost, INVC, and a wagelike measure times INVT, the amount of time spent traveling, and TTME, the terminal time (zero for car); and, for the choice between air and the other modes, HINC, the household income. A summary of the sample data is given in Table 21.10. The sample is choice based so as to balance it among the four choices; the true population allocation, as shown in the last column of Table 21.10, is dominated by drivers.

    62 See Hensher (2001) for an application to transportation mode choice in which each individual is observed in several choice situations.

    TABLE 21.10  Summary Statistics for Travel Mode Choice Data

             GC        TTME     INVC     INVT      HINC     Number Choosing   p      True prop.
    Air      102.648   61.010   85.522   133.710   34.548   58                0.28   0.14
             113.522   46.534   97.569   124.828   41.274
    Train    130.200   35.690   51.338   608.286   34.548   63                0.30   0.13
             106.619   28.524   37.460   532.667   23.063
    Bus      115.257   41.650   33.457   629.462   34.548   30                0.14   0.09
             108.133   25.200   33.733   618.833   29.700
    Car       94.414    0        20.995   573.205   34.548   59                0.28   0.64
              89.095    0        15.694   527.373   42.220

    Note: The upper figure is the average for all 210 observations. The lower figure is the mean for the observations that made that choice.

    The model specified is

    $$U_{ij} = \alpha_{air} d_{i,air} + \alpha_{train} d_{i,train} + \alpha_{bus} d_{i,bus} + \beta_G GC_{ij} + \beta_T TTME_{ij} + \gamma_H d_{i,air} HINC_i + \varepsilon_{ij},$$

    where for each $j$, $\varepsilon_{ij}$ has the same independent, type 1 extreme value distribution,

    $$F_\varepsilon(\varepsilon_{ij}) = \exp(-\exp(-\varepsilon_{ij})),$$

    which has variance $\pi^2/6$ (standard deviation $\pi/\sqrt{6} \approx 1.283$). The mean is absorbed in the constants. Estimates of the conditional logit model are shown in Table 21.11. The model was fit with and without the corrections for choice-based sampling. Since the sample shares do not differ radically from the population proportions, the effect on the estimated parameters is fairly modest. Nonetheless, it is apparent that the choice-based sampling is not completely innocent. A cross tabulation of the predicted versus actual outcomes is given in Table 21.12. The predictions are generated by tabulating the integer parts of $m_{jk} = \sum_{i=1}^{210} \hat{p}_{ij} d_{ik}$,
    TABLE 21.11  Parameter Estimates (t Values in Parentheses)

                                        Unweighted Sample       Choice-Based Weighting
                                        Estimate    t Ratio     Estimate    t Ratio
    $\beta_G$                           0.15501     (3.517)     0.01333     (2.724)
    $\beta_T$                           0.19612     (9.207)     0.13405     (7.164)
    $\gamma_H$                          0.01329     (1.295)     0.00108     (0.087)
    $\alpha_{air}$                      5.2074      (6.684)     6.5940      (5.906)
    $\alpha_{train}$                    3.8690      (8.731)     3.6190      (7.447)
    $\alpha_{bus}$                      3.1632      (7.025)     3.3218      (5.698)
    Log-likelihood at $\beta = 0$       -291.1218               -291.1218
    Log-likelihood (sample shares)      -283.7588               -223.0578
    Log-likelihood at convergence       -199.1284               -147.5896


    TABLE 21.12  Predicted Choices Based on Model Probabilities (Predictions Based on Choice-Based Sampling Are in Parentheses)

                         Air        Train      Bus        Car         Total (Actual)
    Air                  32 (30)     8 (3)      5 (3)     13 (23)      58
    Train                 7 (3)     37 (30)     5 (3)     14 (27)      63
    Bus                   3 (1)      5 (2)     15 (4)      6 (12)      30
    Car                  16 (5)     13 (5)      6 (3)     25 (45)      59
    Total (Predicted)    58 (39)    63 (40)    30 (23)    59 (108)    210

    $j, k$ = air, train, bus, car, where $\hat{p}_{ij}$ is the predicted probability of outcome $j$ for observation $i$ and $d_{ik}$ is the binary variable which indicates if individual $i$ made choice $k$.

    Are the odds ratios train/bus and car/bus really independent from the presence of the air alternative? To use the Hausman test, we would eliminate choice air from the choice set and estimate a three-choice model. Since 58 respondents chose this mode, we would lose 58 observations. In addition, for every data vector left in the sample, the air-specific constant and the interaction, $d_{i,air} \times HINC_i$, would be zero. Thus, these parameters could not be estimated in the restricted model. We would drop these variables. The test would be based on the two estimators of the remaining four coefficients in the model, $[\beta_G, \beta_T, \alpha_{train}, \alpha_{bus}]$. The results for the test are as shown in Table 21.13. The hypothesis that the odds ratios for the other three choices are independent from air would be rejected based on these results, as the chi-squared statistic exceeds the critical value. Since IIA was rejected, they estimated a nested logit model of the following type:

    Travel
      FLY:     AIR
      GROUND:  TRAIN, BUS, CAR
    Determinants: Income (branch level); G cost, T time (choice level)
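    The cross tabulation $m_{jk} = \sum_i \hat{p}_{ij} d_{ik}$ used for Table 21.12 can be sketched as follows. This is an illustrative sketch only; the function name `prediction_table` and the random probabilities are assumptions, not the Sydney-Melbourne data.

```python
import numpy as np

def prediction_table(P, choice):
    """Rows index the actual choice k, columns the predicted outcome j."""
    n, J = P.shape
    D = np.zeros((n, J))
    D[np.arange(n), choice] = 1.0      # d_ik = 1 if individual i chose k
    M = D.T @ P                        # M[k, j] = sum_i d_ik * p_ij
    return np.floor(M).astype(int), M  # integer parts, and the exact sums

# Made-up probabilities and choices for 210 individuals and 4 alternatives.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=210)    # each row sums to one
choice = rng.integers(0, 4, size=210)
table, M = prediction_table(P, choice)
```

    Each row of the exact matrix sums to the number of individuals who actually made that choice; taking integer parts is why the printed rows can fall one short of the actual totals.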

    TABLE 21.13  Results for IIA Test

                                  Full Choice Set                                  Restricted Choice Set
                        $\beta_G$   $\beta_T$   $\alpha_{train}$  $\alpha_{bus}$    $\beta_G$    $\beta_T$   $\alpha_{train}$  $\alpha_{bus}$
    Estimate            0.0155      0.0961      3.869             3.163             0.0639       0.0699      4.464             3.105

    Estimated Asymptotic Covariance Matrix (lower triangle)
    $\beta_G$           0.194e-5                                                    0.000101
    $\beta_T$           0.46e-7     0.000110                                        0.0000013    0.000221
    $\alpha_{train}$    0.00060     0.0038      0.196                               0.000244     0.00759     0.410
    $\alpha_{bus}$      0.00026     0.0037      0.161             0.203             0.000113     0.00753     0.336             0.371

    Note: 0.nnne-p indicates times 10 to the negative p power. H = 33.3363. Critical chi-squared[4] = 9.488.


    TABLE 21.14  Estimates of a Mode Choice Model (Standard Errors in Parentheses)

    Parameter            FIML Estimate          LIML Estimate           Unconditional
    $\alpha_{air}$       6.042    (1.199)       0.0647   (2.1485)       5.207    (0.779)
    $\alpha_{bus}$       4.096    (0.615)       3.105    (0.609)        3.163    (0.450)
    $\alpha_{train}$     5.065    (0.662)       4.464    (0.641)        3.869    (0.443)
    $\beta_{GC}$         0.03159  (0.00816)     0.06368  (0.0100)       0.1550   (0.00441)
    $\beta_{TTME}$       0.1126   (0.0141)      0.0699   (0.0149)       0.09612  (0.0104)
    $\gamma_H$           0.01533  (0.00938)     0.02079  (0.01128)      0.01329  (0.0103)
    $\tau_{fly}$         0.5860   (0.141)       0.2266   (0.296)        1.0000   (0.000)
    $\tau_{ground}$      0.3890   (0.124)       0.1587   (0.262)        1.0000   (0.000)
    $\sigma_{fly}$       2.1886   (0.525)       5.675    (2.350)        1.2825   (0.000)
    $\sigma_{ground}$    3.2974   (1.048)       8.081    (4.219)        1.2825   (0.000)
    log L                -193.6561              -115.3354, -87.9382     -199.1284

    Note that one of the branches has only a single choice, so the conditional probability, $P_{j|fly} = P_{air|fly} = 1$. The model is fit by both FIML and LIML methods. Three sets of estimates are shown in Table 21.14. The set marked "unconditional" are the simple conditional (multinomial) logit (MNL) model for choice among the four alternatives that was reported earlier. Both inclusive value parameters are constrained (by construction) to equal 1.0000. The FIML estimates are obtained by maximizing the full log-likelihood for the nested logit model. In this model,

    $$\text{Prob(choice} \mid \text{branch)} = P(\alpha_{air} d_{air} + \alpha_{train} d_{train} + \alpha_{bus} d_{bus} + \beta_G GC + \beta_T TTME),$$
    $$\text{Prob(branch)} = P(\gamma d_{air} HINC + \tau_{fly} IV_{fly} + \tau_{ground} IV_{ground}),$$
    $$\text{Prob(choice, branch)} = \text{Prob(choice} \mid \text{branch)} \times \text{Prob(branch)}.$$

    Finally, the limited information estimator is estimated in two steps. At the first step, a choice model is estimated for the three choices in the ground branch:

    $$\text{Prob(choice} \mid \text{ground)} = P(\alpha_{train} d_{train} + \alpha_{bus} d_{bus} + \beta_G GC + \beta_T TTME).$$

    This model uses only the observations that chose one of the three ground modes; for these data, this subset was 152 of the 210 observations. Using the estimates from this model, we compute, for all 210 observations, $IV_{fly} = \log[\exp(\mathbf{z}_{air}'\hat{\boldsymbol\beta})]$ for air and 0 for ground, and $IV_{ground} = \log[\sum_{j \in ground} \exp(\mathbf{z}_j'\hat{\boldsymbol\beta})]$ for ground modes and 0 for air. Then, the choice model

    $$\text{Prob(branch)} = P(\alpha_{air} d_{air} + \gamma_H d_{air} HINC + \tau_{fly} IV_{fly} + \tau_{ground} IV_{ground})$$

    is fit separately. Since the Hessian is not block diagonal, the FIML estimator is more efficient. To obtain appropriate standard errors, we must make the Murphy and Topel correction for two-step estimation; see Section 17.7 and Theorem 17.8. It is simplified a bit here because different samples are used for the two steps. As such, the matrix $\mathbf{R}$ in the theorem is not computed. To compute $\mathbf{C}$, we require the matrix of derivatives of log Prob(branch) with respect to the direct parameters, $\alpha_{air}, \gamma_H, \tau_{fly}, \tau_{ground}$, and with respect to the choice parameters, $\boldsymbol\beta$. Since this model is a simple binomial (two-choice) logit model, these are easy to compute, using (21-19). Then the corrected asymptotic covariance matrix is computed using Theorem 17.8 with $\mathbf{R} = \mathbf{0}$.


    TABLE 21.15  Estimates of a Heteroscedastic Extreme Value Model (Standard Errors in Parentheses)

    Parameter            HEV Estimate           Nested Logit Estimate    Restricted HEV
    $\alpha_{air}$       7.8326   (10.951)      6.062    (1.199)         2.973    (0.995)
    $\alpha_{bus}$       7.1718   (9.135)       4.096    (0.615)         4.050    (0.494)
    $\alpha_{train}$     6.8655   (8.829)       5.065    (0.662)         3.042    (0.429)
    $\beta_{GC}$         0.05156  (0.0694)      0.03159  (0.00816)       0.0289   (0.00580)
    $\beta_{TTME}$       0.1968   (0.288)       0.1126   (0.0141)        0.0828   (0.00576)
    $\gamma_H$           0.04024  (0.0607)      0.01533  (0.00938)       0.0238   (0.0186)
    $\tau_{fly}$                                0.5860   (0.141)
    $\tau_{ground}$                             0.3890   (0.124)
    $\theta_{air}$       0.2485   (0.369)                                0.4959   (0.124)
    $\theta_{train}$     0.2595   (0.418)                                1.0000   (0.000)
    $\theta_{bus}$       0.6065   (1.040)                                1.0000   (0.000)
    $\theta_{car}$       1.0000   (0.000)                                1.0000   (0.000)

    Implied Standard Deviations
    $\sigma_{air}$       5.161    (7.667)
    $\sigma_{train}$     4.942    (7.978)
    $\sigma_{bus}$       2.115    (3.623)
    $\sigma_{car}$       1.283    (0.000)
    ln L                 -195.6605              -193.6561                -200.3791

    The likelihood ratio statistic for the nesting (heteroscedasticity) against the null hypothesis of homoscedasticity is $-2[-199.1284 - (-193.6561)] = 10.945$. The 95 percent critical value from the chi-squared distribution with two degrees of freedom is 5.99, so the hypothesis is rejected. We can also carry out a Wald test. The asymptotic covariance matrix for the two inclusive value parameters is

    $$\begin{bmatrix} 0.01977 & \\ 0.009621 & 0.01529 \end{bmatrix}.$$

    The Wald statistic for the joint test of the hypothesis that $\tau_{fly} = \tau_{ground} = 1$ is

    $$W = (0.586 - 1.0 \quad 0.389 - 1.0) \begin{bmatrix} 0.01977 & 0.009621 \\ 0.009621 & 0.01529 \end{bmatrix}^{-1} \begin{pmatrix} 0.586 - 1.0 \\ 0.389 - 1.0 \end{pmatrix} = 24.475.$$
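    As a check on the arithmetic, the two statistics can be recomputed from the reported values. This is only a sketch using the numbers quoted in the text; the variable names are illustrative, and the closed form used for the critical value holds only for two degrees of freedom.

```python
import numpy as np
from math import log

# Likelihood ratio statistic from the two reported log-likelihoods.
lnL_restricted = -199.1284     # conditional logit (both tau fixed at 1)
lnL_nested = -193.6561         # nested logit, FIML
LR = -2.0 * (lnL_restricted - lnL_nested)

# With 2 degrees of freedom the chi-squared CDF is 1 - exp(-x/2),
# so the 95 percent critical value is -2 ln(0.05).
critical_95 = -2.0 * log(0.05)

# Wald statistic for H0: tau_fly = tau_ground = 1.
tau = np.array([0.5860, 0.3890])
V = np.array([[0.01977, 0.009621],
              [0.009621, 0.01529]])
d = tau - 1.0
W = float(d @ np.linalg.solve(V, d))
```

    Both statistics exceed the critical value, in line with the rejection reported above; small differences in the last digit of the Wald statistic come from rounding of the inputs.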

    The hypothesis is rejected, once again.

    The nested logit model was reestimated under assumptions of the heteroscedastic extreme value model. The results are shown in Table 21.15. This model is less restrictive than the nested logit model. To make them comparable, we note that we found that $\sigma_{air} = \pi/(\tau_{fly}\sqrt{6}) = 2.1886$ and $\sigma_{train} = \sigma_{bus} = \sigma_{car} = \pi/(\tau_{ground}\sqrt{6}) = 3.2974$. The heteroscedastic extreme value (HEV) model thus relaxes one variance restriction, because it has three free variance parameters instead of two. On the other hand, the important degree of freedom here is that the HEV model does not impose the IIA assumption anywhere in the choice set, whereas the nested logit does, within each branch.

    A primary virtue of the HEV model, the nested logit model, and other alternative models is that they relax the IIA assumption. This assumption has implications for the cross elasticities between attributes in the different probabilities. Table 21.16 lists the estimated elasticities of the estimated probabilities with respect to changes in the generalized cost variable. Elasticities are computed by averaging the individual sample values rather than computing them once at the sample means. The implication of the IIA


    TABLE 21.16  Estimated Elasticities with Respect to Generalized Cost

                                       Cost Is That of Alternative
    Effect on                          Air      Train    Bus      Car
    Multinomial Logit
      Air                              1.136    0.498    0.238    0.418
      Train                            0.456    1.520    0.238    0.418
      Bus                              0.456    0.498    1.549    0.418
      Car                              0.456    0.498    0.238    1.061
    Nested Logit
      Air                              0.858    0.332    0.179    0.308
      Train                            0.314    4.075    0.887    1.657
      Bus                              0.314    1.595    4.132    1.657
      Car                              0.314    1.595    0.887    2.498
    Heteroscedastic Extreme Value
      Air                              1.040    0.367    0.221    0.441
      Train                            0.272    1.495    0.250    0.553
      Bus                              0.688    0.858    6.562    3.384
      Car                              0.690    0.930    1.254    2.717

    assumption can be seen in the table entries. Thus, in the estimates for the multinomial logit (MNL) model, the cross elasticities for each attribute are all equal. In the nested logit model, the IIA property only holds within the branch. Thus, in the first column, the effect of GC of air affects all ground modes equally, whereas the effect of GC for train is the same for bus and car but different from these two for air. All these elasticities vary freely in the HEV model.

    Table 21.17 lists the estimates of the parameters of the multinomial probit and random parameters logit models. For the multinomial probit model, we fit three specifications: (1) free correlations among the choices, which implies an unrestricted 3 × 3 correlation matrix and two free standard deviations; (2) uncorrelated disturbances, but free standard deviations, a model that parallels the heteroscedastic extreme value model; and (3) uncorrelated disturbances and equal standard deviations, a model that is the same as the original conditional logit model save for the normal distribution of the disturbances instead of the extreme value assumed in the logit model. In this case, the scaling of the utility functions is different by a factor of $(\pi^2/6)^{1/2} = 1.283$, as the probit model assumes $\varepsilon_j$ has a standard deviation of 1.0.

    We also fit three variants of the random parameters logit. In these cases, the choice-specific variance for each utility function is $\sigma_j^2 + \theta_j^2$, where $\sigma_j^2$ is the contribution of the logit model, which is $\pi^2/6 = 1.645$, and $\theta_j^2$ is the estimated constant-specific variance estimated in the random parameters model. The combined estimated standard deviations are given in the table. The estimates of the specific parameters, $\theta_j$, are given in the footnotes. The estimated models are: (1) unrestricted variation and correlation among the three intercept parameters, which parallels the general specification of the multinomial probit model; (2) only the constant terms randomly distributed but uncorrelated, a model that is parallel to the multinomial probit model with no cross-equation correlation and to the heteroscedastic extreme value model shown in Table 21.15;


    TABLE 21.17  Parameter Estimates for Normal-Based Multinomial Choice Models

                                     Multinomial Probit                          Random Parameters Logit
    Parameter               Unrestricted  Homoscedastic  Uncorrelated    Unrestricted  Constants   Uncorrelated
    $\alpha_{air}$          1.358         3.005          3.171           5.519         4.807       12.603
    $\sigma_{air}$          4.940         1.000a         3.629           4.009d        3.225b      2.803c
    $\alpha_{train}$        4.298         2.409          4.277           5.776         5.035       13.504
    $\sigma_{train}$        1.899         1.000a         1.581           1.904         1.290b      1.373
    $\alpha_{bus}$          3.609         1.834          3.533           4.813         4.062       11.962
    $\sigma_{bus}$          1.000a        1.000a         1.000a          1.424         3.147b      1.287
    $\alpha_{car}$          0.000a        0.000a         0.000a          0.000a        0.000a      0.000
    $\sigma_{car}$          1.000a        1.000          1.000a          1.283a        1.283a      1.283a
    $\beta_G$               0.0351        0.0113         0.0325          0.0326        0.0317      0.0544
    $\sigma_{\beta G}$                                                   0.000a        0.000a      0.00561
    $\beta_T$               0.0769        0.0563         0.0918          0.126         0.112       0.2822
    $\sigma_{\beta T}$                                                   0.000a        0.000a      0.182
    $\gamma_H$              0.0593        0.0126         0.0370          0.0334        0.0319      0.0846
    $\sigma_{\gamma}$                                                    0.000a        0.000a      0.0768
    $\rho_{AT}$             0.581         0.000a         0.000a          0.543         0.000a      0.000a
    $\rho_{AB}$             0.576         0.000a         0.000a          0.532         0.000a      0.000a
    $\rho_{BT}$             0.718         0.000a         0.000a          0.993         0.000a      0.000a
    log L                   -196.9244     -208.9181      -199.7623       -193.7160     -199.0073   -175.5333

    a Restricted to this fixed value.
    b Computed as the square root of $(\pi^2/6 + \theta_j^2)$, $\theta_{air} = 2.959$, $\theta_{train} = 0.136$, $\theta_{bus} = 0.183$, $\theta_{car} = 0.000$.
    c $\theta_{air} = 2.492$, $\theta_{train} = 0.489$, $\theta_{bus} = 0.108$, $\theta_{car} = 0.000$.
    d Derived standard deviations for the random constants are $\theta_{air} = 3.798$, $\theta_{train} = 1.182$, $\theta_{bus} = 0.0712$, $\theta_{car} = 0.000$.

    (3) random but uncorrelated parameters. This model is more general than the others, but is somewhat restricted as the parameters are assumed to be uncorrelated. Identification of the correlation model is weak in this model; after all, we are attempting to estimate a 6 × 6 correlation matrix for all unobserved variables. Only the estimated parameters are shown in Table 21.17. Estimated standard errors are similar to (although generally somewhat larger than) those for the basic multinomial logit model.

    The standard deviations and correlations shown for the multinomial probit model are parameters of the distribution of $\varepsilon_{ij}$, the overall randomness in the model. The counterparts in the random parameters model apply to the distributions of the parameters. Thus, the full disturbance in the model in which only the constants are random is $\varepsilon_{i,air} + u_{air}$ for air, and likewise for train and bus. Likewise, the correlations shown for the first two models are directly comparable, though it should be noted that in the random parameters model, the disturbances have a distribution that is that of a sum of an extreme value and a normal variable, while in the probit model, the disturbances are normally distributed. With these considerations, the "unrestricted" models in each case are comparable and are, in fact, fairly similar.

    None of this discussion suggests a preference for one model or the other. The likelihood values are not comparable, so a direct test is precluded. Both relax the IIA assumption, which is a crucial consideration. The random parameters model enjoys a significant practical advantage, as discussed earlier, and also allows a much richer specification of the utility function itself. But the question still warrants additional study. Both models are making their way into the applied literature.


    21.8 ORDERED DATA

    Some multinomial-choice variables are inherently ordered. Examples that have appeared in the literature include the following:

    1. Bond ratings
    2. Results of taste tests
    3. Opinion surveys
    4. The assignment of military personnel to job classifications by skill level
    5. Voting outcomes on certain programs
    6. The level of insurance coverage taken by a consumer: none, part, or full
    7. Employment: unemployed, part time, or full time

    In each of these cases, although the outcome is discrete, the multinomial logit or probit model would fail to account for the ordinal nature of the dependent variable.$^{63}$ Ordinary regression analysis would err in the opposite direction, however. Take the outcome of an opinion survey. If the responses are coded 0, 1, 2, 3, or 4, then linear regression would treat the difference between a 4 and a 3 the same as that between a 3 and a 2, whereas in fact they are only a ranking.

    The ordered probit and logit models have come into fairly wide use as a framework for analyzing such responses (Zavoina and McElvey, 1975). The model is built around a latent regression in the same manner as the binomial probit model. We begin with

    $$y^* = \mathbf{x}'\boldsymbol\beta + \varepsilon.$$

    As usual, $y^*$ is unobserved. What we do observe is

    $$y = 0 \text{ if } y^* \le 0,$$
    $$y = 1 \text{ if } 0 < y^* \le \mu_1,$$
    $$y = 2 \text{ if } \mu_1 < y^* \le \mu_2,$$
    $$\vdots$$
    $$y = J \text{ if } \mu_{J-1} \le y^*,$$

    which is a form of censoring. The $\mu$s are unknown parameters to be estimated with $\boldsymbol\beta$. Consider, for example, an opinion survey. The respondents have their own intensity of feelings, which depends on certain measurable factors $\mathbf{x}$ and certain unobservable factors $\varepsilon$. In principle, they could respond to the questionnaire with their own $y^*$ if asked to do so. Given only, say, five possible answers, they choose the cell that most closely represents their own feelings on the question.

    63 In two papers, Beggs, Cardell, and Hausman (1981) and Hausman and Ruud (1986), the authors analyze a richer specification of the logit model when respondents provide their rankings of the full set of alternatives in addition to the identity of the most preferred choice. This application falls somewhere between the conditional logit model and the ones we shall discuss here in that, rather than provide a single choice among J either unordered or ordered alternatives, the consumer chooses one of the J! possible orderings of the set of unordered alternatives.


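    FIGURE 21.4  Probabilities in the Ordered Probit Model. (Density $f(\varepsilon)$, with the cells $y = 0, 1, 2, 3, 4$ separated at $-\boldsymbol\beta'\mathbf{x}$, $\mu_1 - \boldsymbol\beta'\mathbf{x}$, $\mu_2 - \boldsymbol\beta'\mathbf{x}$, and $\mu_3 - \boldsymbol\beta'\mathbf{x}$.)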

    As before, we assume that $\varepsilon$ is normally distributed across observations.$^{64}$ For the same reasons as in the binomial probit model (which is the special case of $J = 1$), we normalize the mean and variance of $\varepsilon$ to zero and one. We then have the following probabilities:

    $$\text{Prob}(y = 0 \mid \mathbf{x}) = \Phi(-\mathbf{x}'\boldsymbol\beta),$$
    $$\text{Prob}(y = 1 \mid \mathbf{x}) = \Phi(\mu_1 - \mathbf{x}'\boldsymbol\beta) - \Phi(-\mathbf{x}'\boldsymbol\beta),$$
    $$\text{Prob}(y = 2 \mid \mathbf{x}) = \Phi(\mu_2 - \mathbf{x}'\boldsymbol\beta) - \Phi(\mu_1 - \mathbf{x}'\boldsymbol\beta),$$
    $$\vdots$$
    $$\text{Prob}(y = J \mid \mathbf{x}) = 1 - \Phi(\mu_{J-1} - \mathbf{x}'\boldsymbol\beta).$$

    For all the probabilities to be positive, we must have

    $$0 < \mu_1 < \mu_2 < \cdots < \mu_{J-1}.$$

    Figure 21.4 shows the implications of the structure. This is an extension of the univariate probit model we examined earlier. The log-likelihood function and its derivatives can be obtained readily, and optimization can be done by the usual means.

    64 Other distributions, particularly the logistic, could be used just as easily. We assume the normal purely for convenience. The logistic and normal distributions generally give similar results in practice.

    As usual, the marginal effects of the regressors $\mathbf{x}$ on the probabilities are not equal to the coefficients. It is helpful to consider a simple example. Suppose there are three categories. The model thus has only one unknown threshold parameter. The three


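    probabilities are

    $$\text{Prob}(y = 0 \mid \mathbf{x}) = 1 - \Phi(\mathbf{x}'\boldsymbol\beta),$$
    $$\text{Prob}(y = 1 \mid \mathbf{x}) = \Phi(\mu - \mathbf{x}'\boldsymbol\beta) - \Phi(-\mathbf{x}'\boldsymbol\beta),$$
    $$\text{Prob}(y = 2 \mid \mathbf{x}) = 1 - \Phi(\mu - \mathbf{x}'\boldsymbol\beta).$$

    For the three probabilities, the marginal effects of changes in the regressors are

    $$\frac{\partial \text{Prob}(y = 0 \mid \mathbf{x})}{\partial \mathbf{x}} = -\phi(\mathbf{x}'\boldsymbol\beta)\boldsymbol\beta,$$
    $$\frac{\partial \text{Prob}(y = 1 \mid \mathbf{x})}{\partial \mathbf{x}} = [\phi(-\mathbf{x}'\boldsymbol\beta) - \phi(\mu - \mathbf{x}'\boldsymbol\beta)]\boldsymbol\beta,$$
    $$\frac{\partial \text{Prob}(y = 2 \mid \mathbf{x})}{\partial \mathbf{x}} = \phi(\mu - \mathbf{x}'\boldsymbol\beta)\boldsymbol\beta.$$

    Figure 21.5 illustrates the effect. The probability distributions of $y$ and $y^*$ are shown in the solid curve. Increasing one of the $x$s while holding $\boldsymbol\beta$ and $\mu$ constant is equivalent to shifting the distribution slightly to the right, which is shown as the dashed curve. The effect of the shift is unambiguously to shift some mass out of the leftmost cell. Assuming that $\beta$ is positive (for this $x$), $\text{Prob}(y = 0 \mid \mathbf{x})$ must decline. Alternatively, from the previous expression, it is obvious that the derivative of $\text{Prob}(y = 0 \mid \mathbf{x})$ has the opposite sign from $\beta$. By a similar logic, the change in $\text{Prob}(y = 2 \mid \mathbf{x})$ [or $\text{Prob}(y = J \mid \mathbf{x})$ in the general case] must have the same sign as $\beta$. Assuming that the particular $\beta$ is positive, we are shifting some probability into the rightmost cell. But what happens to the middle cell is ambiguous. It depends on the two densities. In the general case, relative to the signs of the coefficients, only the signs of the changes in $\text{Prob}(y = 0 \mid \mathbf{x})$ and $\text{Prob}(y = J \mid \mathbf{x})$ are unambiguous! The upshot is that we must be very careful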

    FIGURE 21.5  Effects of Change in x on Predicted Probabilities. (Solid and dashed density curves over the cells $y = 0, 1, 2$.)


    TABLE 21.18  Estimated Rating Assignment Equation

    Variable     Estimate    t Ratio    Mean of Variable
    Constant     -4.34
    ENSPA         0.057       1.7        0.66
    EDMA          0.007       0.8       12.1
    AFQT          0.039      39.9       71.2
    EDYRS         0.190       8.7       12.1
    MARR         -0.48       -9.0        0.08
    AGEAT         0.0015      0.1       18.8
    $\mu$         1.79       80.8

    in interpreting the coefficients in this model. Indeed, without a fair amount of extra calculation, it is quite unclear how the coefficients in the ordered probit model should be interpreted.$^{65}$

    Example 21.11  Rating Assignments

    Marcus and Greene (1985) estimated an ordered probit model for the job assignments of new Navy recruits. The Navy attempts to direct recruits into job classifications in which they will be most productive. The broad classifications the authors analyzed were technical jobs with three clearly ranked skill ratings: "medium skilled," "highly skilled," and "nuclear qualified/highly skilled." Since the assignment is partly based on the Navy's own assessment and needs and partly on factors specific to the individual, an ordered probit model was used with the following determinants: (1) ENSPE = a dummy variable indicating that the individual entered the Navy with an "A school" (technical training) guarantee, (2) EDMA = educational level of the entrant's mother, (3) AFQT = score on the Armed Forces Qualification Test, (4) EDYRS = years of education completed by the trainee, (5) MARR = a dummy variable indicating that the individual was married at the time of enlistment, and (6) AGEAT = trainee's age at the time of enlistment. The sample size was 5,641. The results are reported in Table 21.18. The extremely large t ratio on the AFQT score is to be expected, since it is a primary sorting device used to assign job classifications.

    To obtain the marginal effects of the continuous variables, we require the standard normal density evaluated at $-\hat{\boldsymbol\beta}'\bar{\mathbf{x}} = -0.8479$ and $\hat\mu - \hat{\boldsymbol\beta}'\bar{\mathbf{x}} = 0.9421$. The predicted probabilities are $\Phi(-0.8479) = 0.198$, $\Phi(0.9421) - \Phi(-0.8479) = 0.628$, and $1 - \Phi(0.9421) = 0.174$. (The actual frequencies were 0.25, 0.52, and 0.23.) The two densities are $\phi(-0.8479) = 0.278$ and $\phi(0.9421) = 0.255$. Therefore, the derivatives of the three probabilities with respect to AFQT, for example, are

    $$\frac{\partial P_0}{\partial \text{AFQT}} = (-0.278)0.039 = -0.01084,$$
    $$\frac{\partial P_1}{\partial \text{AFQT}} = (0.278 - 0.255)0.039 = 0.0009,$$
    $$\frac{\partial P_2}{\partial \text{AFQT}} = 0.255(0.039) = 0.00995.$$

    65 This point seems uniformly to be overlooked in the received literature. Authors often report coefficients and t ratios, occasionally with some commentary about significant effects, but rarely suggest upon what or in what direction those effects are exerted.
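    The calculations in Example 21.11 can be reproduced with a short sketch. It uses only the index value, threshold, and AFQT coefficient quoted in the example; the function names are illustrative, and the normal CDF is built from the standard library error function rather than from any statistics package.

```python
from math import erf, exp, pi, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def phi(z):
    """Standard normal density."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def ordered_probit_probs(xb, mus):
    """Cell probabilities for thresholds 0 < mu_1 < ... at index xb = beta'x."""
    cdf = [Phi(c - xb) for c in [0.0] + list(mus)]
    middle = [cdf[j] - cdf[j - 1] for j in range(1, len(cdf))]
    return [cdf[0]] + middle + [1.0 - cdf[-1]]

def marginal_effects(xb, mus, beta_k):
    """Derivatives of the cell probabilities with respect to one regressor."""
    dens = [phi(c - xb) for c in [0.0] + list(mus)]
    middle = [(dens[j - 1] - dens[j]) * beta_k for j in range(1, len(dens))]
    return [-dens[0] * beta_k] + middle + [dens[-1] * beta_k]

xb = 0.8479      # beta'x at the sample means, from the example
mu = 1.79        # threshold estimate, so mu - beta'x = 0.9421
probs = ordered_probit_probs(xb, [mu])
effects = marginal_effects(xb, [mu], 0.039)   # AFQT coefficient
```

    The results agree with the figures in the example up to rounding in the third decimal place, and the three marginal effects sum to zero.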


    TABLE 21.19  Marginal Effect of a Binary Variable

                $-\hat{\boldsymbol\beta}'\mathbf{x}$   $\hat\mu - \hat{\boldsymbol\beta}'\mathbf{x}$   Prob[y = 0]   Prob[y = 1]   Prob[y = 2]
    MARR = 0    -0.8863                                0.9037                                          0.187         0.629         0.184
    MARR = 1    -0.4063                                1.3837                                          0.342         0.574         0.084
    Change                                                                                             0.155         -0.055        -0.100

    Note that the marginal effects sum to zero, which follows from the requirement that the probabilities add to one. This approach is not appropriate for evaluating the effect of a dummy variable. We can analyze a dummy variable by comparing the probabilities that result when the variable takes its two different values with those that occur with the other variables held at their sample means. For example, for the MARR variable, we have the results given in Table 21.19.
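    The comparison for a dummy variable can be sketched the same way. The index values and threshold below are those shown in Table 21.19; the helper names are illustrative, and the MARR coefficient of -0.48 is taken as the difference between the two tabulated index values.

```python
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def cell_probs(xb, mu):
    """Three-category ordered probit probabilities at index xb = beta'x."""
    p0 = Phi(-xb)
    p2 = 1.0 - Phi(mu - xb)
    return [p0, 1.0 - p0 - p2, p2]

mu = 1.79
xb_marr0 = 0.8863              # beta'x with MARR = 0, other variables at means
xb_marr1 = xb_marr0 - 0.48     # MARR = 1 lowers the index by 0.48
p_marr0 = cell_probs(xb_marr0, mu)
p_marr1 = cell_probs(xb_marr1, mu)
change = [b - a for a, b in zip(p_marr0, p_marr1)]
```

    The changes again sum to zero, since both sets of probabilities add to one.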

    21.9 MODELS FOR COUNT DATA

    Data on patents suggested in Section 21.2 are typical of count data. In principle, we could analyze these data using multiple linear regression. But the preponderance of zeros and the small values and clearly discrete nature of the dependent variable suggest that we can improve on least squares and the linear model with a specification that accounts for these characteristics. The Poisson regression model has been widely used to study such data.$^{66}$

    The Poisson regression model specifies that each $y_i$ is drawn from a Poisson distribution with parameter $\lambda_i$, which is related to the regressors $\mathbf{x}_i$. The primary equation of the model is

    $$\text{Prob}(Y_i = y_i \mid \mathbf{x}_i) = \frac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}, \quad y_i = 0, 1, 2, \ldots.$$

    The most common formulation for $\lambda_i$ is the loglinear model,

    $$\ln \lambda_i = \mathbf{x}_i'\boldsymbol\beta.$$

    It is easily shown that the expected number of events per period is given by

    $$E[y_i \mid \mathbf{x}_i] = \text{Var}[y_i \mid \mathbf{x}_i] = \lambda_i = e^{\mathbf{x}_i'\boldsymbol\beta},$$

    so

    $$\frac{\partial E[y_i \mid \mathbf{x}_i]}{\partial \mathbf{x}_i} = \lambda_i \boldsymbol\beta.$$

    With the parameter estimates in hand, this vector can be computed using any data vector desired. In principle, the Poisson model is simply a nonlinear regression.$^{67}$ But it is far easier to estimate the parameters with maximum likelihood techniques.

    66 There are several recent surveys of specification and estimation of models for counts. Among them are Cameron and Trivedi (1998), Greene (1996a), Winkelmann (2000), and Wooldridge (1997).
    67 We have estimated a Poisson regression model using two-step nonlinear least squares in Example 17.9.

    The log-likelihood function is

    $$\ln L = \sum_{i=1}^{n} [-\lambda_i + y_i \mathbf{x}_i'\boldsymbol\beta - \ln y_i!].$$

    The likelihood equations are

    $$\frac{\partial \ln L}{\partial \boldsymbol\beta} = \sum_{i=1}^{n} (y_i - \lambda_i)\mathbf{x}_i = \mathbf{0}.$$

    The Hessian is

    $$\frac{\partial^2 \ln L}{\partial \boldsymbol\beta\, \partial \boldsymbol\beta'} = -\sum_{i=1}^{n} \lambda_i \mathbf{x}_i \mathbf{x}_i'.$$

    The Hessian is negative definite for all $\mathbf{x}$ and $\boldsymbol\beta$. Newton's method is a simple algorithm for this model and will usually converge rapidly. At convergence, $[\sum_{i=1}^{n} \hat\lambda_i \mathbf{x}_i \mathbf{x}_i']^{-1}$ provides an estimator of the asymptotic covariance matrix for the parameter estimates. Given the estimates, the prediction for observation $i$ is $\hat\lambda_i = \exp(\mathbf{x}_i'\hat{\boldsymbol\beta})$. A standard error for the prediction interval can be formed by using a linear Taylor series approximation. The estimated variance of the prediction will be $\hat\lambda_i^2 \mathbf{x}_i'\mathbf{V}\mathbf{x}_i$, where $\mathbf{V}$ is the estimated asymptotic covariance matrix for $\hat{\boldsymbol\beta}$.

    For testing hypotheses, the three standard tests are very convenient in this model. The Wald statistic is computed as usual. As in any discrete choice model, the likelihood ratio test has the intuitive form
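    $$LR = 2\sum_{i=1}^{n} \ln\left(\frac{\hat{P}_i}{\hat{P}_{\text{restricted},i}}\right),$$

    where the probabilities in the denominator are computed using the restricted model. Using the BHHH estimator for the asymptotic covariance matrix, the LM statistic is simply

    $$LM = \left[\sum_{i=1}^{n} \mathbf{x}_i(y_i - \hat\lambda_i)\right]' \left[\sum_{i=1}^{n} \mathbf{x}_i\mathbf{x}_i'(y_i - \hat\lambda_i)^2\right]^{-1} \left[\sum_{i=1}^{n} \mathbf{x}_i(y_i - \hat\lambda_i)\right] = \mathbf{i}'\mathbf{G}(\mathbf{G}'\mathbf{G})^{-1}\mathbf{G}'\mathbf{i},$$

    where each row of $\mathbf{G}$ is simply the corresponding row of $\mathbf{X}$ multiplied by $e_i = (y_i - \hat\lambda_i)$, $\hat\lambda_i$ is computed using the restricted coefficient vector, and $\mathbf{i}$ is a column of ones.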
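    Newton's method for the Poisson log-likelihood, together with the covariance estimator and the delta-method prediction variance described above, can be sketched as follows. This is a minimal illustration on simulated data, assuming a zero starting vector and a simple step-size stopping rule; the function names and the simulated design are hypothetical.

```python
import numpy as np

def poisson_newton(X, y, tol=1e-10, max_iter=50):
    """Newton's method for the Poisson MLE with ln(lambda_i) = x_i'beta."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        lam = np.exp(X @ beta)
        grad = X.T @ (y - lam)                   # sum (y_i - lambda_i) x_i
        hess = -(X * lam[:, None]).T @ X         # -sum lambda_i x_i x_i'
        step = np.linalg.solve(hess, grad)
        beta = beta - step                       # Newton update
        if np.max(np.abs(step)) < tol:
            break
    lam = np.exp(X @ beta)
    V = np.linalg.inv((X * lam[:, None]).T @ X)  # estimated asymptotic covariance
    return beta, V

def prediction_variance(x, beta, V):
    """Delta-method variance of exp(x'beta): lambda^2 * x'Vx."""
    lam = np.exp(x @ beta)
    return lam**2 * (x @ V @ x)

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ beta_true)).astype(float)
beta_hat, V = poisson_newton(X, y)
```

    Because the Hessian is negative definite everywhere, the iteration needs no line search in well-scaled problems such as this one; badly scaled regressors may still call for a better starting value.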
    21.9.1 MEASURING GOODNESS OF FIT

    The Poisson model produces no natural counterpart to the $R^2$ in a linear regression model, as usual, because the conditional mean function is nonlinear and, moreover, because the regression is heteroscedastic. But many alternatives have been suggested.$^{68}$

    68 See the surveys by Cameron and Windmeijer (1993), Gurmu and Trivedi (1994), and Greene (1995b).


    A measure based on the standardized residuals is

    $$R_p^2 = 1 - \frac{\sum_{i=1}^{n} \left[\dfrac{y_i - \hat\lambda_i}{\sqrt{\hat\lambda_i}}\right]^2}{\sum_{i=1}^{n} \left[\dfrac{y_i - \bar{y}}{\sqrt{\bar{y}}}\right]^2}.$$

    This measure has the virtue that it compares the fit of the model with that provided by a model with only a constant term. But it can be negative, and it can rise when a variable is dropped from the model. For an individual observation, the deviance is

    $$d_i = 2[y_i \ln(y_i/\hat\lambda_i) - (y_i - \hat\lambda_i)] = 2[y_i \ln(y_i/\hat\lambda_i) - e_i],$$

    where, by convention, $0 \ln(0) = 0$. If the model contains a constant term, then $\sum_{i=1}^{n} e_i = 0$. The sum of the deviances,

    $$G^2 = \sum_{i=1}^{n} d_i = 2\sum_{i=1}^{n} y_i \ln(y_i/\hat\lambda_i),$$

    is reported as an alternative fit measure by some computer programs. This statistic will equal 0.0 for a model that produces a perfect fit. (Note that since $y_i$ is an integer while the prediction is continuous, it could not happen.) Cameron and Windmeijer (1993) suggest that the fit measure based on the deviances,

    $$R_d^2 = 1 - \frac{\sum_{i=1}^{n} \left[y_i \log\left(\dfrac{y_i}{\hat\lambda_i}\right) - (y_i - \hat\lambda_i)\right]}{\sum_{i=1}^{n} \left[y_i \log\left(\dfrac{y_i}{\bar{y}}\right)\right]},$$

    has a number of desirable properties. First, denote the log-likelihood function for the model in which $\psi_i$ is used as the prediction (e.g., the mean) of $y_i$ as $\ell(\psi_i, y_i)$. The Poisson model fit by MLE is then $\ell(\hat\lambda_i, y_i)$, the model with only a constant term is $\ell(\bar{y}, y_i)$, and a model that achieves a perfect fit (by predicting $y_i$ with itself) is $\ell(y_i, y_i)$. Then

    $$R_d^2 = \frac{\ell(\hat\lambda_i, y_i) - \ell(\bar{y}, y_i)}{\ell(y_i, y_i) - \ell(\bar{y}, y_i)}.$$

    Both numerator and denominator measure the improvement of the model over one with only a constant term. The denominator measures the maximum improvement, since one cannot improve on a perfect fit. Hence, the measure is bounded by zero and one and increases as regressors are added to the model.$^{69}$ We note, finally, the passing resemblance of $R_d^2$ to the "pseudo-$R^2$," or "likelihood ratio index," reported by some statistical packages (e.g., Stata),

    $$R_{LRI}^2 = 1 - \frac{\ell(\hat\lambda_i, y_i)}{\ell(\bar{y}, y_i)}.$$

    69 Note that multiplying both numerator and denominator by 2 produces the ratio of two likelihood ratio statistics, each of which is distributed as chi-squared.
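    The fit measures of this section can be collected in one small routine. This is a sketch under stated assumptions: the function names are hypothetical, the fitted means `lam` are supplied by the caller, and the example data are made up. The convention $0 \ln(0) = 0$ is handled explicitly.

```python
import numpy as np
from math import lgamma

def y_log_ratio(y, m):
    """y * ln(y / m), with the convention 0 * ln(0) = 0."""
    out = np.zeros_like(y, dtype=float)
    pos = y > 0
    out[pos] = y[pos] * np.log(y[pos] / m[pos])
    return out

def poisson_loglik(y, m):
    """Poisson log-likelihood l(m, y); m may be zero only where y is zero."""
    ylogm = np.zeros_like(y, dtype=float)
    pos = y > 0
    ylogm[pos] = y[pos] * np.log(m[pos])
    ln_fact = np.array([lgamma(v + 1.0) for v in y])
    return float(np.sum(-m + ylogm - ln_fact))

def fit_measures(y, lam):
    y = np.asarray(y, dtype=float)
    lam = np.asarray(lam, dtype=float)
    ybar = np.full_like(y, y.mean())
    rp2 = 1.0 - np.sum((y - lam) ** 2 / lam) / np.sum((y - ybar) ** 2 / ybar)
    g2 = 2.0 * np.sum(y_log_ratio(y, lam) - (y - lam))   # sum of deviances
    rd2 = 1.0 - np.sum(y_log_ratio(y, lam) - (y - lam)) / np.sum(y_log_ratio(y, ybar))
    l_fit = poisson_loglik(y, lam)
    l_const = poisson_loglik(y, ybar)
    l_perfect = poisson_loglik(y, y)
    rd2_from_loglik = (l_fit - l_const) / (l_perfect - l_const)
    lri = 1.0 - l_fit / l_const
    return {"Rp2": rp2, "G2": g2, "Rd2": rd2,
            "Rd2_loglik": rd2_from_loglik, "LRI": lri}

y = np.array([0.0, 1.0, 3.0, 2.0, 0.0, 5.0])       # made-up counts
lam = np.array([0.6, 1.2, 2.5, 2.1, 0.8, 3.8])     # made-up fitted means
measures = fit_measures(y, lam)
```

    The two expressions for $R_d^2$ coincide, and the deviance form of $G^2$ used here keeps the $(y_i - \hat\lambda_i)$ term, so it matches the shorter expression only when the residuals sum to zero.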


    Many modifications of the Poisson model have been analyzed by economists.$^{70}$ In this and the next few sections, we briefly examine a few of them.

    21.9.2 TESTING FOR OVERDISPERSION

    The Poisson model has been criticized because of its implicit assumption that the variance of \(y_i\) equals its mean. Many extensions of the Poisson model that relax this assumption have been proposed by Hausman, Hall, and Griliches (1984), McCullagh and Nelder (1983), and Cameron and Trivedi (1986), to name but a few.

    The first step in this extended analysis is usually a test for overdispersion in the context of the simple model. A number of authors have devised tests for "overdispersion" within the context of the Poisson model. [See Cameron and Trivedi (1990), Gurmu (1991), and Lee (1986).] We will consider three of the common tests, one based on a regression approach, one a conditional moment test, and a third, a Lagrange multiplier test, based on an alternative model. Conditional moment tests are developed in Section 17.6.4.

    Cameron and Trivedi (1990) offer several different tests for overdispersion. A simple regression based procedure used for testing the hypothesis

    \[ H_0: \operatorname{Var}[y_i] = E[y_i], \qquad H_1: \operatorname{Var}[y_i] = E[y_i] + \alpha g(E[y_i]) \]

    is carried out by regressing

    \[ z_i = \frac{(y_i - \hat\lambda_i)^2 - y_i}{\hat\lambda_i \sqrt{2}}, \]

    where \(\hat\lambda_i\) is the predicted value from the regression, on either a constant term or \(\hat\lambda_i\) without a constant term. A simple t test of whether the coefficient is significantly different from zero tests \(H_0\) versus \(H_1\).

    Cameron and Trivedi's regression based test for overdispersion is formulated around the alternative \(\operatorname{Var}[y_i] = E[y_i] + \alpha g(E[y_i])\). This is a very specific type of overdispersion. Consider the more general hypothesis that \(\operatorname{Var}[y_i]\) is completely given by \(E[y_i]\). The alternative is that the variance is systematically related to the regressors in a way that is not completely accounted for by \(E[y_i]\). Formally, we have \(E[y_i] = \exp(\boldsymbol\beta'\mathbf{x}_i) = \lambda_i\). The null hypothesis is that \(\operatorname{Var}[y_i] = \lambda_i\) as well. We can test the hypothesis using the conditional moment test described in Section 17.6.4. The expected first derivatives and the moment restriction are

    \[ E[\mathbf{x}_i(y_i - \lambda_i)] = \mathbf{0} \quad \text{and} \quad E\{\mathbf{z}_i[(y_i - \lambda_i)^2 - \lambda_i]\} = \mathbf{0}. \]

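    To carry out the test, we do the following. Let \(e_i = y_i - \hat\lambda_i\) and \(\mathbf{z}_i = \mathbf{x}_i\) without the constant term.

    1. Compute the Poisson regression by maximum likelihood.
    2. Compute \(\mathbf{r} = \sum_{i=1}^n \mathbf{z}_i[e_i^2 - \hat\lambda_i] = \sum_{i=1}^n \mathbf{z}_i v_i\) based on the maximum likelihood estimates.
    3. Compute \(\mathbf{M}'\mathbf{M} = \sum_{i=1}^n \mathbf{z}_i\mathbf{z}_i' v_i^2\), \(\mathbf{D}'\mathbf{D} = \sum_{i=1}^n \mathbf{x}_i\mathbf{x}_i' e_i^2\), and \(\mathbf{M}'\mathbf{D} = \sum_{i=1}^n \mathbf{z}_i\mathbf{x}_i' v_i e_i\).
    4. Compute \(\mathbf{S} = \mathbf{M}'\mathbf{M} - \mathbf{M}'\mathbf{D}(\mathbf{D}'\mathbf{D})^{-1}\mathbf{D}'\mathbf{M}\).
    5. \(C = \mathbf{r}'\mathbf{S}^{-1}\mathbf{r}\) is the chi-squared statistic. It has K degrees of freedom.

    70. There have been numerous surveys of models for count data, including Cameron and Trivedi (1986) and Gurmu and Trivedi (1994).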

    The next section presents the negative binomial model. This model relaxes the Poisson assumption that the mean equals the variance. The Poisson model is obtained as a parametric restriction on the negative binomial model, so a Lagrange multiplier test can be computed. In general, if an alternative distribution for which the Poisson model is obtained as a parametric restriction, such as the negative binomial model, can be specified, then a Lagrange multiplier statistic can be computed. [See Cameron and Trivedi (1986, p. 41).] The LM statistic is

    \[ LM = \left[ \frac{\sum_{i=1}^n \hat w_i[(y_i - \hat\lambda_i)^2 - y_i]}{\sqrt{2\sum_{i=1}^n \hat w_i \hat\lambda_i^2}} \right]^2. \]

    The weight, \(\hat w_i\), depends on the assumed alternative distribution. For the negative binomial model discussed later, \(\hat w_i\) equals 1.0. Thus, under this alternative, the statistic is particularly simple to compute:

    \[ LM = \frac{(\mathbf{e}'\mathbf{e} - n\bar y)^2}{2\hat{\boldsymbol\lambda}'\hat{\boldsymbol\lambda}}. \]

    The main advantage of this test statistic is that one need only estimate the Poisson model to compute it. Under the hypothesis of the Poisson model, the limiting distribution of the LM statistic is chi-squared with one degree of freedom.
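Both the regression based test and the LM test of this section need only the Poisson fitted values, so they are easy to sketch. The code below is a minimal illustration, not the authors' implementation: the function names and the data are hypothetical, the fitted means are taken as given, and the t ratio uses the ordinary least squares standard error with n - 1 degrees of freedom, which is an assumption about how the "simple t test" is computed.

```python
import math


def ct_regression(y, lam, on_constant=True):
    """Regression based overdispersion test.

    Regresses z_i = [(y_i - lam_i)^2 - y_i] / (lam_i * sqrt(2)) on a
    constant (on_constant=True) or on lam_i without a constant.
    Returns (coefficient, t ratio).
    """
    z = [((yi - li) ** 2 - yi) / (li * math.sqrt(2.0))
         for yi, li in zip(y, lam)]
    x = [1.0] * len(y) if on_constant else list(lam)
    sxx = sum(xi * xi for xi in x)
    b = sum(xi * zi for xi, zi in zip(x, z)) / sxx
    resid = [zi - b * xi for xi, zi in zip(x, z)]
    s2 = sum(r * r for r in resid) / (len(y) - 1)   # one regressor
    return b, b / math.sqrt(s2 / sxx)


def lm_negbin(y, lam):
    """LM = (e'e - n*ybar)^2 / (2 * lam'lam), the w_i = 1 case."""
    ee = sum((yi - li) ** 2 for yi, li in zip(y, lam))
    n_ybar = sum(y)                                  # n * ybar
    return (ee - n_ybar) ** 2 / (2.0 * sum(li * li for li in lam))


if __name__ == "__main__":
    counts = [0, 2, 1, 5, 3]                         # hypothetical data
    fitted = [1.0, 2.0, 1.5, 4.0, 2.5]               # hypothetical Poisson fit
    print(ct_regression(counts, fitted))
    print(lm_negbin(counts, fitted))                 # compare with chi-squared[1]
```

The LM value would be compared with the chi-squared critical value for one degree of freedom (3.84 at the 5 percent level).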
    21.9.3 HETEROGENEITY AND THE NEGATIVE BINOMIAL REGRESSION MODEL

    The assumed equality of the conditional mean and variance functions is typically taken to be the major shortcoming of the Poisson regression model. Many alternatives have been suggested [see Hausman, Hall, and Griliches (1984), Cameron and Trivedi (1986, 1998), Gurmu and Trivedi (1994), Johnson and Kotz (1993), and Winkelmann (1997) for discussion]. The most common is the negative binomial model, which arises from a natural formulation of cross-section heterogeneity. We generalize the Poisson model by introducing an individual, unobserved effect into the conditional mean,

    \[ \ln \mu_i = \mathbf{x}_i'\boldsymbol\beta + \varepsilon_i = \ln \lambda_i + \ln u_i, \]

    where the disturbance \(\varepsilon_i\) reflects either specification error as in the classical regression model or the kind of cross-sectional heterogeneity that normally characterizes microeconomic data. Then, the distribution of \(y_i\) conditioned on \(\mathbf{x}_i\) and \(u_i\) (i.e., \(\varepsilon_i\)) remains Poisson with conditional mean and variance \(\mu_i\):

    \[ f(y_i \mid \mathbf{x}_i, u_i) = \frac{e^{-\lambda_i u_i}(\lambda_i u_i)^{y_i}}{y_i!}. \]

    The unconditional distribution \(f(y_i \mid \mathbf{x}_i)\) is the expected value (over \(u_i\)) of \(f(y_i \mid \mathbf{x}_i, u_i)\),

    \[ f(y_i \mid \mathbf{x}_i) = \int_0^\infty \frac{e^{-\lambda_i u_i}(\lambda_i u_i)^{y_i}}{y_i!}\, g(u_i)\, du_i. \]


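    The choice of a density for \(u_i\) defines the unconditional distribution. For mathematical convenience, a gamma distribution is usually assumed for \(u_i = \exp(\varepsilon_i)\). [71] As in other models of heterogeneity, the mean of the distribution is unidentified if the model contains a constant term (because the disturbance enters multiplicatively), so \(E[\exp(\varepsilon_i)]\) is assumed to be 1.0. With this normalization,

    \[ g(u_i) = \frac{\theta^\theta}{\Gamma(\theta)}\, e^{-\theta u_i} u_i^{\theta - 1}. \]

    The density for \(y_i\) is then

    \[ f(y_i \mid \mathbf{x}_i) = \int_0^\infty \frac{e^{-\lambda_i u_i}(\lambda_i u_i)^{y_i}}{y_i!}\, \frac{\theta^\theta u_i^{\theta-1} e^{-\theta u_i}}{\Gamma(\theta)}\, du_i \]
    \[ = \frac{\theta^\theta \lambda_i^{y_i}}{\Gamma(y_i + 1)\Gamma(\theta)} \int_0^\infty e^{-(\lambda_i + \theta)u_i}\, u_i^{\theta + y_i - 1}\, du_i \]
    \[ = \frac{\theta^\theta \lambda_i^{y_i}\, \Gamma(\theta + y_i)}{\Gamma(y_i + 1)\Gamma(\theta)(\lambda_i + \theta)^{\theta + y_i}} \]
    \[ = \frac{\Gamma(\theta + y_i)}{\Gamma(y_i + 1)\Gamma(\theta)}\, r_i^{y_i}(1 - r_i)^\theta, \quad \text{where } r_i = \frac{\lambda_i}{\lambda_i + \theta}, \]

    which is one form of the negative binomial distribution. The distribution has conditional mean \(\lambda_i\) and conditional variance \(\lambda_i(1 + (1/\theta)\lambda_i)\). [This model is Negbin II in Cameron and Trivedi's (1986) presentation.] The negative binomial model can be estimated by maximum likelihood without much difficulty. A test of the Poisson distribution is often carried out by testing the hypothesis \(1/\theta = 0\) using the Wald or likelihood ratio test.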
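The closed form above is easy to check numerically. The following sketch evaluates the negative binomial probabilities on the log scale and sums them to recover the stated mean \(\lambda_i\) and variance \(\lambda_i(1 + \lambda_i/\theta)\). The function names and the parameter values are hypothetical, and the truncation point `ymax` is an arbitrary choice that is large enough for these values.

```python
import math


def negbin_logpmf(y, lam, theta):
    """ln f(y) = ln Gamma(theta+y) - ln Gamma(y+1) - ln Gamma(theta)
                 + y*ln(r) + theta*ln(1-r),  r = lam/(lam+theta)."""
    r = lam / (lam + theta)
    return (math.lgamma(theta + y) - math.lgamma(y + 1) - math.lgamma(theta)
            + y * math.log(r) + theta * math.log(1.0 - r))


def negbin_moments(lam, theta, ymax=500):
    """Total probability, mean, and variance from direct summation."""
    probs = [math.exp(negbin_logpmf(y, lam, theta)) for y in range(ymax + 1)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return total, mean, var


if __name__ == "__main__":
    lam, theta = 2.0, 1.5                      # hypothetical values
    print(negbin_moments(lam, theta))
    print(lam * (1.0 + lam / theta))           # variance implied by the formula
```

As \(\theta\) grows, the variance approaches \(\lambda_i\), which is the Poisson case.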
    21.9.4 APPLICATION: THE POISSON REGRESSION MODEL

    The number of accidents per service month for a sample of ship types is listed in Appendix Table F21.3. The ships are of five types constructed in one of four periods. The observation is over two periods. Since ships constructed from 1975 to 1979 could not have operated from 1960 to 1974, there is one missing observation in each group. The second observation for group E is also missing, for reasons unexplained by the authors. [72] The substantive variables in the model are number of accidents in the observation period and aggregate number of service months for the ship type by construction year for the period of operation. Estimates of the parameters of a Poisson regression model are shown in Table 21.20. The model is

    \[ \ln E[\text{accidents per month}] = \mathbf{x}'\boldsymbol\beta. \]

    71. An alternative approach based on the normal distribution is suggested in Terza (1998), Greene (1995a, 1997a), and Winkelmann (1997). The normal-Poisson mixture is also easily extended to the random effects model discussed in the next section. There is no closed form for the normal-Poisson mixture model, but it can be easily approximated by using Hermite quadrature.

    72. Data are from McCullagh and Nelder (1983). See Exercise 8 in Chapter 7 for details.


    TABLE 21.20  Estimated Poisson Regressions (Standard Errors in Parentheses)
    Mean Dependent Variable = 10.47

    Variable        Full Model           No Ship Type Effect   No Period Effect
    Constant        -6.4029 (0.2175)     -6.9470 (0.1269)      -5.7999 (0.1784)
    Type A
    Type B          -0.5447 (0.1776)                           -0.7437 (0.1692)
    Type C          -0.6888 (0.3290)                           -0.7549 (0.3276)
    Type D          -0.0743 (0.2906)                           -0.1843 (0.2876)
    Type E           0.3205 (0.2358)                            0.3842 (0.2348)
    60-64
    65-69            0.6959 (0.1497)      0.7536 (0.1488)
    70-74            0.8175 (0.1698)      1.0503 (0.1576)
    75-79            0.4450 (0.2332)      0.6999 (0.2203)
    Period 60-74
    Period 75-79     0.3839 (0.1183)      0.3875 (0.1181)       0.5001 (0.1116)
    Log service      1.0000               1.0000                1.0000
    Log L          -68.41455            -80.20123             -84.11514
    G2              38.96262             62.53596              70.34967
    R2_p             0.94560              0.89384               0.90001
    R2_d             0.93661              0.89822               0.88556

    The model therefore contains the ship type, construction period, and operation period effects, and the aggregate number of months with a coefficient of 1.0. [73] The model is shown in Table 21.20, with sets of estimates for the full model and with the model omitting the type and construction period effects. Predictions from the estimated full model are shown in the last column of Appendix Table F21.3. The hypothesis that the year of construction is not a significant factor in explaining the number of accidents is strongly rejected by the likelihood ratio test:

    \[ \chi^2 = 2[84.11514 - 68.41455] = 31.40118. \]

    The critical chi-squared value for three degrees of freedom is 7.82. The ship type effect is likewise significant,

    \[ \chi^2 = 2[80.20123 - 68.41455] = 23.57336, \]

    against a critical value for four degrees of freedom of 9.49. The LM tests for the two restrictions give the same conclusions, but much less strongly. The value is 28.526 for the ship type effect and 31.418 for the period effects.

    In their analysis of these data, McCullagh and Nelder assert, without evidence, that there is overdispersion in the data. Some of their analysis follows on an assumption that the standard deviation of \(y_i\) is 1.3 times the mean. The t statistics for the two regressions in Cameron and Trivedi's regression based tests are 0.934 and 0.613, respectively, so based on these tests, we do not reject \(H_0\): no overdispersion. The LM statistic for the same hypothesis is 0.75044 with one degree of freedom. The critical value from the table is 3.84, so again, the hypothesis of the Poisson model is not rejected. However, the conditional moment test is contradictory; \(C = \mathbf{r}'\mathbf{S}^{-1}\mathbf{r} = 26.555\). There are eight degrees of freedom. The 5 percent critical value from the chi-squared table is 15.507, so the hypothesis is now rejected. This test is much more general, since the form of overdispersion is not specified, which may explain the difference. Note that this result affirms McCullagh and Nelder's conjecture.

    73. When the length of the period of observation varies by observation by \(T_i\) and the model is of the rate of occurrence of events per unit of time, then the mean of the observed distribution is \(T_i\lambda_i\). This assumption produces the coefficient of 1.0 on the number of periods of service in the model.
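The likelihood ratio arithmetic in this application can be reproduced in a few lines. The sketch below uses the log likelihood magnitudes and critical values quoted in the text; the negative signs on the log likelihoods and the names (`lr_statistic`, `ship_lr_tests`) are assumptions made for the illustration.

```python
def lr_statistic(loglik_restricted, loglik_unrestricted):
    """Likelihood ratio statistic 2*(ln L_unrestricted - ln L_restricted)."""
    return 2.0 * (loglik_unrestricted - loglik_restricted)


def ship_lr_tests():
    """LR tests for the ship accident application.

    Log likelihood values are those reported in Table 21.20, entered
    as negative numbers (an assumption, since a count-data log
    likelihood cannot be positive).
    """
    full = -68.41455
    restricted = {
        # name: (log L of restricted model, 5 percent critical value)
        "construction period": (-84.11514, 7.82),   # 3 degrees of freedom
        "ship type": (-80.20123, 9.49),             # 4 degrees of freedom
    }
    results = {}
    for name, (loglik_r, critical) in restricted.items():
        lr = lr_statistic(loglik_r, full)
        results[name] = (lr, lr > critical)         # (statistic, reject?)
    return results


if __name__ == "__main__":
    for name, (lr, reject) in ship_lr_tests().items():
        print(f"{name}: LR = {lr:.5f}, reject H0 = {reject}")
```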
    21.9.5 POISSON MODELS FOR PANEL DATA

    The familiar approaches to accommodating heterogeneity in panel data have fairly straightforward extensions in the count data setting. [Hausman, Hall, and Griliches (1984) give full details for these models.] We will examine them for the Poisson model. The authors [and Allison (2000)] also give results for the negative binomial model.

    Consider first a fixed effects approach. The Poisson distribution is assumed to have conditional mean

    \[ \log \lambda_{it} = \boldsymbol\beta'\mathbf{x}_{it} + \alpha_i, \]

    where now, \(\mathbf{x}_{it}\) has been redefined to exclude the constant term. The approach used in the linear model of transforming \(y_{it}\) to group mean deviations does not remove the heterogeneity, nor does it leave a Poisson distribution for the transformed variable. However, the Poisson model with fixed effects can be fit using the methods described for the probit model in Section 21.5.1b. The extension to the Poisson model requires only the minor modifications \(g_{it} = (y_{it} - \lambda_{it})\) and \(h_{it} = -\lambda_{it}\). Everything else in that derivation applies with only a simple change in the notation. The first order conditions for maximizing the log-likelihood function for the Poisson model will include

    \[ \frac{\partial \ln L}{\partial \alpha_i} = \sum_{t=1}^{T} (y_{it} - e^{\alpha_i}\mu_{it}) = 0, \quad \text{where } \mu_{it} = e^{\mathbf{x}_{it}'\boldsymbol\beta}. \]

    This implies an explicit solution for \(\alpha_i\) in terms of \(\boldsymbol\beta\) in this model,

    \[ \hat\alpha_i = \ln\left( \frac{(1/T)\sum_{t=1}^{T} y_{it}}{(1/T)\sum_{t=1}^{T} \hat\mu_{it}} \right) = \ln\left( \frac{\bar y_i}{\bar{\hat\mu}_i} \right). \]
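The explicit solution for \(\alpha_i\) given \(\boldsymbol\beta\) can be sketched as follows. This is a minimal illustration for a single group; the function name `fixed_effect`, the data, and the coefficient value are hypothetical.

```python
import math


def fixed_effect(y_i, x_i, beta):
    """alpha_i = ln( sum_t y_it / sum_t exp(x_it'beta) ) for one group.

    y_i  : list of counts for group i
    x_i  : list of regressor rows (no constant term)
    beta : slope vector
    """
    mu = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in x_i]
    total_y = sum(y_i)
    if total_y == 0:
        # ln(0) is not finite: at least one nonzero observation is needed.
        raise ValueError("alpha_i is not finite when every y_it is zero")
    return math.log(total_y / sum(mu))


if __name__ == "__main__":
    y = [2, 0, 3]                              # hypothetical counts
    x = [[0.5], [1.0], [-0.2]]                 # hypothetical regressor
    beta = [0.3]                               # hypothetical slope
    alpha = fixed_effect(y, x, beta)
    mu = [math.exp(beta[0] * row[0]) for row in x]
    # First order condition: sum_t (y_it - exp(alpha) * mu_it) = 0
    print(alpha, sum(y) - math.exp(alpha) * sum(mu))
```

The solution depends on the data only through the group totals, so it exists even when all of the \(y_{it}\) in a group are equal, provided they are not all zero.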

    Unlike the regression or the probit model, this does not require that there be within group variation in \(y_{it}\); all the values can be the same. It does require that at least one observation for individual i be nonzero, however. The rest of the solution for the fixed effects estimator follows the same lines as that for the probit model. An alternative approach, albeit with little practical gain, would be to concentrate the log likelihood function by inserting this solution for \(\alpha_i\) back into the original log likelihood, then maximizing the resulting function of \(\boldsymbol\beta\). While logically this makes sense, the approach suggested earlier for the probit model is simpler to implement.

    An estimator that is not a function of the fixed effects is found by obtaining the joint distribution of \((y_{i1}, \ldots, y_{iT_i})\) conditional on their sum. For the Poisson model, a close cousin to the logit model discussed earlier is produced:

    \[ p\left(y_{i1}, y_{i2}, \ldots, y_{iT_i} \,\Big|\, \sum_{t=1}^{T_i} y_{it}\right) = \frac{\left(\sum_{t=1}^{T_i} y_{it}\right)!}{\prod_{t=1}^{T_i} y_{it}!} \prod_{t=1}^{T_i} p_{it}^{y_{it}}, \]

    where

    \[ p_{it} = \frac{e^{\mathbf{x}_{it}'\boldsymbol\beta + \alpha_i}}{\sum_{t=1}^{T_i} e^{\mathbf{x}_{it}'\boldsymbol\beta + \alpha_i}} = \frac{e^{\mathbf{x}_{it}'\boldsymbol\beta}}{\sum_{t=1}^{T_i} e^{\mathbf{x}_{it}'\boldsymbol\beta}}. \]

    The contribution of group i to the conditional log-likelihood is

    \[ \ln L_i = \sum_{t=1}^{T_i} y_{it} \ln p_{it}. \]
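A short sketch of the group contribution to the conditional log-likelihood follows. It illustrates the two properties that matter here: the fixed effect cancels from \(p_{it}\), and a group with all zeros contributes nothing. The function name and the data are hypothetical, and the optional `alpha_i` argument exists only to demonstrate the cancellation.

```python
import math


def conditional_loglik_group(y_i, x_i, beta, alpha_i=0.0):
    """ln L_i = sum_t y_it * ln p_it, with
    p_it = exp(x_it'beta + alpha_i) / sum_s exp(x_is'beta + alpha_i).

    The multinomial coefficient is omitted because it does not
    depend on beta.
    """
    index = [sum(b * x for b, x in zip(beta, row)) + alpha_i for row in x_i]
    top = max(index)                            # guard against overflow
    log_denominator = top + math.log(sum(math.exp(v - top) for v in index))
    return sum(y * (v - log_denominator) for y, v in zip(y_i, index))


if __name__ == "__main__":
    y = [1, 0, 4]                               # hypothetical counts
    x = [[0.2, 1.0], [0.7, 0.0], [1.5, 1.0]]    # hypothetical regressors
    beta = [0.4, -0.3]                          # hypothetical slopes
    print(conditional_loglik_group(y, x, beta))
    print(conditional_loglik_group(y, x, beta, alpha_i=2.5))  # same value
```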

    Note, once again, that the contribution to ln L of a group in which \(y_{it} = 0\) in every period is zero. Cameron and Trivedi (1998) have shown that these two approaches give identical results.

    The fixed effects approach has the same flaws and virtues in this setting as in the probit case. It is not necessary to assume that the heterogeneity is uncorrelated with the included, exogenous variables. If the uncorrelatedness of the regressors and the heterogeneity can be maintained, then the random effects model is an attractive alternative model. Once again, the approach used in the linear regression model, partial deviations from the group means followed by generalized least squares (see Chapter 13), is not usable here. The approach used is to formulate the joint probability conditioned upon the heterogeneity, then integrate it out of the joint distribution. Thus, we form

    \[ p(y_{i1}, \ldots, y_{iT_i} \mid u_i) = \prod_{t=1}^{T_i} p(y_{it} \mid u_i). \]

    Then the random effect is swept out by obtaining

    \[ p(y_{i1}, \ldots, y_{iT_i}) = \int_{u_i} p(y_{i1}, \ldots, y_{iT_i}, u_i)\, du_i = \int_{u_i} p(y_{i1}, \ldots, y_{iT_i} \mid u_i)\, g(u_i)\, du_i = E_{u_i}[p(y_{i1}, \ldots, y_{iT_i} \mid u_i)]. \]

    This is exactly the approach used earlier to condition the heterogeneity out of the Poisson model to produce the negative binomial model. If, as before, we take \(p(y_{it} \mid u_i)\) to be Poisson with mean \(\lambda_{it} = \exp(\mathbf{x}_{it}'\boldsymbol\beta + u_i)\) in which \(\exp(u_i)\) is distributed as gamma with mean 1.0 and variance \(1/\theta\), then the preceding steps produce the negative binomial distribution,

    \[ p(y_{i1}, \ldots, y_{iT_i}) = \frac{\left[\prod_{t=1}^{T_i} \lambda_{it}^{y_{it}}\right] \Gamma\left(\theta + \sum_{t=1}^{T_i} y_{it}\right)}{\Gamma(\theta) \left[\prod_{t=1}^{T_i} y_{it}!\right] \left[\sum_{t=1}^{T_i} \lambda_{it}\right]^{\sum_{t=1}^{T_i} y_{it}}}\; Q_i^\theta (1 - Q_i)^{\sum_{t=1}^{T_i} y_{it}}, \]

    where

    \[ Q_i = \frac{\theta}{\theta + \sum_{t=1}^{T_i} \lambda_{it}}. \]
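The joint probability for one group under gamma heterogeneity can be evaluated on the log scale. In the sketch below, `lam` holds the Poisson means \(\exp(\mathbf{x}_{it}'\boldsymbol\beta)\) without the random effect and the gamma density has mean 1 and variance \(1/\theta\); the function name, that reading of the notation, and the numerical values are assumptions made for the illustration.

```python
import math


def log_joint_re_negbin(y, lam, theta):
    """Log of the joint probability of (y_1, ..., y_T) for one group
    after the gamma random effect has been integrated out.

    y     : counts for the group
    lam   : Poisson means exp(x_t'beta), excluding the random effect
    theta : gamma parameter (mean 1, variance 1/theta)
    """
    sum_y, sum_lam = sum(y), sum(lam)
    q = theta / (theta + sum_lam)
    out = sum(yi * math.log(li) - math.lgamma(yi + 1)
              for yi, li in zip(y, lam))
    out += math.lgamma(theta + sum_y) - math.lgamma(theta)
    out -= sum_y * math.log(sum_lam)
    out += theta * math.log(q) + sum_y * math.log(1.0 - q)
    return out


if __name__ == "__main__":
    lam = [0.8, 1.5]                           # hypothetical means, T = 2
    theta = 2.0                                # hypothetical gamma parameter
    # The probabilities over all (y_1, y_2) should sum to one.
    total = sum(math.exp(log_joint_re_negbin([a, b], lam, theta))
                for a in range(81) for b in range(81))
    print(total)
```

With a single period the expression collapses to the negative binomial density of Section 21.9.3, which gives a convenient check on the algebra.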



    For estimation purposes, we have a negative binomial distribution for \(Y_i = \sum_t y_{it}\) with mean \(\Lambda_i = \sum_t \lambda_{it}\).

    There is a mild preference in the received literature for the fixed effects estimators over the random effects estimators. The virtue of dispensing with the assumption of uncorrelatedness of the regressors and the group specific effects is substantial. On the other hand, the assumption does come at a cost. In order to compute the probabilities or the marginal effects, it is necessary to estimate the constants, \(\alpha_i\). The unscaled coefficients in these models are of limited usefulness because of the nonlinearity of the conditional mean functions.

    Other approaches to the random effects model have been proposed. Greene (1994, 1995a) and Terza (1995) specify a normally distributed heterogeneity, on the assumption that this is a more natural distribution for the aggregate of small independent effects. Brannas and Johanssen (1994) have suggested a semiparametric approach based on the GMM estimator by superimposing a very general form of heterogeneity on the Poisson model. They assume that conditioned on a random effect \(\varepsilon_{it}\), \(y_{it}\) is distributed as Poisson with mean \(\varepsilon_{it}\lambda_{it}\). The covariance structure of \(\varepsilon_{it}\) is allowed to be fully general. For \(t, s = 1, \ldots, T\), \(\operatorname{Var}[\varepsilon_{it}] = \sigma_i^2\) and \(\operatorname{Cov}[\varepsilon_{it}, \varepsilon_{js}] = \gamma_{ij}(|t - s|)\). For long time series, this model is likely to have far too many parameters to be identified without some restrictions, such as first-order homogeneity (\(\boldsymbol\beta_i = \boldsymbol\beta\ \forall i\)), uncorrelatedness across groups [\(\gamma_{ij}(\cdot) = 0\) for \(i \neq j\)], groupwise homoscedasticity (\(\sigma_i^2 = \sigma^2\ \forall i\)), and nonautocorrelatedness [\(\gamma(r) = 0\ \forall r \neq 0\)]. With these assumptions, the estimation procedure they propose is similar to the procedures suggested earlier. If the model imposes enough restrictions, then the parameters can be estimated by the method of moments. The authors discuss estimation of the model in its full generality. Finally, the latent class model discussed in Section 16.2.3 and the random parameters model in Section 17.8 extend naturally to the Poisson model. Indeed, most of the received applications of the latent class structure have been in the Poisson regression framework. [See Greene (2001) for a survey.]
    21.9.6 HURDLE AND ZERO-ALTERED POISSON MODELS

    In some settings, the zero outcome of the data generating process is qualitatively different from the positive ones. Mullahy (1986) argues that this fact constitutes a shortcoming of the Poisson (or negative binomial) model and suggests a "hurdle" model as an alternative. [74] In his formulation, a binary probability model determines whether a zero or a nonzero outcome occurs, then, in the latter case, a (truncated) Poisson distribution describes the positive outcomes. The model is

    \[ \operatorname{Prob}(y_i = 0 \mid \mathbf{x}_i) = e^{-\theta}, \]
    \[ \operatorname{Prob}(y_i = j \mid \mathbf{x}_i) = \frac{(1 - e^{-\theta})\, e^{-\lambda_i}\lambda_i^j}{j!\,(1 - e^{-\lambda_i})}, \quad j = 1, 2, \ldots. \]
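The hurdle probabilities above can be sketched directly. The example checks that the zero probability is \(e^{-\theta}\) and that the rescaled truncated Poisson probabilities bring the total to one; the function name and the parameter values are hypothetical.

```python
import math


def hurdle_pmf(j, lam, theta):
    """Hurdle model probabilities.

    Prob(y = 0) = exp(-theta)
    Prob(y = j) = (1 - exp(-theta)) * Poisson(j; lam) / (1 - exp(-lam)),
                  for j = 1, 2, ...
    """
    p_zero = math.exp(-theta)
    if j == 0:
        return p_zero
    log_poisson = -lam + j * math.log(lam) - math.lgamma(j + 1)
    return (1.0 - p_zero) * math.exp(log_poisson) / (1.0 - math.exp(-lam))


if __name__ == "__main__":
    lam, theta = 1.7, 0.4                      # hypothetical values
    print(hurdle_pmf(0, lam, theta), math.exp(-lam))   # hurdle vs. Poisson zero
    print(sum(hurdle_pmf(j, lam, theta) for j in range(101)))
```

When \(\theta = \lambda_i\), the hurdle probabilities reduce to the ordinary Poisson probabilities.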

    74. For a similar treatment in a continuous data application, see Cragg (1971).


    This formulation changes the probability of the zero outcome and scales the remaining probabilities so that they sum to one. It adds a new restriction that \(\operatorname{Prob}(y_i = 0 \mid \mathbf{x}_i)\) no longer depends on the covariates, however. Therefore, a natural next step is to parameterize this probability. Mullahy suggests some formulations and applies the model to a sample of observations on daily beverage consumption.

    Mullahy (1986), Heilbron (1989), Lambert (1992), Johnson and Kotz (1993), and Greene (1994) have analyzed an extension of the hurdle model in which the zero outcome can arise from one of two regimes. [75] In one regime, the outcome is always zero. In the other, the usual Poisson process is at work, which can produce the zero outcome or some other. In Lambert's application, she analyzes the number of defective items produced by a manufacturing process in a given time interval. If the process is under control, then the outcome is always zero (by definition). If it is not under control, then the number of defective items is distributed as Poisson and may be zero or positive in any period. The model at work is therefore

    \[ \operatorname{Prob}(y_i = 0 \mid \mathbf{x}_i) = \operatorname{Prob}(\text{regime 1}) + \operatorname{Prob}(y_i = 0 \mid \mathbf{x}_i, \text{regime 2})\operatorname{Prob}(\text{regime 2}), \]
    \[ \operatorname{Prob}(y_i = j \mid \mathbf{x}_i) = \operatorname{Prob}(y_i = j \mid \mathbf{x}_i, \text{regime 2})\operatorname{Prob}(\text{regime 2}), \quad j = 1, 2, \ldots. \]

    Let z denote a binary indicator of regime 1 (z = 0) or regime 2 (z = 1), and let \(y^*\) denote the outcome of the Poisson process in regime 2. Then the observed y is \(z \times y^*\). A natural extension of the splitting model is to allow z to be determined by a set of covariates. These covariates need not be the same as those that determine the conditional probabilities in the Poisson process. Thus, the model is

    \[ \operatorname{Prob}(z_i = 1 \mid \mathbf{w}_i) = F(\mathbf{w}_i, \boldsymbol\gamma), \]
    \[ \operatorname{Prob}(y_i = j \mid \mathbf{x}_i, z_i = 1) = \frac{e^{-\lambda_i}\lambda_i^j}{j!}. \]

    The mean in this distribution is

    \[ E[y_i \mid \mathbf{x}_i] = F \times 0 + (1 - F) \times E[y_i^* \mid \mathbf{x}_i, y_i^* > 0] = (1 - F) \times \frac{\lambda_i}{1 - e^{-\lambda_i}}. \]
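The two-regime probabilities can be sketched as below. The code follows the pair of regime equations given in the text: `p_regime1` is the probability of the always-zero regime, here produced by a logit function of hypothetical covariates. The function names, the covariates, and the coefficient values are all assumptions made for the illustration.

```python
import math


def logit_prob(w, gamma):
    """Logistic probability 1 / (1 + exp(-w'gamma))."""
    return 1.0 / (1.0 + math.exp(-sum(g * x for g, x in zip(gamma, w))))


def zip_pmf(j, lam, p_regime1):
    """Two-regime (zero inflated) Poisson probabilities.

    Prob(y = 0) = p_regime1 + (1 - p_regime1) * exp(-lam)
    Prob(y = j) = (1 - p_regime1) * Poisson(j; lam),  j = 1, 2, ...
    """
    poisson = math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))
    if j == 0:
        return p_regime1 + (1.0 - p_regime1) * poisson
    return (1.0 - p_regime1) * poisson


if __name__ == "__main__":
    lam = 1.3                                           # hypothetical Poisson mean
    q = logit_prob([1.0, 0.5], [0.2, -0.4])             # hypothetical w and gamma
    print(q, zip_pmf(0, lam, q), math.exp(-lam))        # excess zeros vs. Poisson
    print(sum(zip_pmf(j, lam, q) for j in range(101)))  # sums to one
```

Setting `p_regime1` to zero recovers the ordinary Poisson probabilities, while any positive value raises the zero probability above the Poisson value.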

    Lambert (1992) and Greene (1994) consider a number of alternative formulations, including logit and probit models discussed in Sections 21.3 and 21.4, for the probability of the two regimes.

    Both of these modifications substantially alter the Poisson formulation. First, note that the equality of the mean and variance of the distribution no longer follows; both modifications induce overdispersion. On the other hand, the overdispersion does not arise from heterogeneity; it arises from the nature of the process generating the zeros. As such, an interesting identification problem arises in this model. If the data do appear to be characterized by overdispersion, then it seems less than obvious whether it should be attributed to heterogeneity or to the regime splitting mechanism. Mullahy (1986) argues the point more strongly. He demonstrates that overdispersion will always induce excess zeros. As such, in a splitting model, we are likely to misinterpret the excess zeros as due to the splitting process instead of the heterogeneity.

    75. The model is variously labeled the "With Zeros," or WZ, model [Mullahy (1986)], the "Zero Inflated Poisson," or ZIP, model [Lambert (1992)], and "Zero-Altered Poisson," or ZAP, model [Greene (1994)].


    It might be of interest to test simply whether there is a regime splitting mechanism at work or not. Unfortunately, the basic model and the zero-inflated model are not nested. Setting the parameters of the splitting model to zero, for example, does not produce Prob[z = 0] = 0. In the probit case, this probability becomes 0.5, which maintains the regime split. The preceding tests for over- or underdispersion would be rather indirect. What is desired is a test of non-Poissonness. An alternative distribution may (but need not) produce a systematically different proportion of zeros than the Poisson. Testing for a different distribution, as opposed to a different set of parameters, is a difficult procedure. Since the hypotheses are necessarily nonnested, the power of any test is a function of the alternative hypothesis and may, under some, be small. Vuong (1989) has proposed a test statistic for nonnested models that is well suited for this setting when the alternative distribution can be specified. Let \(f_j(y_i \mid \mathbf{x}_i)\) denote the predicted probability that the random variable Y equals \(y_i\) under the assumption that the distribution is \(f_j(y_i \mid \mathbf{x}_i)\), for j = 1, 2, and let

    \[ m_i = \log\left( \frac{f_1(y_i \mid \mathbf{x}_i)}{f_2(y_i \mid \mathbf{x}_i)} \right). \]

    Then Vuong's statistic for testing the nonnested hypothesis of Model 1 versus Model 2 is

    \[ v = \frac{\sqrt{n}\left[\frac{1}{n}\sum_{i=1}^n m_i\right]}{\sqrt{\frac{1}{n}\sum_{i=1}^n (m_i - \bar m)^2}}. \]

    This is the standard statistic for testing the hypothesis that \(E[m_i]\) equals zero. Vuong shows that v has a limiting standard normal distribution. As he notes, the statistic is bidirectional. If \(|v|\) is less than two, then the test does not favor one model or the other. Otherwise, large values favor Model 1 whereas small (negative) values favor Model 2. Carrying out the test requires estimation of both models and computation of both sets of predicted probabilities.

    In Greene (1994), it is shown that the Vuong test has some power to discern this phenomenon. The logic of the testing procedure is to allow for overdispersion by specifying a negative binomial count data process, then examine whether, even allowing for the overdispersion, there still appear to be excess zeros. In his application, that appears to be the case.
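Given the two sets of predicted probabilities, the Vuong statistic is a one-line calculation. The sketch below takes the probabilities as inputs; the function name and the numbers are hypothetical, and estimating the two competing models is outside its scope.

```python
import math


def vuong(f1, f2):
    """Vuong statistic v = sqrt(n) * mean(m) / sd(m), m_i = ln(f1_i / f2_i).

    f1, f2 : predicted probabilities of the observed outcomes under
             Model 1 and Model 2. The standard deviation uses the
             divisor n, as in the formula in the text.
    """
    m = [math.log(a / b) for a, b in zip(f1, f2)]
    n = len(m)
    m_bar = sum(m) / n
    sd = math.sqrt(sum((mi - m_bar) ** 2 for mi in m) / n)
    return math.sqrt(n) * m_bar / sd


if __name__ == "__main__":
    probs_model1 = [0.50, 0.20, 0.30, 0.40]    # hypothetical values
    probs_model2 = [0.40, 0.25, 0.10, 0.30]    # hypothetical values
    v = vuong(probs_model1, probs_model2)
    # |v| < 2 is inconclusive; large positive favors Model 1,
    # large negative favors Model 2.
    print(v)
```

Reversing the roles of the two models changes only the sign of the statistic, which is the sense in which the test is bidirectional.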
    Example 21.12 A Split Population Model for Major Derogatory Reports

    Greene (1995c) estimated a model of consumer behavior in which the dependent variable of interest was the number of major derogatory reports recorded in the credit history for a sample of applicants for a type of credit card. The basic model predicts \(y_i\), the number of major derogatory credit reports, as a function of \(\mathbf{x}_i\) = [1, age, income, average expenditure]. The data for the model appear in Appendix Table F21.4. There are 1,319 observations in the sample (10 percent of the original data set). Inspection of the data reveals a preponderance of zeros. Indeed, of 1,319 observations, 1,060 have \(y_i = 0\), whereas of the remaining 259, 137 have 1, 50 have 2, 24 have 3, 17 have 4, and 11 have 5; the remaining 20 range from 6 to 14. Thus, for a Poisson distribution, these data are actually a bit extreme. We propose to use Lambert's zero inflated Poisson model instead, with the Poisson distribution built around

    \[ \ln \lambda_i = \beta_1 + \beta_2\,\text{age} + \beta_3\,\text{income} + \beta_4\,\text{expenditure}. \]

    For the splitting model, we use a logit model, with covariates \(\mathbf{z}\) = [1, age, income, own/rent]. The estimates are shown in Table 21.21. Vuong's diagnostic statistic appears to confirm intuition that the Poisson model does not adequately describe the data; the value is 6.9788. Using the model parameters to compute a prediction of the number of zeros, it is clear that the splitting model does perform better than the basic Poisson regression.

    TABLE 21.21  Estimates of a Split Population Model

                   Poisson and Logit Models                    Split Population Model
    Variable       Poisson for y          Logit for y > 0      Poisson for y           Logit for y > 0
    Constant       -0.8196 (0.1453)       -2.2442 (0.2515)      1.0010 (0.1267)         2.1540 (0.2900)
    Age             0.007181 (0.003978)    0.02245 (0.007313)  -0.005073 (0.003218)    -0.02469 (0.008451)
    Income          0.07790 (0.02394)      0.06931 (0.04198)    0.01332 (0.02249)      -0.1167 (0.04941)
    Expend         -0.004102 (0.0003740)                       -0.002359 (0.0001948)
    Own/Rent                              -0.3766 (0.1578)                              0.3865 (0.1709)
    Log L          -1396.719              -645.5649            -1093.0280
    n P(0 | x)      938.6                                       1061.5

    21.10 SUMMARY AND CONCLUSIONS

    This chapter has surveyed techniques for modeling discrete choice. We examined four classes of models: binary choice, ordered choice, multinomial choice, and models for counts. The first three of these are quite far removed from the regression models (linear and nonlinear) that have been the focus of the preceding 20 chapters. The most important difference concerns the modeling approach. Up to this point, we have been primarily interested in modeling the conditional mean function for outcomes that vary continuously. In this chapter, we have shifted our approach to one of modeling the conditional probabilities of events.

    Modeling binary choice, the decision between two alternatives, is a growth area in the applied econometrics literature. Maximum likelihood estimation of fully parameterized models remains the mainstay of the literature. But we also considered semiparametric and nonparametric forms of the model and examined models for time series and panel data. The ordered choice model is a natural extension of the binary choice setting and also a convenient bridge between models of choice between two alternatives and more complex models of choice among multiple alternatives. Multinomial choice modeling is likewise a large field, both within economics and, especially, in many other fields, such as marketing, transportation, political science, and so on. The multinomial logit model and many variations of it provide an especially rich framework within which modelers have carefully matched behavioral modeling to empirical specification and estimation.

    Finally, models of count data are closer to regression models than the other three fields. The Poisson regression model is essentially a nonlinear regression, but, as in the other cases, it is more fruitful to do the modeling in terms of the probabilities of discrete choice rather than as a form of regression analysis.


    Key Terms and Concepts

    Attributes; Binary choice model; Bivariate probit; Bootstrapping; Butler and Moffitt method; Choice based sampling; Chow test; Conditional likelihood function; Conditional logit; Count data; Fixed effects model; Full information ML; Generalized residual; Goodness of fit measure; Grouped data; Heterogeneity; Heteroscedasticity; Incidental parameters problem; Inclusive value; Independence from irrelevant alternatives; Index function model; Individual data; Initial conditions; Kernel density estimator; Kernel function; Lagrange multiplier test; Latent regression; Likelihood equations; Likelihood ratio test; Limited information ML; Linear probability model; Logit; Marginal effects; Maximum likelihood; Maximum score estimator; Maximum simulated likelihood; Mean squared deviation; Minimal sufficient statistic; Minimum chi squared estimator; Multinomial logit; Multinomial probit; Multivariate probit; Negative binomial model; Nested logit; Nonnested models; Normit; Ordered choice model; Overdispersion; Persistence; Poisson model; Probit; Proportions data; Quadrature; Qualitative choice; Qualitative response; Quasi-MLE; Random coefficients; Random effects model; Random parameters model; Random utility model; Ranking; Recursive model; Robust covariance estimation; Sample selection; Scoring method; Semiparametric estimation; State dependence; Unbalanced sample; Unordered; Weibull model

    Exercises

    1. A binomial probability model is to be based on the following index function model:

    \[ y^* = \alpha + \beta d + \varepsilon, \qquad y = 1 \text{ if } y^* > 0, \qquad y = 0 \text{ otherwise}. \]

    The only regressor, d, is a dummy variable. The data consist of 100 observations that have the following:

                  y = 0    y = 1
        d = 0      24       28
        d = 1      32       16

    Obtain the maximum likelihood estimators of \(\alpha\) and \(\beta\), and estimate the asymptotic standard errors of your estimates. Test the hypothesis that \(\beta\) equals zero by using a Wald test (asymptotic t test) and a likelihood ratio test. Use the probit model and then repeat, using the logit model. Do your results change? [Hint: Formulate the log-likelihood in terms of \(\alpha\) and \(\delta = \alpha + \beta\).]


    2. Suppose that a linear probability model is to be fit to a set of observations on a dependent variable y that takes values zero and one, and a single regressor x that varies continuously across observations. Obtain the exact expressions for the least squares slope in the regression in terms of the mean(s) and variance of x, and interpret the result.

    3. Given the data set

        y:  1  0  0  1  1  0  0  1  1  1
        x:  9  2  5  4  6  7  3  5  2  6

    estimate a probit model and test the hypothesis that x is not influential in determining the probability that y equals one.

    4. Construct the Lagrange multiplier statistic for testing the hypothesis that all the slopes (but not the constant term) equal zero in the binomial logit model. Prove that the Lagrange multiplier statistic is \(nR^2\) in the regression of \((y_i - p)\) on the x's, where p is the sample proportion of 1s.

    5. We are interested in the ordered probit model. Our data consist of 250 observations, of which the responses are

        y:   0   1   2   3   4
        n:  50  40  45  80  35

    Using the preceding data, obtain maximum likelihood estimates of the unknown parameters of the model. [Hint: Consider the probabilities as the unknown parameters.]

    6. The following hypothetical data give the participation rates in a particular type of recycling program and the number of trucks purchased for collection by 10 towns in a small mid-Atlantic state:

        Town             1     2     3     4     5     6     7     8     9    10
        Trucks         160   250   170   365   210   206   203   305   270   340
        Participation  11%   74%    8%   87%   62%   83%   48%   84%   71%   79%

    The town of Eleven is contemplating initiating a recycling program but wishes to achieve a 95 percent rate of participation. Using a probit model for your analysis,
    a. How many trucks would the town expect to have to purchase in order to achieve their goal? [Hint: See Section 21.4.6.] Note that you will use \(n_i = 1\).
    b. If trucks cost $20,000 each, then is a goal of 90 percent reachable within a budget of $6.5 million? (That is, should they expect to reach the goal?)
    c. According to your model, what is the marginal value of the 301st truck in terms of the increase in the percentage participation?

    7. A data set consists of \(n = n_1 + n_2 + n_3\) observations on y and x. For the first \(n_1\) observations, y = 1 and x = 1. For the next \(n_2\) observations, y = 0 and x = 1. For the last \(n_3\) observations, y = 0 and x = 0. Prove that neither (21-19) nor (21-21) has a solution.


    8. Data on t = strike duration and x = unanticipated industrial production for a number of strikes in each of 9 years are given in Appendix Table F22.1. Use the Poisson regression model discussed in Section 21.9 to determine whether x is a significant determinant of the number of strikes in a given year.

    9. Asymptotics. Explore whether averaging individual marginal effects gives the same answer as computing the marginal effect at the mean.

    10. Prove (21-28).

    11. In the panel data models estimated in Example 21.5.1, neither the logit nor the probit model provides a framework for applying a Hausman test to determine whether fixed or random effects is preferred. Explain. (Hint: Unlike our application in the linear model, the incidental parameters problem persists here.)

    Greene 50240

    book

    June 28 2002

    17 5

    22

    LIMITED DEPENDENT VARIABLE AND DURATION MODELS

    Q
    22 1 INTRODUCTION This chapter is concerned with truncation and censoring 1 The effect of truncation occurs when sample data are drawn from a subset of a larger population of interest For example studies of income based on incomes above or below some poverty line may be of limited usefulness for inference about the whole population Truncation is essentially a characteristic of the distribution from which the sample data are drawn Censoring is a more common problem in recent studies To continue the example suppose that instead of being unobserved all incomes below the poverty line are reported as if they were at the poverty line The censoring of a range of values of the variable of interest introduces a distortion into conventional statistical results that is similar to that of truncation Unlike truncation however censoring is essentially a defect in the sample data Presumably if they were not censored the data would be a representative sample from the population of interest This chapter will discuss four broad topics truncation censoring a form of truncation called the sample selection problem and a class of models called duration models Although most empirical work on the rst three involves censoring rather than truncation we will study the simpler model of truncation rst It provides most of the theoretical tools we need to analyze models of censoring and sample selection The fourth topic on models of duration When will a spell of unemployment or a strike end could reasonably stand alone It does in countless articles and a library of books 2 We include our introduction to this subject in this chapter because in most applications duration modeling involves censored data and it is thus convenient to treat duration here and because we are nearing the end of our survey and yet another chapter seems unwarranted 22 2 TRUNCATION

    In this section, we are concerned with inferring the characteristics of a full population from a sample drawn from a restricted part of that population.

    [1] Five of the many surveys of these topics are Dhrymes (1984), Maddala (1977b, 1983, 1984), and Amemiya (1984). The last is part of a symposium on censored and truncated regression models. A survey that is oriented toward applications and techniques is Long (1997). Some recent results on non- and semiparametric estimation appear in Lee (1996).
    [2] For example, Lancaster (1990) and Kiefer (1985).

    22.2.1 TRUNCATED DISTRIBUTIONS

    A truncated distribution is the part of an untruncated distribution that is above or below some specified value. For instance, in Example 22.2, we are given a characteristic of the distribution of incomes above $100,000. This subset is a part of the full distribution of incomes, which range from zero to (essentially) infinity.

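    THEOREM 22.1 Density of a Truncated Random Variable
    If a continuous random variable \(x\) has pdf \(f(x)\) and \(a\) is a constant, then[3]

    \[ f(x \mid x > a) = \frac{f(x)}{\operatorname{Prob}(x > a)}. \]

    The proof follows from the definition of conditional probability and amounts merely to scaling the density so that it integrates to one over the range above \(a\). Note that the truncated distribution is a conditional distribution.

    Most recent applications based on continuous random variables use the truncated normal distribution. If \(x\) has a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), then

    \[ \operatorname{Prob}(x > a) = 1 - \Phi\!\left(\frac{a - \mu}{\sigma}\right) = 1 - \Phi(\alpha), \]

    where \(\alpha = (a - \mu)/\sigma\) and \(\Phi(\cdot)\) is the standard normal cdf. The density of the truncated normal distribution is then

    \[ f(x \mid x > a) = \frac{f(x)}{1 - \Phi(\alpha)} = \frac{(2\pi\sigma^2)^{-1/2} e^{-(x - \mu)^2/(2\sigma^2)}}{1 - \Phi(\alpha)} = \frac{\frac{1}{\sigma}\,\phi\!\left(\frac{x - \mu}{\sigma}\right)}{1 - \Phi(\alpha)}, \]

    where \(\phi(\cdot)\) is the standard normal pdf. The truncated standard normal distribution, with \(\mu = 0\) and \(\sigma = 1\), is illustrated for \(a = -0.5\), \(0\), and \(0.5\) in Figure 22.1. Another truncated distribution which has appeared in the recent literature, this one for a discrete random variable, is the truncated at zero Poisson distribution,

    \[ \operatorname{Prob}[Y = y \mid y > 0] = \frac{e^{-\lambda}\lambda^{y}/y!}{\operatorname{Prob}[Y > 0]} = \frac{e^{-\lambda}\lambda^{y}/y!}{1 - \operatorname{Prob}[Y = 0]} = \frac{e^{-\lambda}\lambda^{y}/y!}{1 - e^{-\lambda}}, \quad \lambda > 0,\; y = 1, 2, \ldots \]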

    This distribution is used in models of uses of recreation and other kinds of facilities where observations of zero uses are discarded.[4] For convenience in what follows, we shall call a random variable whose distribution is truncated a truncated random variable.

    [3] The case of truncation from above instead of below is handled in an analogous fashion and does not require any new results.
    [4] See Shaw (1988).
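The truncated at zero Poisson probabilities can be checked numerically. The following is a minimal sketch, not part of the original text; the function name `truncated_poisson_pmf` and the argument name `lam` (for the Poisson parameter) are illustrative assumptions.

```python
from math import exp, factorial


def truncated_poisson_pmf(y, lam):
    """Prob[Y = y | Y > 0] for a Poisson variable with parameter lam > 0."""
    if y < 1:
        return 0.0  # zero counts are excluded by the truncation
    untruncated = exp(-lam) * lam ** y / factorial(y)
    return untruncated / (1.0 - exp(-lam))  # rescale by Prob[Y > 0]


# The rescaled probabilities should sum to one over y = 1, 2, ...
total = sum(truncated_poisson_pmf(y, 2.5) for y in range(1, 80))
print(total)
```

Dividing by 1 - exp(-lam) is the discrete counterpart of the scaling in Theorem 22.1.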


    [FIGURE 22.1 Truncated Normal Distributions. Vertical axis: density; horizontal axis: x. The truncation point and the mean of each distribution are marked.]

    22.2.2 MOMENTS OF TRUNCATED DISTRIBUTIONS

    We are usually interested in the mean and variance of the truncated random variable. They would be obtained by the general formula

    \[ E[x \mid x > a] = \int_{a}^{\infty} x\, f(x \mid x > a)\, dx \]

    for the mean and likewise for the variance.
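    Example 22.1 Truncated Uniform Distribution

    If \(x\) has a standard uniform distribution, denoted \(U(0, 1)\), then

    \[ f(x) = 1, \quad 0 \le x \le 1. \]

    The truncated at \(x = \tfrac{1}{3}\) distribution is also uniform:

    \[ f\!\left(x \mid x > \tfrac{1}{3}\right) = \frac{f(x)}{\operatorname{Prob}\!\left(x > \tfrac{1}{3}\right)} = \frac{1}{2/3} = \frac{3}{2}, \quad \tfrac{1}{3} \le x \le 1. \]

    The expected value is

    \[ E\!\left[x \mid x > \tfrac{1}{3}\right] = \int_{1/3}^{1} x \left(\tfrac{3}{2}\right) dx = \tfrac{2}{3}. \]

    For a variable distributed uniformly between \(L\) and \(U\), the variance is \((U - L)^2/12\). Thus,

    \[ \operatorname{Var}\!\left[x \mid x > \tfrac{1}{3}\right] = \tfrac{1}{27}. \]

    The mean and variance of the untruncated distribution are \(\tfrac{1}{2}\) and \(\tfrac{1}{12}\), respectively.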
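The arithmetic in Example 22.1 can be reproduced exactly with rational numbers. This is a small illustrative sketch; the helper name `truncated_uniform_moments` is an assumption made here, and it relies only on the fact, used in the example, that a uniform variable truncated from below is again uniform on the shortened interval.

```python
from fractions import Fraction


def truncated_uniform_moments(lower, upper, a):
    """Mean and variance of U(lower, upper) truncated from below at a."""
    lo = max(Fraction(lower), Fraction(a))  # truncation moves the lower bound up
    hi = Fraction(upper)
    mean = (lo + hi) / 2
    variance = (hi - lo) ** 2 / 12          # (U - L)^2 / 12
    return mean, variance


mean, variance = truncated_uniform_moments(0, 1, Fraction(1, 3))
full_mean, full_variance = truncated_uniform_moments(0, 1, 0)
print(mean, variance, full_mean, full_variance)
```

Comparing the two pairs shows both effects discussed after the example: the mean rises and the variance falls when the lower tail is removed.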


    Example 22.1 illustrates two results.

    1. If the truncation is from below, then the mean of the truncated variable is greater than the mean of the original one. If the truncation is from above, then the mean of the truncated variable is smaller than the mean of the original one. This is clearly visible in Figure 22.1.
    2. Truncation reduces the variance compared with the variance in the untruncated distribution.

    Henceforth, we shall use the terms truncated mean and truncated variance to refer to the mean and variance of the random variable with a truncated distribution. For the truncated normal distribution, we have the following theorem:[5]

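    THEOREM 22.2 Moments of the Truncated Normal Distribution
    If \(x \sim N[\mu, \sigma^2]\) and \(a\) is a constant, then

    \[ E[x \mid \text{truncation}] = \mu + \sigma\lambda(\alpha), \tag{22-1} \]
    \[ \operatorname{Var}[x \mid \text{truncation}] = \sigma^2[1 - \delta(\alpha)], \tag{22-2} \]

    where \(\alpha = (a - \mu)/\sigma\), \(\phi(\alpha)\) is the standard normal density, and

    \[ \lambda(\alpha) = \phi(\alpha)/[1 - \Phi(\alpha)] \quad \text{if truncation is } x > a, \tag{22-3a} \]
    \[ \lambda(\alpha) = -\phi(\alpha)/\Phi(\alpha) \quad \text{if truncation is } x < a, \tag{22-3b} \]

    and

    \[ \delta(\alpha) = \lambda(\alpha)[\lambda(\alpha) - \alpha]. \tag{22-4} \]

    An important result is

    \[ 0 < \delta(\alpha) < 1 \quad \text{for all values of } \alpha, \]

    which implies point 2 after Example 22.1. A result that we will use at several points below is \(d\phi(\alpha)/d\alpha = -\alpha\phi(\alpha)\). The function \(\lambda(\alpha)\) is called the inverse Mills ratio. The function in (22-3a) is also called the hazard function for the standard normal distribution.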
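The moments in Theorem 22.2 are straightforward to compute. The sketch below is illustrative only: the function names and the `from_below` flag are assumptions introduced here, and the standard normal pdf and cdf come from Python's `statistics.NormalDist`.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def inverse_mills(alpha, from_below=True):
    """lambda(alpha) as in (22-3a) for x > a, or (22-3b) for x < a."""
    if from_below:
        return STD_NORMAL.pdf(alpha) / (1.0 - STD_NORMAL.cdf(alpha))
    return -STD_NORMAL.pdf(alpha) / STD_NORMAL.cdf(alpha)


def truncated_normal_moments(mu, sigma, a, from_below=True):
    """Truncated mean (22-1) and truncated variance (22-2)."""
    alpha = (a - mu) / sigma
    lam = inverse_mills(alpha, from_below)
    delta = lam * (lam - alpha)              # (22-4), always between 0 and 1
    return mu + sigma * lam, sigma ** 2 * (1.0 - delta)


# Standard normal truncated at zero, from below and from above.
print(truncated_normal_moments(0.0, 1.0, 0.0, from_below=True))
print(truncated_normal_moments(0.0, 1.0, 0.0, from_below=False))
```

For the standard normal truncated at zero, the two cases give means of equal size and opposite sign and the same reduced variance, in line with the two results listed after Example 22.1.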
    Example 22.2 A Truncated Lognormal Income Distribution

    "The typical 'upper affluent American' makes $142,000 per year. The people surveyed had household income of at least $100,000."[6] Would this statistic tell us anything about the "typical American"? As it stands, it probably does not (popular impressions notwithstanding). The 1987 article where this appeared went on to state, "If you're in that category, pat yourself on the back; only 2 percent of American households make the grade, according to the survey." Since the degree of truncation in the sample is 98 percent, the $142,000 was probably quite far from the mean in the full population.

    Suppose that incomes in the population were lognormally distributed (see Section B.4.4). Then the log of income had a normal distribution with, say, mean \(\mu\) and standard deviation \(\sigma\). We'll deduce \(\mu\) and \(\sigma\) and then determine the population mean income. Let \(x\) = income and let \(y = \ln x\). Two useful numbers for this example are \(\ln 100 = 4.605\) and \(\ln 142 = 4.956\). Suppose that the survey was large enough for us to treat the sample average as the true mean. Then the article stated that \(E[y \mid y > 4.605] = 4.956\). It also told us that \(\operatorname{Prob}[y > 4.605] = 0.02\). From Theorem 22.2,

    \[ E[y \mid y > 4.605] = \mu + \frac{\sigma\phi(\alpha)}{1 - \Phi(\alpha)}, \]

    where \(\alpha = (4.605 - \mu)/\sigma\). We also know that \(\Phi(\alpha) = 0.98\), so \(\alpha = \Phi^{-1}(0.98) = 2.054\). We infer, then, that (a) \(2.054\sigma = 4.605 - \mu\). In addition, given \(\alpha = 2.054\), \(\phi(\alpha) = \phi(2.054) = 0.0484\). From (22-1), then, \(4.956 = \mu + \sigma(0.0484/0.02)\) or (b) \(4.956 = \mu + 2.420\sigma\). The solutions to (a) and (b) are \(\mu = 2.635\) and \(\sigma = 0.959\).

    To obtain the mean income, we now use the result that if \(y \sim N[\mu, \sigma^2]\) and \(x = e^{y}\), then \(E[x] = E[e^{y}] = e^{\mu + \sigma^2/2}\). Inserting our values for \(\mu\) and \(\sigma\) gives \(E[x]\) = $22,087. The 1987 Statistical Abstract of the United States listed average household income across all groups for the United States as about $25,000. So the estimate, based on surprisingly little information, would have been relatively good. These meager data did indeed tell us something about the average American.

    [5] Details may be found in Johnson, Kotz, and Balakrishnan (1994, pp. 156-158).
    [6] New York Post (1987).
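The two equations (a) and (b) in Example 22.2 are linear in the mean and standard deviation of log income, so they can be solved directly. The sketch below is illustrative (variable names are assumptions made here); it uses unrounded values of the logs and of the normal quantile, so its answers differ slightly from the rounded figures in the text.

```python
from math import exp, log
from statistics import NormalDist

std_normal = NormalDist()

a = log(100.0)            # truncation point in logs (income in thousands)
cond_mean = log(142.0)    # treated in the example as E[y | y > a]
tail_prob = 0.02          # Prob[y > a]

alpha = std_normal.inv_cdf(1.0 - tail_prob)   # (a - mu) / sigma
lam = std_normal.pdf(alpha) / tail_prob       # inverse Mills ratio, (22-3a)

# (a) alpha * sigma = a - mu    and    (b) cond_mean = mu + lam * sigma
sigma = (cond_mean - a) / (lam - alpha)
mu = a - alpha * sigma

mean_income = exp(mu + sigma ** 2 / 2.0)      # lognormal mean, in thousands
print(alpha, lam, mu, sigma, mean_income)
```

The implied mean income stays close to 22 thousand dollars, so the conclusion of the example does not hinge on the rounding.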
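    22.2.3 THE TRUNCATED REGRESSION MODEL

    In the model of the earlier examples, we now assume that \(\mu_i = x_i'\beta\) is the deterministic part of the classical regression model. Then

    \[ y_i = x_i'\beta + \varepsilon_i, \]

    where \(\varepsilon_i \mid x_i \sim N[0, \sigma^2]\), so that

    \[ y_i \mid x_i \sim N[x_i'\beta, \sigma^2]. \tag{22-5} \]

    We are interested in the distribution of \(y_i\) given that \(y_i\) is greater than the truncation point \(a\). This is the result described in Theorem 22.2. It follows that

    \[ E[y_i \mid y_i > a] = x_i'\beta + \sigma\,\frac{\phi[(a - x_i'\beta)/\sigma]}{1 - \Phi[(a - x_i'\beta)/\sigma]}. \tag{22-6} \]

    The conditional mean is therefore a nonlinear function of \(a\), \(\sigma\), \(x\), and \(\beta\). The marginal effects in this model in the subpopulation can be obtained by writing

    \[ E[y_i \mid y_i > a] = x_i'\beta + \sigma\lambda(\alpha_i), \tag{22-7} \]

    where now \(\alpha_i = (a - x_i'\beta)/\sigma\). For convenience, let \(\lambda_i = \lambda(\alpha_i)\) and \(\delta_i = \delta(\alpha_i)\). Then

    \[ \frac{\partial E[y_i \mid y_i > a]}{\partial x_i} = \beta + \sigma\,\frac{d\lambda_i}{d\alpha_i}\,\frac{\partial\alpha_i}{\partial x_i} = \beta + \sigma(\lambda_i^2 - \alpha_i\lambda_i)\left(\frac{-\beta}{\sigma}\right) = \beta(1 - \lambda_i^2 + \alpha_i\lambda_i) = \beta(1 - \delta_i). \tag{22-8} \]

    Note the appearance of the truncated variance. Since the truncated variance is between zero and one, we conclude that for every element of \(x_i\), the marginal effect is less than the corresponding coefficient. There is a similar attenuation of the variance. In the subpopulation \(y_i > a\), the regression variance is not \(\sigma^2\) but

    \[ \operatorname{Var}[y_i \mid y_i > a] = \sigma^2(1 - \delta_i). \tag{22-9} \]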
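The conditional mean, marginal effects, and variance in (22-7) through (22-9) can be evaluated for a given observation. This is a minimal sketch under stated assumptions: the function name, the list-based representation of the coefficient and regressor vectors, and the numerical values are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()


def truncated_regression_effects(beta, x, sigma, a):
    """Conditional mean (22-7), marginal effects (22-8), variance (22-9)
    for the subpopulation y > a."""
    xb = sum(b * v for b, v in zip(beta, x))
    alpha = (a - xb) / sigma
    lam = std_normal.pdf(alpha) / (1.0 - std_normal.cdf(alpha))
    delta = lam * (lam - alpha)
    cond_mean = xb + sigma * lam
    effects = [b * (1.0 - delta) for b in beta]   # every slope is attenuated
    cond_var = sigma ** 2 * (1.0 - delta)
    return cond_mean, effects, cond_var


# Hypothetical values: a constant and one regressor, truncation at zero.
beta, x, sigma, a = [1.0, 0.5], [1.0, 2.0], 1.5, 0.0
cond_mean, effects, cond_var = truncated_regression_effects(beta, x, sigma, a)
print(cond_mean, effects, cond_var)
```

Because 1 - delta lies strictly between zero and one, each reported effect is smaller in absolute value than its coefficient and the conditional variance is smaller than sigma squared.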

    Whether the marginal effect in (22-7) or the coefficient \(\beta\) itself is of interest depends on the intended inferences of the study. If the analysis is to be confined to the subpopulation, then (22-7) is of interest. If the study is intended to extend to the entire population, however, then it is the coefficients \(\beta\) that are actually of interest.

    One's first inclination might be to use ordinary least squares to estimate the parameters of this regression model. For the subpopulation from which the data are drawn, we could write (22-6) in the form

    \[ y_i \mid y_i > a = E[y_i \mid y_i > a] + u_i = x_i'\beta + \sigma\lambda_i + u_i, \tag{22-10} \]

    where \(u_i\) is \(y_i\) minus its conditional expectation. By construction, \(u_i\) has a zero mean, but it is heteroscedastic:

    \[ \operatorname{Var}[u_i] = \sigma^2(1 - \lambda_i^2 + \lambda_i\alpha_i) = \sigma^2(1 - \delta_i), \]

    which is a function of \(x_i\). If we estimate (22-10) by ordinary least squares regression of \(y\) on \(X\), then we have omitted a variable, the nonlinear term \(\lambda_i\). All the biases that arise because of an omitted variable can be expected.[7]

    Without some knowledge of the distribution of \(x\), it is not possible to determine how serious the bias is likely to be. A result obtained by Cheung and Goldberger (1984) is broadly suggestive. If \(E[x \mid y]\) in the full population is a linear function of \(y\), then \(\operatorname{plim} b = \beta\tau\) for some proportionality constant \(\tau\). This result is consistent with the widely observed (albeit rather rough) proportionality relationship between least squares estimates of this model and consistent maximum likelihood estimates.[8] The proportionality result appears to be quite general. In applications, it is usually found that, compared with consistent maximum likelihood estimates, the OLS estimates are biased toward zero. (See Example 22.4.)

    22.3 CENSORED DATA

    A very common problem in microeconomic data is censoring of the dependent variable. When the dependent variable is censored, values in a certain range are all transformed to (or reported as) a single value. Some examples that have appeared in the empirical literature are as follows:[9]

    1. Household purchases of durable goods [Tobin (1958)],
    2. The number of extramarital affairs [Fair (1977, 1978)],
    3. The number of hours worked by a woman in the labor force [Quester and Greene (1982)],
    4. The number of arrests after release from prison [Witte (1980)],
    5. Household expenditure on various commodity groups [Jarque (1987)],
    6. Vacation expenditures [Melenberg and van Soest (1996)].

    [7] See Heckman (1979), who formulates this as a "specification error."
    [8] See the appendix in Hausman and Wise (1977) and Greene (1983) as well.
    [9] More extensive listings may be found in Amemiya (1984) and Maddala (1983).

    Each of these studies analyzes a dependent variable that is zero for a significant fraction of the observations. Conventional regression methods fail to account for the qualitative difference between limit (zero) observations and nonlimit (continuous) observations.

    22.3.1 THE CENSORED NORMAL DISTRIBUTION

    The relevant distribution theory for a censored variable is similar to that for a truncated one. Once again, we begin with the normal distribution, as much of the received work has been based on an assumption of normality. We also assume that the censoring point is zero, although this is only a convenient normalization. In a truncated distribution, only the part of the distribution above \(y = 0\) is relevant to our computations. To make the distribution integrate to one, we scale it up by the probability that an observation in the untruncated population falls in the range that interests us. When data are censored, the distribution that applies to the sample data is a mixture of discrete and continuous distributions. Figure 22.2 illustrates the effects.

    To analyze this distribution, we define a new random variable \(y\) transformed from the original one, \(y^*\), by

    \[ y = 0 \quad \text{if } y^* \le 0, \qquad y = y^* \quad \text{if } y^* > 0. \]

    The distribution that applies if \(y^* \sim N[\mu, \sigma^2]\) is \(\operatorname{Prob}(y = 0) = \operatorname{Prob}(y^* \le 0) = \Phi(-\mu/\sigma) = 1 - \Phi(\mu/\sigma)\), and if \(y^* > 0\), then \(y\) has the density of \(y^*\). This distribution is a mixture of discrete and continuous parts. The total probability is one, as required, but instead of scaling the second part, we simply assign the full probability in the censored region to the censoring point, in this case, zero.
    [FIGURE 22.2 Partially Censored Distribution. Two panels: seats demanded and tickets sold, each with the capacity marked.]


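    THEOREM 22.3 Moments of the Censored Normal Variable
    If \(y^* \sim N[\mu, \sigma^2]\) and \(y = a\) if \(y^* \le a\) or else \(y = y^*\), then

    \[ E[y] = \Phi a + (1 - \Phi)(\mu + \sigma\lambda) \]

    and

    \[ \operatorname{Var}[y] = \sigma^2(1 - \Phi)[(1 - \delta) + (\alpha - \lambda)^2\Phi], \]

    where \(\Phi[(a - \mu)/\sigma] = \Phi(\alpha) = \operatorname{Prob}(y^* \le a) = \Phi\), \(\lambda = \phi/(1 - \Phi)\), and \(\delta = \lambda^2 - \lambda\alpha\).

    Proof: For the mean,

    \[ E[y] = \operatorname{Prob}(y = a) \times E[y \mid y = a] + \operatorname{Prob}(y > a) \times E[y \mid y > a] \]
    \[ \phantom{E[y]} = \operatorname{Prob}(y^* \le a) \times a + \operatorname{Prob}(y^* > a) \times E[y^* \mid y^* > a] \]
    \[ \phantom{E[y]} = \Phi a + (1 - \Phi)(\mu + \sigma\lambda) \]

    using Theorem 22.2. For the variance, we use a counterpart to the decomposition in (B-70), that is, Var[y] = E[conditional variance] + Var[conditional mean], and Theorem 22.2.

    For the special case of \(a = 0\), the mean simplifies to

    \[ E[y \mid a = 0] = \Phi(\mu/\sigma)(\mu + \sigma\lambda), \quad \text{where } \lambda = \frac{\phi(\mu/\sigma)}{\Phi(\mu/\sigma)}. \]

    For censoring of the upper part of the distribution instead of the lower, it is only necessary to reverse the role of \(\Phi\) and \(1 - \Phi\) and redefine \(\lambda\) as in Theorem 22.2.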
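The censored moments in Theorem 22.3 can be computed in a few lines. The function below is an illustrative sketch for censoring from below at a point a; its name and signature are assumptions made here.

```python
from statistics import NormalDist

std_normal = NormalDist()


def censored_normal_moments(mu, sigma, a):
    """Mean and variance of y = a if y* <= a, else y*, with y* ~ N(mu, sigma^2)."""
    alpha = (a - mu) / sigma
    Phi = std_normal.cdf(alpha)                  # Prob(y* <= a)
    lam = std_normal.pdf(alpha) / (1.0 - Phi)    # inverse Mills ratio
    delta = lam * lam - lam * alpha
    mean = Phi * a + (1.0 - Phi) * (mu + sigma * lam)
    variance = sigma ** 2 * (1.0 - Phi) * ((1.0 - delta) + (alpha - lam) ** 2 * Phi)
    return mean, variance


# Standard normal censored at zero: half of the probability sits at the limit.
mean, variance = censored_normal_moments(0.0, 1.0, 0.0)
print(mean, variance)
```

For the standard normal censored at zero, the mean equals the standard normal density at zero, which provides a quick sanity check on the formula.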
    Example 22.3 Censored Random Variable

    We are interested in the number of tickets demanded for events at a certain arena. Our only measure is the number actually sold. Whenever an event sells out, however, we know that the actual number demanded is larger than the number sold. The number of tickets demanded is censored when it is transformed to obtain the number sold. Suppose that the arena in question has 20,000 seats and, in a recent season, sold out 25 percent of the time. If the average attendance, including sellouts, was 18,000, then what are the mean and standard deviation of the demand for seats? According to Theorem 22.3, the 18,000 is an estimate of

    \[ E[\text{sales}] = 20{,}000(1 - \Phi) + \Phi[\mu + \sigma\lambda]. \]

    Since this is censoring from above, rather than below, \(\lambda = -\phi/\Phi\). The argument of \(\Phi\), \(\phi\), and \(\lambda\) is \(\alpha = (20{,}000 - \mu)/\sigma\). If 25 percent of the events are sellouts, then \(\Phi = 0.75\). Inverting the standard normal at 0.75 gives \(\alpha = 0.675\). In addition, if \(\alpha = 0.675\), then \(-\phi(0.675)/0.75 = \lambda = -0.424\). This result provides two equations in \(\mu\) and \(\sigma\), (a) \(18{,}000 = 0.25(20{,}000) + 0.75(\mu - 0.424\sigma)\) and (b) \(0.675\sigma = 20{,}000 - \mu\). The solutions are \(\sigma = 2426\) and \(\mu = 18{,}362\).

    For comparison, suppose that we were told that the mean of 18,000 applies only to the events that were not sold out and that, on average, the arena sells out 25 percent of the time. Now our estimates would be obtained from the equations (a) \(18{,}000 = \mu - 0.424\sigma\) and (b) \(0.675\sigma = 20{,}000 - \mu\). The solutions are \(\sigma = 1820\) and \(\mu = 18{,}772\).
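Both versions of Example 22.3 reduce to two linear equations in the mean and standard deviation of demand. The following sketch solves them; the helper `solve_demand` and the variable names are assumptions made for illustration, and unrounded normal quantiles are used, so the results differ from the text's rounded values by a unit or two.

```python
from statistics import NormalDist

std_normal = NormalDist()

capacity = 20_000.0
sellout_share = 0.25
reported_mean = 18_000.0

Phi = 1.0 - sellout_share                 # Prob(demand <= capacity)
alpha = std_normal.inv_cdf(Phi)           # (capacity - mu) / sigma
lam = -std_normal.pdf(alpha) / Phi        # censoring from above, (22-3b)


def solve_demand(mean_of_uncensored_events):
    """Solve mu + lam*sigma = mean_of_uncensored_events, alpha*sigma = capacity - mu."""
    sigma = (capacity - mean_of_uncensored_events) / (alpha - lam)
    mu = capacity - alpha * sigma
    return mu, sigma


# Case 1: 18,000 is the average over all events, sellouts included.
mu_all, sigma_all = solve_demand((reported_mean - sellout_share * capacity) / Phi)
# Case 2: 18,000 is the average over events that did not sell out.
mu_unsold, sigma_unsold = solve_demand(reported_mean)
print(mu_all, sigma_all, mu_unsold, sigma_unsold)
```

The only difference between the two cases is whether the 25 percent mass at capacity is included in the reported average, which is exactly the censored versus truncated distinction.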
    22.3.2 THE CENSORED REGRESSION (TOBIT) MODEL

    The regression model based on the preceding discussion is referred to as the censored regression model or the tobit model (in reference to Tobin (1958), where the model was first proposed). The regression is obtained by making the mean in the preceding correspond to a classical regression model. The general formulation is usually given in terms of an index function,

    \[ y_i^* = x_i'\beta + \varepsilon_i, \]
    \[ y_i = 0 \quad \text{if } y_i^* \le 0, \tag{22-11} \]
    \[ y_i = y_i^* \quad \text{if } y_i^* > 0. \]

    There are potentially three conditional mean functions to consider, depending on the purpose of the study. For the index variable, sometimes called the latent variable, \(E[y_i^* \mid x_i]\) is \(x_i'\beta\). If the data are always censored, however, then this result will usually not be useful. Consistent with Theorem 22.3, for an observation randomly drawn from the population, which may or may not be censored,

    \[ E[y_i \mid x_i] = \Phi\!\left(\frac{x_i'\beta}{\sigma}\right)(x_i'\beta + \sigma\lambda_i), \]

    where

    \[ \lambda_i = \frac{\phi[(0 - x_i'\beta)/\sigma]}{1 - \Phi[(0 - x_i'\beta)/\sigma]} = \frac{\phi(x_i'\beta/\sigma)}{\Phi(x_i'\beta/\sigma)}. \tag{22-12} \]

    Finally, if we intend to confine our attention to uncensored observations, then the results for the truncated regression model apply. The limit observations should not be discarded, however, because the truncated regression model is no more amenable to least squares than the censored data model. It is an unresolved question which of these functions should be used for computing predicted values from this model. Intuition suggests that \(E[y_i \mid x_i]\) is correct, but authors differ on this point. For the setting in Example 22.3, for predicting the number of tickets sold, say, to plan for an upcoming event, the censored mean is obviously the relevant quantity. On the other hand, if the objective is to study the need for a new facility, then the mean of the latent variable \(y_i^*\) would be more interesting.

    There are differences in the marginal effects as well. For the index variable,

    \[ \frac{\partial E[y_i^* \mid x_i]}{\partial x_i} = \beta. \]

    But this result is not what will usually be of interest, since \(y_i^*\) is unobserved. For the observed data, \(y_i\), the following general result will be useful:[10]

    [10] See Greene (1999) for the general result and Rosett and Nelson (1975) and Nakamura and Nakamura (1983) for applications based on the normal distribution.


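    THEOREM 22.4 Marginal Effects in the Censored Regression Model
    In the censored regression model with latent regression \(y^* = x'\beta + \varepsilon\) and observed dependent variable, \(y = a\) if \(y^* \le a\), \(y = b\) if \(y^* \ge b\), and \(y = y^*\) otherwise, where \(a\) and \(b\) are constants, let \(f(\varepsilon)\) and \(F(\varepsilon)\) denote the density and cdf of \(\varepsilon\). Assume that \(\varepsilon\) is a continuous random variable with mean 0 and variance \(\sigma^2\), and \(f(\varepsilon \mid x) = f(\varepsilon)\). Then

    \[ \frac{\partial E[y \mid x]}{\partial x} = \beta \times \operatorname{Prob}[a < y^* < b]. \]

    Proof: By definition,

    \[ E[y \mid x] = a \operatorname{Prob}[y^* \le a \mid x] + b \operatorname{Prob}[y^* \ge b \mid x] + \operatorname{Prob}[a < y^* < b \mid x]\, E[y^* \mid a < y^* < b, x]. \]

    Let \(\alpha_j = (j - x'\beta)/\sigma\), \(F_j = F(\alpha_j)\), \(f_j = f(\alpha_j)\), and \(j = a, b\). Then

    \[ E[y \mid x] = a F_a + b(1 - F_b) + (F_b - F_a)\, E[y^* \mid a < y^* < b, x]. \]

    Since \(y^* = x'\beta + \sigma[(y^* - x'\beta)/\sigma]\), the conditional mean may be written

    \[ E[y^* \mid a < y^* < b, x] = x'\beta + \sigma E\!\left[\frac{y^* - x'\beta}{\sigma} \,\Big|\, \frac{a - x'\beta}{\sigma} < \frac{y^* - x'\beta}{\sigma} < \frac{b - x'\beta}{\sigma}\right] = x'\beta + \sigma \int_{\alpha_a}^{\alpha_b} \frac{(\varepsilon/\sigma) f(\varepsilon/\sigma)}{F_b - F_a}\, d\!\left(\frac{\varepsilon}{\sigma}\right). \]

    Collecting terms, we have

    \[ E[y \mid x] = a F_a + b(1 - F_b) + (F_b - F_a)\, x'\beta + \sigma \int_{\alpha_a}^{\alpha_b} \left(\frac{\varepsilon}{\sigma}\right) f\!\left(\frac{\varepsilon}{\sigma}\right) d\!\left(\frac{\varepsilon}{\sigma}\right). \]

    Now, differentiate with respect to \(x\). The only complication is the last term, for which the differentiation is with respect to the limits of integration. We use Leibnitz's theorem and use the assumption that \(f(\varepsilon)\) does not involve \(x\). Thus,

    \[ \frac{\partial E[y \mid x]}{\partial x} = \left(\frac{-\beta}{\sigma}\right) a f_a - \left(\frac{-\beta}{\sigma}\right) b f_b + (F_b - F_a)\beta + (x'\beta)(f_b - f_a)\left(\frac{-\beta}{\sigma}\right) + \sigma[\alpha_b f_b - \alpha_a f_a]\left(\frac{-\beta}{\sigma}\right). \]

    After inserting the definitions of \(\alpha_a\) and \(\alpha_b\), and collecting terms, we find all terms sum to zero save for the desired result,

    \[ \frac{\partial E[y \mid x]}{\partial x} = (F_b - F_a)\beta = \beta \times \operatorname{Prob}[a < y_i^* < b]. \]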

    Note that this general result includes censoring in either or both tails of the distribution, and it does not assume that \(\varepsilon\) is normally distributed. For the standard case with censoring at zero and normally distributed disturbances, the result specializes to

    \[ \frac{\partial E[y_i \mid x_i]}{\partial x_i} = \beta\,\Phi\!\left(\frac{x_i'\beta}{\sigma}\right). \]

    Although not a formal result, this does suggest a reason why, in general, least squares estimates of the coefficients in a tobit model usually resemble the MLEs times the proportion of nonlimit observations in the sample.

    McDonald and Moffitt (1980) suggested a useful decomposition of \(\partial E[y_i \mid x_i]/\partial x_i\),

    \[ \frac{\partial E[y_i \mid x_i]}{\partial x_i} = \beta \times \left\{\Phi_i[1 - \lambda_i(\alpha_i + \lambda_i)] + \phi_i(\alpha_i + \lambda_i)\right\}, \]

    where \(\alpha_i = x_i'\beta/\sigma\), \(\Phi_i = \Phi(\alpha_i)\), and \(\lambda_i = \phi_i/\Phi_i\). Taking the two parts separately, this result decomposes the slope vector into

    \[ \frac{\partial E[y_i \mid x_i]}{\partial x_i} = \operatorname{Prob}[y_i > 0]\,\frac{\partial E[y_i \mid x_i, y_i > 0]}{\partial x_i} + E[y_i \mid x_i, y_i > 0]\,\frac{\partial \operatorname{Prob}[y_i > 0]}{\partial x_i}. \]

    Thus, a change in \(x_i\) has two effects: It affects the conditional mean of \(y_i^*\) in the positive part of the distribution, and it affects the probability that the observation will fall in that part of the distribution.
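The McDonald and Moffitt decomposition can be verified numerically: its two pieces should add up to the probability of a nonlimit observation, which is the scale factor on the coefficient vector in the normal case with censoring at zero. The sketch below is illustrative; the function name and the index value used are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()


def tobit_slope_decomposition(xb, sigma):
    """Scale factors that multiply beta in the tobit model censored at zero.

    Returns (conditional-mean part, probability part, Prob[y > 0]).
    """
    alpha = xb / sigma
    Phi = std_normal.cdf(alpha)
    phi = std_normal.pdf(alpha)
    lam = phi / Phi
    # Prob[y > 0] times the slope of E[y | x, y > 0]
    mean_part = Phi * (1.0 - lam * (alpha + lam))
    # E[y | x, y > 0] times the slope of Prob[y > 0]
    prob_part = phi * (alpha + lam)
    return mean_part, prob_part, Phi


mean_part, prob_part, prob_nonlimit = tobit_slope_decomposition(xb=0.7, sigma=1.3)
print(mean_part, prob_part, mean_part + prob_part, prob_nonlimit)
```

Reporting the two parts separately shows how much of a regressor's effect on observed y works through the probability of being above the limit rather than through the mean of the positive observations.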
    Example 22.4 Estimated Tobit Equations for Hours Worked

    In their study of the number of hours worked in a survey year by a large sample of wives, Quester and Greene (1982) were interested in whether wives whose marriages were statistically more likely to dissolve hedged against that possibility by spending, on average, more time working. They reported the tobit estimates given in Table 22.1. The last figure in the table implies that a very large proportion of the women reported zero hours, so least squares regression would be inappropriate.

    The figures in parentheses are the ratio of the coefficient estimate to the estimated asymptotic standard error. The dependent variable is hours worked in the survey year. "Small kids" is a dummy variable indicating whether there were children in the household. The "education difference" and "relative wage" variables compare husband and wife on these two dimensions. The wage rate used for wives was predicted using a previously estimated regression model and is thus available for all individuals, whether working or not. "Second marriage" is a dummy variable. Divorce probabilities were produced by a large microsimulation model presented in another study [Orcutt, Caldwell, and Wertheimer (1976)]. The variables used here were dummy variables indicating "mean" if the predicted probability was between 0.01 and 0.03 and "high" if it was greater than 0.03. The "slopes" are the marginal effects described earlier.

    Note the marginal effects compared with the tobit coefficients. Likewise, the estimate of \(\sigma\) is quite misleading as an estimate of the standard deviation of hours worked.

    The effects of the divorce probability variables were as expected and were quite large. One of the questions raised in connection with this study was whether the divorce probabilities could reasonably be treated as independent variables. It might be that for these individuals, the number of hours worked was a significant determinant of the probability.
    TABLE 22.1 Tobit Estimates of an Hours Worked Equation

                                      White Wives                 Black Wives             Least     Scaled
                                  Coefficient       Slope     Coefficient       Slope    Squares       OLS
    Constant                    -1803.13  (-8.64)            -2753.87  (-9.68)
    Small kids                  -1324.84 (-19.78)  -385.89    -824.19 (-10.14)  -376.53   -352.63   -766.56
    Education difference          -48.08  (-4.77)   -14.00      22.59   (1.96)    10.32     11.47     24.93
    Relative wage                 312.07   (5.71)    90.90     286.39   (3.32)   130.93    123.95    269.46
    Second marriage               175.85   (3.47)    51.51      25.33   (0.41)    11.57     13.14     28.57
    Mean divorce probability      417.39   (6.52)   121.58     481.02   (5.28)   219.75    219.22    476.57
    High divorce probability      670.22   (8.40)   195.22     578.66   (5.33)   264.36    244.17    530.80
    sigma                        1559                 618      1511                 826
    Sample size                  7459                          2798
    Proportion working              0.29                          0.46

    22.3.3 ESTIMATION

    Estimation of this model is very similar to that of truncated regression. The tobit model has become so routine and been incorporated in so many computer packages that, despite formidable obstacles in years past, estimation is now essentially on the level of ordinary linear regression.[11] The log-likelihood for the censored regression model is

    \[ \ln L = \sum_{y_i > 0} -\frac{1}{2}\left[\log(2\pi) + \ln\sigma^2 + \frac{(y_i - x_i'\beta)^2}{\sigma^2}\right] + \sum_{y_i = 0} \ln\left[1 - \Phi\!\left(\frac{x_i'\beta}{\sigma}\right)\right]. \tag{22-13} \]

    The two parts correspond to the classical regression for the nonlimit observations and the relevant probabilities for the limit observations, respectively. This likelihood is a nonstandard type, since it is a mixture of discrete and continuous distributions. In a seminal paper, Amemiya (1973) showed that despite the complications, proceeding in the usual fashion to maximize log L would produce an estimator with all the familiar desirable properties attained by MLEs.

    The log-likelihood function is fairly involved, but Olsen's (1978) reparameterization simplifies things considerably. With \(\gamma = \beta/\sigma\) and \(\theta = 1/\sigma\), the log-likelihood is

    \[ \ln L = \sum_{y_i > 0} -\frac{1}{2}\left[\ln(2\pi) - \ln\theta^2 + (\theta y_i - x_i'\gamma)^2\right] + \sum_{y_i = 0} \ln\left[1 - \Phi(x_i'\gamma)\right]. \tag{22-14} \]
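The two forms of the tobit log-likelihood, (22-13) and (22-14), should give the same value when the parameters are linked by Olsen's transformation. The sketch below evaluates both for a tiny data set; the function names and the data are hypothetical, invented purely to illustrate the calculation, and no maximization is attempted.

```python
from math import log, pi
from statistics import NormalDist

std_normal = NormalDist()


def tobit_loglik(beta, sigma, y, X):
    """Log-likelihood (22-13) for censoring at zero."""
    total = 0.0
    for yi, xi in zip(y, X):
        xb = sum(b * v for b, v in zip(beta, xi))
        if yi > 0:   # nonlimit observation: normal density term
            total += -0.5 * (log(2 * pi) + log(sigma ** 2) + (yi - xb) ** 2 / sigma ** 2)
        else:        # limit observation: probability of censoring
            total += log(1.0 - std_normal.cdf(xb / sigma))
    return total


def tobit_loglik_olsen(gamma, theta, y, X):
    """Log-likelihood (22-14) with gamma = beta/sigma and theta = 1/sigma."""
    total = 0.0
    for yi, xi in zip(y, X):
        xg = sum(g * v for g, v in zip(gamma, xi))
        if yi > 0:
            total += -0.5 * (log(2 * pi) - log(theta ** 2) + (theta * yi - xg) ** 2)
        else:
            total += log(1.0 - std_normal.cdf(xg))
    return total


# Hypothetical data: a constant and one regressor, two limit observations.
y = [0.0, 1.2, 0.0, 2.5]
X = [[1.0, -1.0], [1.0, 0.5], [1.0, -0.2], [1.0, 1.5]]
beta, sigma = [0.3, 1.1], 0.8
gamma, theta = [b / sigma for b in beta], 1.0 / sigma

print(tobit_loglik(beta, sigma, y, X), tobit_loglik_olsen(gamma, theta, y, X))
```

In practice either function could be handed to a numerical optimizer; the reparameterized form is the one whose Hessian is globally well behaved.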

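    The results in this setting are now very similar to those for the truncated regression. The Hessian is always negative definite, so Newton's method is simple to use and usually converges quickly. After convergence, the original parameters can be recovered using \(\sigma = 1/\theta\) and \(\beta = \gamma/\theta\). The asymptotic covariance matrix for these estimates can be obtained from that for the estimates of \([\gamma, \theta]\) using \(\text{Est.Asy.Var}[\hat\beta, \hat\sigma] = \hat{J}\,\text{Asy.Var}[\hat\gamma, \hat\theta]\,\hat{J}'\), where

    \[ J = \begin{bmatrix} \partial\beta/\partial\gamma' & \partial\beta/\partial\theta \\ \partial\sigma/\partial\gamma' & \partial\sigma/\partial\theta \end{bmatrix} = \begin{bmatrix} (1/\theta) I & (-1/\theta^2)\gamma \\ 0' & (-1/\theta^2) \end{bmatrix}. \]

    [11] See Hall (1984).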


    Researchers often compute ordinary least squares estimates despite their inconsistency. Almost without exception, it is found that the OLS estimates are smaller in absolute value than the MLEs. A striking empirical regularity is that the maximum likelihood estimates can often be approximated by dividing the OLS estimates by the proportion of nonlimit observations in the sample.[12] The effect is illustrated in the last two columns of Table 22.1. Another strategy is to discard the limit observations, but we now see that just trades the censoring problem for the truncation problem.

    22.3.4 SOME ISSUES IN SPECIFICATION

    Two issues that commonly arise in microeconomic data, heteroscedasticity and nonnormality, have been analyzed at length in the tobit setting.[13]

    22.3.4.a Heteroscedasticity

    Maddala and Nelson (1975), Hurd (1979), Arabmazar and Schmidt (1982a,b), and Brown and Moffitt (1982) all have varying degrees of pessimism regarding how inconsistent the maximum likelihood estimator will be when heteroscedasticity occurs. Not surprisingly, the degree of censoring is the primary determinant. Unfortunately, all the analyses have been carried out in the setting of very specific models (for example, involving only a single dummy variable or one with groupwise heteroscedasticity), so the primary lesson is the very general conclusion that heteroscedasticity emerges as an obviously serious problem.

    One can approach the heteroscedasticity problem directly. Petersen and Waldman (1981) present the computations needed to estimate a tobit model with heteroscedasticity of several types. Replacing \(\sigma\) with \(\sigma_i\) in the log-likelihood function and including \(\sigma_i^2\) in the summations produces the needed generality. Specification of a particular model for \(\sigma_i\) provides the empirical model for estimation.
    Example 22.5 Multiplicative Heteroscedasticity in the Tobit Model

    Petersen and Waldman (1981) analyzed the volume of short interest in a cross section of common stocks. The regressors included a measure of the market component of heterogeneous expectations as measured by the firm's BETA coefficient; a company-specific measure of heterogeneous expectations, NONMARKET; the NUMBER of analysts making earnings forecasts for the company; the number of common shares to be issued for the acquisition of another firm, MERGER; and a dummy variable for the existence of OPTIONs. They report the results listed in Table 22.2 for a model in which the variance is assumed to be of the form \(\sigma_i^2 = \exp(x_i'\alpha)\). The values in parentheses are the ratio of the coefficient to the estimated asymptotic standard error.

    The effect of heteroscedasticity on the estimates is extremely large. We do note, however, a common misconception in the literature. The change in the coefficients is often misleading. The marginal effects in the heteroscedasticity model will generally be very similar to those computed from the model which assumes homoscedasticity. (The calculation is pursued in the exercises.)

    A test of the hypothesis that \(\alpha = 0\) (except for the constant term) can be based on the likelihood ratio statistic. For these results, the statistic is \(-2[-547.3 - (-466.27)] = 162.06\). This statistic has a limiting chi-squared distribution with five degrees of freedom. The sample value exceeds the critical value in the table of 11.07, so the hypothesis can be rejected.
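The likelihood ratio calculation in Example 22.5 can be reproduced from the two log-likelihood values reported in the text. This is a minimal sketch; the function name is an assumption, and the critical value 11.07 is taken from the text rather than computed.

```python
def likelihood_ratio(loglik_restricted, loglik_unrestricted):
    """LR statistic: -2 times (restricted minus unrestricted log-likelihood)."""
    return -2.0 * (loglik_restricted - loglik_unrestricted)


loglik_homoscedastic = -547.30     # restricted model, from the text
loglik_heteroscedastic = -466.27   # unrestricted model, from the text
critical_value = 11.07             # chi-squared, 5 degrees of freedom, as quoted

lr = likelihood_ratio(loglik_homoscedastic, loglik_heteroscedastic)
reject_homoscedasticity = lr > critical_value
print(round(lr, 2), reject_homoscedasticity)
```

The restricted log-likelihood can never exceed the unrestricted one, so the statistic is nonnegative by construction.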
    [12] This concept is explored further in Greene (1980b), Goldberger (1981), and Cheung and Goldberger (1984).
    [13] Two symposia that contain numerous results on these subjects are Blundell (1987) and Duncan (1986b). An application that explores these two issues in detail is Melenberg and van Soest (1996).


    TABLE 22.2 Estimates of a Tobit Model (Standard errors in parentheses)

                      Homoscedastic              Heteroscedastic
                          beta                beta              alpha
    Constant         -18.28 (5.10)       -4.11 (3.28)      -0.47 (0.60)
    Beta              10.97 (3.61)        2.22 (2.00)       1.20 (1.81)
    Nonmarket          0.65 (7.41)        0.12 (1.90)       0.08 (7.55)
    Number             0.75 (5.74)        0.33 (4.50)       0.15 (4.58)
    Merger             0.50 (5.90)        0.24 (3.00)       0.06 (4.17)
    Option             2.56 (1.51)        2.96 (2.99)       0.83 (1.70)
    Log L           -547.30                     -466.27
    Sample size      200                         200

    In the preceding example, we carried out a likelihood ratio test against the hypothesis of homoscedasticity. It would be desirable to be able to carry out the test without having to estimate the unrestricted model. A Lagrange multiplier test can be used for that purpose. Consider the heteroscedastic tobit model in which we specify that

    \[ \sigma_i^2 = \sigma^2 e^{\alpha' w_i}. \tag{22-15} \]

    This model is a fairly general specification that includes many familiar ones as special cases. The null hypothesis of homoscedasticity is \(\alpha = 0\). (We used this specification in the probit model in Section 19.4.1.b and in the linear regression model in Section 17.7.1.) Using the BHHH estimator of the Hessian as usual, we can produce a Lagrange multiplier statistic as follows: Let \(z_i = 1\) if \(y_i\) is positive and 0 otherwise,

    \[ a_i = z_i\left(\frac{\varepsilon_i}{\sigma^2}\right) + (1 - z_i)\left(\frac{(-1)\lambda_i}{\sigma}\right), \]
    \[ b_i = z_i\left(\frac{\varepsilon_i^2/\sigma^2 - 1}{2\sigma^2}\right) + (1 - z_i)\left(\frac{(x_i'\beta)\lambda_i}{2\sigma^3}\right), \tag{22-16} \]
    \[ \lambda_i = \frac{\phi(x_i'\beta/\sigma)}{1 - \Phi(x_i'\beta/\sigma)}. \]

    The data vector is \(g_i = [a_i x_i', b_i, b_i w_i']'\). The sums are taken over all observations, and all functions involving unknown parameters (\(\varepsilon\), \(\phi\), \(x_i'\beta\), \(\lambda_i\), etc.) are evaluated at the restricted (homoscedastic) maximum likelihood estimates. Then,

    \[ LM = i'G[G'G]^{-1}G'i = nR^2 \tag{22-17} \]

    in the regression of a column of ones on the \(K + 1 + P\) derivatives of the log-likelihood function for the model with multiplicative heteroscedasticity, evaluated at the estimates from the restricted model. (If there were no limit observations, then it would reduce to the Breusch-Pagan statistic discussed in Section 11.4.3.) Given the maximum likelihood estimates of the tobit model coefficients, it is quite simple to compute. The statistic has a limiting chi-squared distribution with degrees of freedom equal to the number of variables in \(w_i\).


    CHAPTER 22 Limited Dependent Variable and Duration Models 22 3 4 b Misspeci cation of Prob y 0

    In an early study in this literature Cragg 1971 proposed a somewhat more general model in which the probability of a limit observation is independent of the regression model for the nonlimit data One can imagine for instance the decision on whether or not to purchase a car as being different from the decision on how much to spend on the car having decided to buy one A related problem raised by Lin and Schmidt 1984 is that in the tobit model a variable that increases the probability of an observation being a nonlimit observation also increases the mean of the variable They cite as an example loss due to re in buildings Older buildings might be more likely to have res so that Prob yi 0 agei 0 but because of the greater value of newer buildings older ones incur smaller losses when they do have res so that E yi yi 0 agei 0 This fact would require the coef cient on age to have different signs in the two functions which is impossible in the tobit model because they are the same coef cient A more general model that accommodates these objections is as follows 1 Decision equation Prob yi 0 2 xi xi zi 1 if yi 0 zi 0 if yi 0 Prob yi 0 1 22 18

    Regression equation for nonlimit observations E yi zi 1 xi i according to Theorem 22 2

    This model is a combination of the truncated regression model of Section 22 2 and the univariate probit model of Section 21 3 which suggests a method of analyzing it The tobit model of this section arises if equals The parameters of the regression equation can be estimated independently using the truncated regression model of Section 22 2 A recent application is Melenberg and van Soest 1996 Fin and Schmidt 1984 considered testing the restriction of the tobit model Based only on the tobit model they devised a Lagrange multiplier statistic that although a bit cumbersome algebraically can be computed without great dif culty If one is able to estimate the truncated regression model the tobit model and the probit model separately then there is a simpler way to test the hypothesis The tobit log likelihood is the sum of the log likelihoods for the truncated regression and probit models To show this result add and subtract yi 1 ln xi in 22 13 This produces the loglikelihood for the truncated regression model plus 21 20 for the probit model 14 Therefore a likelihood ratio statistic can be computed using 2 ln LT ln LP ln LTR where LT likelihood for the tobit model in 22 13 with the same coef cients LP likelihood for the probit model in 19 20 t separately LTR likelihood for the truncated regression model t separately
    14 The likelihood function for the truncated regression model is considered in the exercises.
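    The likelihood ratio computation described above can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and the three log-likelihood values are the ones quoted for the restricted model in the application of Section 22.3.6, entered here as negative numbers.

```python
def cragg_lr_statistic(lnl_tobit, lnl_probit, lnl_truncreg):
    """Likelihood ratio statistic -2[ln L_T - (ln L_P + ln L_TR)].

    lnl_tobit    : log likelihood of the (restricted) tobit model
    lnl_probit   : log likelihood of the probit model, fit separately
    lnl_truncreg : log likelihood of the truncated regression, fit separately
    """
    return -2.0 * (lnl_tobit - (lnl_probit + lnl_truncreg))


# Values taken from the application in Section 22.3.6 (assumed negative).
lr = cragg_lr_statistic(-705.5762, -307.2955, -392.7103)
print(round(lr, 3))
```

    The statistic is then compared with a chi-squared critical value whose degrees of freedom equal the number of restrictions the tobit model imposes on the two-part model.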

    22.3.4.c Nonnormality


    Nonnormality is an especially difficult problem in this setting. It has been shown that if the underlying disturbances are not normally distributed, then the estimator based on (22-13) is inconsistent. Research is ongoing both on alternative estimators and on methods for testing for this type of misspecification.15

    One approach to the estimation is to use an alternative distribution. Kalbfleisch and Prentice (1980) present a unifying treatment that includes several distributions such as the exponential, lognormal, and Weibull. (Their primary focus is on survival analysis in a medical statistics setting, which is an interesting convergence of the techniques in very different disciplines.) Of course, assuming some other specific distribution does not necessarily solve the problem and may make it worse. A preferable alternative would be to devise an estimator that is robust to changes in the distribution. Powell's (1981, 1984) least absolute deviations (LAD) estimator appears to offer some promise.16 The main drawback to its use is its computational complexity. An extensive application of the LAD estimator is Melenberg and van Soest (1996). Although estimation in the nonnormal case is relatively difficult, testing for this failure of the model is worthwhile to assess the estimates obtained by the conventional methods. Among the tests that have been developed are Hausman tests, Lagrange multiplier tests [Bera and Jarque (1981, 1982), Bera, Jarque, and Lee (1982)], and conditional moment tests [Nelson (1981)]. The conditional moment tests are described in the next section.

    To employ a Hausman test, we require an estimator that is consistent and efficient under the null hypothesis but inconsistent under the alternative (the tobit estimator with normality) and an estimator that is consistent under both hypotheses but inefficient under the null hypothesis. Thus, we will require a robust estimator of $\beta$, which restores the difficulties of the previous paragraph. Recent applications [e.g., Melenberg and van Soest (1996)] have used the Hausman test to compare the tobit/normal estimator with Powell's consistent, but inefficient (robust), LAD estimator.

    Another approach to testing is to embed the normal distribution in some other distribution and then use an LM test for the normal specification. Chesher and Irish (1987) have devised an LM test of normality in the tobit model based on generalized residuals. In many models, including the tobit model, the generalized residuals can be computed as the derivatives of the log densities with respect to the constant term, so
    $$e_i = \frac{1}{\sigma^2}\left[z_i(y_i - x_i'\beta) - (1 - z_i)\sigma\lambda_i\right],$$
    where $z_i$ is defined in (22-18) and $\lambda_i$ is defined in (22-16). This residual is an estimate of $\varepsilon_i$ that accounts for the censoring in the distribution. By construction, $E[e_i \mid x_i] = 0$, and if the model actually does contain a constant term, then $\sum_{i=1}^{n} e_i = 0$; this is the first of the necessary conditions for the MLE. The test is then carried out by regressing a column of 1s on $d_i = [e_i x_i', b_i, e_i^3, e_i^4 - 3e_i^4]'$, where $b_i$ is defined in (22-16). Note that the first $K + 1$ variables in $d_i$ are the derivatives of the tobit log likelihood. Let $D$ be the $n \times (K + 3)$ matrix with $i$th row equal to $d_i'$. Then $D = [G, M]$, where the $K + 1$ columns of $G$ are the derivatives of the tobit log likelihood and the two columns in $M$ are the last two variables in $d_i$. Then the chi-squared statistic is $nR^2$; that is,
    $$LM = i'D(D'D)^{-1}D'i.$$
    The necessary conditions that define the MLE are $i'G = 0$, so the first $K + 1$ elements of $i'D$ are zero. Using (B-66), then, the LM statistic becomes
    $$LM = i'M[M'M - M'G(G'G)^{-1}G'M]^{-1}M'i,$$
    which is a chi-squared statistic with two degrees of freedom. Note the similarity to (22-17), where a test for homoscedasticity is carried out by the same method. As emerges so often in this framework, the test of the distribution actually focuses on the skewness and kurtosis of the residuals.

    15 See Duncan (1983, 1986b), Goldberger (1983), Pagan and Vella (1989), Lee (1996), and Fernandez (1986). We will examine one of the tests more closely in the following section.
    16 See Duncan (1986a, b) for a symposium on the subject and Amemiya (1984). Additional references are Newey, Powell, and Walker (1990); Lee (1996); and Robinson (1988).
    22.3.4.d Conditional Moment Tests

    Pagan and Vella (1989) [see, as well, Ruud (1984)] describe a set of conditional moment tests of the specification of the tobit model.17 We will consider three:

    1. The variables z have not been erroneously omitted from the model.
    2. The disturbances in the model are homoscedastic.
    3. The underlying disturbances in the model are normally distributed.

    For the third of these, we will take the standard approach of examining the third and fourth moments, which for the normal distribution are 0 and $3\sigma^4$, respectively. The underlying motivation for the tests can be made with reference to the regression part of the tobit model in (22-11),
    $$y_i^* = x_i'\beta + \varepsilon_i.$$
    Neglecting for the moment that we only observe $y_i^*$ subject to the censoring, the three hypotheses imply the following expectations:

    1. $E\{z_i[y_i^* - x_i'\beta]\} = 0$,
    2. $E\{z_i[(y_i^* - x_i'\beta)^2 - \sigma^2]\} = 0$,
    3. $E[(y_i^* - x_i'\beta)^3] = 0$ and $E[(y_i^* - x_i'\beta)^4 - 3\sigma^4] = 0$.

    In (1), the variables in $z_i$ would be one or more variables not already in the model. We are interested in assessing whether or not they should be. In (2), presumably, although not necessarily, $z_i$ would be the regressors in the model. For the present, we will assume that $y_i^*$ is observed directly, without censoring. That is, we will construct the CM tests for the classical linear regression model. Then we will go back to the necessary step and make the modification needed to account for the censoring of the dependent variable.

    17 Their survey is quite general and includes other models, specifications, and estimation methods. We will consider only the simplest cases here. The reader is referred to their paper for formal presentation of these results. Developing specification tests for the tobit model has been a popular enterprise. A sampling of the received literature includes Nelson (1981); Bera, Jarque, and Lee (1982); Chesher and Irish (1987); Chesher, Lancaster, and Irish (1985); Gourieroux et al. (1984, 1987); Newey (1986); Rivers and Vuong (1988); Horowitz and Neumann (1989); and Pagan and Vella (1989). Newey (1985a, b) are useful references on the general subject of conditional moment testing. More general treatments of specification testing are Godfrey (1988) and Ruud (1984).

    Conditional moment tests are described in Section 17.6.4. To review, for a model estimated by maximum likelihood, the statistic is
    $$C = i'M[M'M - M'G(G'G)^{-1}G'M]^{-1}M'i,$$
    where the rows of $G$ are the terms in the gradient of the log-likelihood function, $(G'G)^{-1}$ is the BHHH estimator of the asymptotic covariance matrix of the MLE of the model parameters, and the rows of $M$ are the individual terms in the sample moment conditions. Note that this construction is the same as the LM statistic just discussed. The difference is in how the rows of $M$ are constructed.

    For a regression model without censoring, the sample counterparts to the moment restrictions in (1) to (3) would be
    $$r_1 = \frac{1}{n}\sum_{i=1}^{n} z_i e_i, \quad \text{where } e_i = y_i - x_i'b \text{ and } b = (X'X)^{-1}X'y,$$
    $$r_2 = \frac{1}{n}\sum_{i=1}^{n} z_i(e_i^2 - s^2), \quad \text{where } s^2 = \frac{e'e}{n},$$
    $$r_3 = \frac{1}{n}\sum_{i=1}^{n} \begin{bmatrix} e_i^3 \\ e_i^4 - 3s^4 \end{bmatrix}.$$

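    For the positive observations, we observe $y^*$, so the observations in $M$ are the same as for the classical regression model; that is,

    1. $m_i = z_i(y_i - x_i'\beta)$,
    2. $m_i = z_i[(y_i - x_i'\beta)^2 - \sigma^2]$,
    3. $m_i = [(y_i - x_i'\beta)^3, \; (y_i - x_i'\beta)^4 - 3\sigma^4]'$.

    For the limit observations, these observations are replaced with their expected values, conditioned on $y = 0$, which means that $y^* \le 0$ or $\varepsilon_i \le -x_i'\beta$. Let $q_i = (x_i'\beta)/\sigma$ and $\lambda_i = \phi_i/(1 - \Phi_i)$. Then, from (22-2), (22-3b), and (22-4),

    1. $m_i = z_i E[(y_i^* - x_i'\beta) \mid y = 0] = z_i[(x_i'\beta - \sigma\lambda_i) - x_i'\beta] = z_i(-\sigma\lambda_i)$,
    2. $m_i = z_i E[(y_i^* - x_i'\beta)^2 - \sigma^2 \mid y = 0] = z_i[\sigma^2(1 + q_i\lambda_i) - \sigma^2] = z_i(\sigma^2 q_i\lambda_i)$.

    ($E[\varepsilon_i^2 \mid y = 0, x_i]$ is not the variance, since the mean is not zero.) For the third and fourth moments, we simply reproduce Pagan and Vella's results [see also Greene (1995a, pp. 618-619)]:

    3. $m_i = \sigma^3\lambda_i[-(2 + q_i^2), \; \sigma q_i(3 + q_i^2)]'$.

    These three items are the remaining terms needed to compute $M$.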
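    As a rough check on the limit-observation terms, the closed forms can be compared with direct numerical integration of the truncated normal moments. The sketch below is an illustration only: the function names are hypothetical, the factor $z_i$ is set to 1, and the integration rule (midpoint, with the lower tail cut at 12 standard deviations) is an arbitrary choice.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def limit_moments(q, sigma):
    """Closed-form moment terms for a limit observation, with q = x'beta/sigma."""
    lam = phi(q) / (1.0 - Phi(q))
    m1 = -sigma * lam                              # E[eps | y = 0]
    m2 = sigma ** 2 * q * lam                      # E[eps^2 - sigma^2 | y = 0]
    m3 = -sigma ** 3 * lam * (2.0 + q * q)         # E[eps^3 | y = 0]
    m4 = sigma ** 4 * lam * q * (3.0 + q * q)      # E[eps^4 - 3 sigma^4 | y = 0]
    return m1, m2, m3, m4


def limit_moments_numeric(q, sigma, n=100000, lower=-12.0):
    """Same terms by midpoint integration over u = eps/sigma in (lower, -q)."""
    h = (-q - lower) / n
    mass = 0.0
    sums = [0.0, 0.0, 0.0, 0.0]
    for k in range(n):
        u = lower + (k + 0.5) * h
        w = phi(u) * h
        mass += w
        for p in range(4):
            sums[p] += w * u ** (p + 1)
    e1, e2, e3, e4 = (sigma ** (p + 1) * sums[p] / mass for p in range(4))
    return e1, e2 - sigma ** 2, e3, e4 - 3.0 * sigma ** 4


closed = limit_moments(0.5, 2.0)
numeric = limit_moments_numeric(0.5, 2.0)
print(closed)
print(numeric)
```

    The two tuples should agree to several decimal places for moderate values of $q_i$; in an actual test each term would be multiplied by $z_i$ where the hypothesis calls for it.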
    22.3.5 CENSORING AND TRUNCATION IN MODELS FOR COUNTS

    Truncation and censoring are relatively common in applications of models for counts (see Section 21.9). Truncation often arises as a consequence of discarding what appear to be unusable data, such as the zero values in survey data on the number of uses of recreation facilities [Shaw (1988) and Bockstael et al. (1990)]. The zero values in this setting might represent a discrete decision not to visit the site, which is a qualitatively different decision from the positive number for someone who had decided to make at least one visit. In such a case, it might make sense to confine attention to the nonzero observations, thereby truncating the distribution. Censoring, in contrast, is often employed to make survey data more convenient to gather and analyze. For example, survey data on access to medical facilities might ask, "How many trips to the doctor did you make in the last year?" The responses might be 0, 1, 2, 3 or more.

    Models with these characteristics can be handled within the Poisson and negative binomial regression frameworks by using the laws of probability to modify the likelihood. For example, in the censored data case,
    $$P_i(j) = \text{Prob}[y_i = j] = \frac{e^{-\lambda_i}\lambda_i^j}{j!}, \quad j = 0, 1, 2,$$
    $$P_i(3) = \text{Prob}[y_i \ge 3] = 1 - [\text{Prob}(y_i = 0) + \text{Prob}(y_i = 1) + \text{Prob}(y_i = 2)].$$
    The probabilities in the model with truncation above zero would be
    $$P_i(j) = \text{Prob}[y_i = j \mid y_i > 0] = \frac{e^{-\lambda_i}\lambda_i^j}{j!\,[1 - P_i(0)]} = \frac{e^{-\lambda_i}\lambda_i^j}{j!\,[1 - e^{-\lambda_i}]}, \quad j = 1, 2, \ldots.$$

    These models are not appreciably more complicated to analyze than the basic Poisson or negative binomial models. [See Terza (1985b), Mullahy (1986), Shaw (1988), Grogger and Carson (1991), Greene (1998), Lambert (1992), and Winkelmann (1997).] They do, however, bring substantive changes to the familiar characteristics of the models. For example, the conditional means are no longer $\lambda_i$; in the censoring case,
    $$E[y_i \mid x_i] = \lambda_i - \sum_{j=3}^{\infty}(j - 3)P_i(j) < \lambda_i.$$
    Marginal effects are changed as well. Recall that our earlier result for the count data models was $\partial E[y_i \mid x_i]/\partial x_i = \lambda_i\beta$. With censoring or truncation, it is straightforward in general to show that $\partial E[y_i \mid x_i]/\partial x_i = \delta_i\beta$, but the new scale factor need not be smaller than $\lambda_i$.
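    The modified probabilities and the censored mean can be sketched directly from the formulas above. This is a minimal illustration, not a full estimator: the function names are hypothetical, $\lambda_i$ is passed in as a plain positive number rather than computed from $x_i'\beta$, and the infinite sum is cut off at a fixed number of terms.

```python
import math


def poisson_pmf(j, lam):
    """Poisson probability exp(-lam) lam^j / j!, computed on the log scale."""
    return math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))


def censored_probs(lam, limit=3):
    """Probabilities for outcomes 0, 1, ..., limit-1 and 'limit or more'."""
    probs = [poisson_pmf(j, lam) for j in range(limit)]
    probs.append(1.0 - sum(probs))
    return probs


def truncated_pmf(j, lam):
    """Prob[y = j | y > 0] for the Poisson model truncated at zero."""
    if j < 1:
        raise ValueError("the truncated distribution has support j = 1, 2, ...")
    return poisson_pmf(j, lam) / (1.0 - math.exp(-lam))


def censored_mean(lam, limit=3, terms=200):
    """E[y] under censoring: lam - sum_{j >= limit} (j - limit) P(j)."""
    tail = sum((j - limit) * poisson_pmf(j, lam) for j in range(limit, terms))
    return lam - tail


lam = 1.5
probs = censored_probs(lam)
direct_mean = sum(j * p for j, p in enumerate(probs))
print(probs, censored_mean(lam), direct_mean)
```

    The censored mean computed from the formula agrees with the mean computed directly from the four censored probabilities, and it lies below $\lambda_i$. The zero-truncated mean, $\lambda_i/(1 - e^{-\lambda_i})$, lies above $\lambda_i$, which illustrates why the new scale factor need not be smaller.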
    22.3.6 APPLICATION: CENSORING IN THE TOBIT AND POISSON REGRESSION MODELS

    In 1969, the popular magazine Psychology Today published a 101-question survey on sex and asked its readers to mail in their answers. The results of the survey were discussed in the July 1970 issue. From the approximately 2,000 replies that were collected in electronic form (of about 20,000 received), Professor Ray Fair (1978) extracted a sample of 601 observations on men and women then currently married for the first time and analyzed their responses to a question about extramarital affairs. He used the tobit model as a platform. Fair's analysis in this frequently cited study suggests several interesting econometric questions. [In addition, his 1977 companion paper in Econometrica on estimation of the tobit model contributed to the development of the EM algorithm, which was published by and is usually associated with Dempster, Laird, and Rubin (1977).]

    As noted, Fair used the tobit model as his estimation framework for this study. The nonexperimental nature of the data (which can be downloaded from the Internet at http://fairmodel.econ.yale.edu/rayfair/work.ss.htm) provides a fine laboratory case that we can use to examine the relationships among the tobit, truncated regression, and probit models. In addition, as we will explore below, although the tobit model seems to be a natural choice for the model for these data, a closer look suggests that the models for counts we have examined at several points earlier might be yet a better choice. Finally, the preponderance of zeros in the data that initially motivated the tobit model suggests that even the standard Poisson model, although an improvement, might still be inadequate. In this example, we will reestimate Fair's original model and then apply some of the specification tests and modified models for count data as alternatives.

    The study was based on 601 observations on the following variables (full details on data coding are given in the data file and Appendix Table F22.2):

    y = number of affairs in the past year: 0, 1, 2, 3, 4-10 coded as 7, "monthly, weekly, or daily" coded as 12. Sample mean = 1.46. Frequencies = (451, 34, 17, 19, 42, 38).
    z1 = sex: 0 for female, 1 for male. Sample mean = 0.476.
    z2 = age. Sample mean = 32.5.
    z3 = number of years married. Sample mean = 8.18.
    z4 = children: 0 = no, 1 = yes. Sample mean = 0.715.
    z5 = religiousness: 1 = anti, ..., 5 = very. Sample mean = 3.12.
    z6 = education, years: 9 = grade school, 12 = high school, ..., 20 = Ph.D. or other. Sample mean = 16.2.
    z7 = occupation, "Hollingshead scale," 1-7. Sample mean = 4.19.
    z8 = self-rating of marriage: 1 = very unhappy, ..., 5 = very happy. Sample mean = 3.93.

    The tobit model was fit to y using a constant term and all eight variables. A restricted model was fit by excluding z1, z4, and z6, none of which was individually statistically significant in the model. We are able to match exactly Fair's results for both equations. The log-likelihood functions for the full and restricted models are -704.7311 and -705.5762. The chi-squared statistic for testing the hypothesis that the three coefficients are zero is twice the difference, 1.6902. The critical value from the chi-squared distribution with three degrees of freedom is 7.81, so the hypothesis that the coefficients on these three variables are all zero is not rejected. The Wald and Lagrange multiplier statistics are likewise small, 6.59 and 1.681. Based on these results, we will continue the analysis using the restricted set of variables, Z = (1, z2, z3, z5, z7, z8). Our interest is solely in the numerical results of different modeling approaches. Readers may draw their own conclusions and interpretations from the estimates.

    Table 22.3 presents parameter estimates based on Fair's specification of the normal distribution. The inconsistent least squares estimates appear at the left as a basis for comparison. The maximum likelihood tobit estimates appear next. The sample is heavily dominated by observations with y = 0 (451 of 601, or 75 percent), so the marginal effects are very different from the coefficients, by a multiple of roughly 0.766. The scale factor is computed using the results of Theorem 22.4 for left censoring at zero and the upper limit of $+\infty$, with all variables evaluated at the sample means and the parameters equal

    TABLE 22.3  Model Estimates Based on the Normal Distribution (Standard Errors in Parentheses)

                                       Tobit                                                                      Truncated Regression
    Variable   Least Squares (1)   Estimate (2)     Marginal Effect (3)   Scaled by 1/σ (4)   Probit Estimate (5)   Estimate (6)      Marginal Effect (7)
    Constant   5.61 (0.797)        8.18 (2.74)      -                     0.991 (0.336)       0.997 (0.361)         8.32 (3.96)       -
    z2         0.0504 (0.0221)     0.179 (0.079)    0.042 (0.184)         0.022 (0.010)       0.022 (0.102)         0.0841 (0.119)    0.0407 (0.0578)
    z3         0.162 (0.0369)      0.554 (0.135)    0.130 (0.0312)        0.0672 (0.0161)     0.0599 (0.0171)       0.560 (0.219)     0.271 (0.106)
    z5         0.476 (0.111)       1.69 (0.404)     0.394 (0.093)         0.2004 (0.484)      0.184 (0.0515)        1.502 (0.617)     0.728 (0.299)
    z7         0.106 (0.0711)      0.326 (0.254)    0.0762 (0.0595)       0.0395 (0.0308)     0.0375 (0.0328)       0.189 (0.377)     0.0916 (0.182)
    z8         0.712 (0.118)       2.29 (0.408)     0.534 (0.0949)        0.277 (0.0483)      0.273 (0.0525)        1.35 (0.565)      0.653 (0.273)
    σ          3.09                8.25                                                                             5.53
    log L                          -705.5762                                                  -307.2955             -392.7103

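    to the maximum likelihood estimates:
    $$\text{scale} = \Phi\left[\frac{+\infty - \bar{x}'\hat{\beta}_{ML}}{\hat{\sigma}_{ML}}\right] - \Phi\left[\frac{0 - \bar{x}'\hat{\beta}_{ML}}{\hat{\sigma}_{ML}}\right] = 1 - \Phi\left[\frac{0 - \bar{x}'\hat{\beta}_{ML}}{\hat{\sigma}_{ML}}\right] = \Phi\left[\frac{\bar{x}'\hat{\beta}_{ML}}{\hat{\sigma}_{ML}}\right] = 0.234.$$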
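    The scale factor is simply the probability that the latent variable falls between the two censoring limits, evaluated at a chosen index value. A minimal sketch follows; the function name and the numerical inputs are hypothetical and are not the estimates from Table 22.3.

```python
import math


def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def tobit_scale(index, sigma, lower=0.0, upper=math.inf):
    """Prob(lower < y* < upper) when y* ~ N(index, sigma^2).

    Multiplying the tobit coefficients by this factor gives the marginal
    effects on E[y | x] for a model censored at `lower` and `upper`.
    """
    return Phi((upper - index) / sigma) - Phi((lower - index) / sigma)


# Hypothetical index value x'beta and sigma, for illustration only.
index, sigma = -6.0, 8.0
left_only = tobit_scale(index, sigma)               # censored at 0 only
both_tails = tobit_scale(index, sigma, upper=4.0)   # censored at 0 and 4
print(left_only, both_tails)
```

    With no upper limit the factor collapses to $\Phi(\bar{x}'\hat{\beta}/\hat{\sigma})$, as in the expression above; adding an upper censoring point can only lower it.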

    These estimates are shown in the third column. As expected, they resemble the least squares estimates, although not enough that one would be content to use OLS for estimation. The fifth column in Table 22.3 gives estimates of the probit model estimated for the dependent variable $q_i = 0$ if $y_i = 0$, $q_i = 1$ if $y_i > 0$. If the specification of the tobit model is correct, then the probit estimators should be consistent for $(1/\sigma)\beta$ from the tobit model. These estimates, with standard errors computed using the delta method, are shown in column 4. The results are surprisingly close, especially given the results of the specification test considered later. Finally, columns 6 and 7 give the estimates for the truncated regression model that applies to the 150 nonlimit observations if the specification of the model is correct. Here the results seem a bit less consistent.

    Several specification tests were suggested for this model. The Cragg/Greene test for appropriate specification of $\text{Prob}[y_i = 0]$ is given in Section 22.3.4.b. This test is easily carried out using the log-likelihood values in the table. The chi-squared statistic, which has seven degrees of freedom, is $-2\{-705.5762 - [-307.2955 + (-392.7103)]\} = 11.141$, which is smaller than the critical value of 14.067. We do not reject the tobit specification: the decision of whether or not is not different from the decision of how many, given "whether." We now turn to the normality tests. We emphasize that these tests are nonconstructive tests of the skewness and kurtosis of the distribution of $\varepsilon$. A fortiori, if we do reject the hypothesis that these values are 0.0 and 3.0, respectively, then we can reject normality. But that does not suggest what to do next. We turn to that issue later. The Chesher-Irish and Pagan-Vella chi-squared statistics are 562.218 and 22.314, respectively. The critical value is 5.99, so on the basis of both of these values, the hypothesis of normality is rejected. Thus, the distributional framework, though not the probability model, is rejected by these tests.

    Before leaving the tobit model, we consider one additional aspect of the original specification. The values above 4 in the observed data are not true observations on the response; 7 is an estimate of the mean of observations that fall in the range 4 to 10, whereas 12 was chosen more or less arbitrarily for observations that were greater than 10. These observations represent 80 of the 601 observations, or about 13 percent of the sample. To some extent, this coding scheme might be driving the results. [This point was not overlooked in the original study: "a linear specification was used for the estimated equation, and it did not seem reasonable in this case, given the range of explanatory variables, to have a dependent variable that ranged from, say, 0 to 365" (Fair (1978), p. 55).] The tobit model allows for censoring in both tails of the distribution. Ignoring the results of the specification tests for the moment, we will examine a doubly censored regression by recoding all observations that take the values 7 or 12 as 4. The model is thus
    $$y^* = x'\beta + \varepsilon,$$
    $$y = 0 \text{ if } y^* \le 0, \qquad y = y^* \text{ if } 0 < y^* < 4, \qquad y = 4 \text{ if } y^* \ge 4.$$
    The log likelihood is built up from three sets of terms:
    $$\ln L = \sum_{y=0} \ln \Phi\left[\frac{0 - x_i'\beta}{\sigma}\right] + \sum_{0<y<4} \ln\left[\frac{1}{\sigma}\,\phi\left(\frac{y_i - x_i'\beta}{\sigma}\right)\right] + \sum_{y=4} \ln\left\{1 - \Phi\left[\frac{4 - x_i'\beta}{\sigma}\right]\right\}.$$

    Maximum likelihood estimates of the parameters of this model based on the doubly censored data appear in Table 22.4. The effect on the coefficient estimates is relatively minor, but the effect on the estimates of the marginal effects is very large; they are reduced by about 50 percent, which makes sense. With the original data, increases in the index were associated with increases in y that could be from 3 to 7 or from 3 to 12. But with the data treated as censored, y cannot increase past 4. Thus, the range of variation is greatly reduced. The numerical results are also suggestive. Recall that the scale factor for the singly censored data was 0.2338. For the doubly censored variable, this factor is $\Phi[(4 - \bar{x}'\hat{\beta})/\hat{\sigma}] - \Phi[(0 - \bar{x}'\hat{\beta})/\hat{\sigma}] = 0.8930 - 0.7701 = 0.1229$. The regression model
    TABLE 22.4  Estimates of a Doubly Censored Tobit Model

                        Left Censored at 0 Only                          Censored at Both 0 and 4
    Variable            Estimate   Standard Error   Marginal Effect      Estimate   Standard Error   Marginal Effect
    Constant            8.18       0.797            -                    7.90       2.804            -
    z2                  0.179      0.079            0.0420               0.178      0.080            0.0218
    z3                  0.554      0.135            0.130                0.532      0.141            0.0654
    z5                  1.69       0.404            0.394                1.62       0.424            0.199
    z7                  0.326      0.254            0.0762               0.324      0.254            0.0399
    z8                  2.29       0.408            0.534                2.21       0.459            0.271
    σ                   8.25                                             7.94
    Prob(nonlimit)                                  0.2338                                           0.1229
    E[y | x = E[x]]     1.126                                            0.226

    for $y^*$ has not changed much, but the effect now is to assign the upper tail area to the censored region, whereas before it was in the uncensored region. The effect, then, is to reduce the scale roughly by this 0.107, from 0.234 to about 0.123.

    By construction, the tobit model should only be viewed as an approximation for these data. The dependent variable is a count, not a continuous measurement. (Thus, the testing results obtained earlier are not surprising.) The Poisson regression model, or perhaps one of the many variants of it, should be a preferable modeling framework. Table 22.5 presents estimates of the Poisson and negative binomial regression model. There is ample evidence of overdispersion in these data; the t ratio on the estimated overdispersion parameter is 7.014/0.945 = 7.42, which is strongly suggestive. The large absolute value of the coefficient is likewise suggestive.

    Before proceeding to a model that specifically accounts for overdispersion, we can find a candidate for its source, at least to some degree. As discussed earlier, responses of 7 and 12 do not represent the actual counts. It is unclear what the effect of the first recoding would be, since it might well be the mean of the observations in this group. But the second is clearly a censored observation. To remove both of these effects, we have recoded both the values 7 and 12 as 4 and treated this observation (appropriately) as a censored observation, with 4 denoting "4 or more." As shown in the third and fourth sets of results in Table 22.5, the effect of this treatment of the data is greatly to reduce the measured effects, which is the same effect we observed for the tobit model. Although this step does remove a deficiency in the data, it does not remove the overdispersion; at this point, the negative binomial model is still the preferred specification.

    The tobit model remains the standard approach to modeling a dependent variable that displays a large cluster of limit values, usually zeros. But in these data, it is clear that
    TABLE 22.5  Model Estimates Based on the Poisson Distribution

                   Poisson Regression                               Negative Binomial Regression
    Variable       Estimate     Standard Error   Marginal Effect    Estimate     Standard Error   Marginal Effect

    Based on Uncensored Poisson Distribution
    Constant       2.53         0.197            -                  2.19         0.859            -
    z2             0.0322       0.00585          0.0470             0.0262       0.0180           0.00393
    z3             0.116        0.00991          0.168              0.0848       0.0400           0.127
    z5             0.354        0.0309           0.515              0.422        0.171            0.632
    z7             0.0798       0.0194           0.116              0.0604       0.0908           0.0906
    z8             0.409        0.0274           0.0596             0.431        0.167            0.646
    α                                                               7.01         0.945
    log L          -1427.037                                        -728.2441

    Based on Poisson Distribution Right Censored at y = 4
    Constant       1.90         0.283            -                  4.79         1.16             -
    z2             0.0328       0.00838          0.0235             0.0166       0.0250           0.00428
    z3             0.105        0.0140           0.0754             0.174        0.0568           0.045
    z5             0.323        0.0437           0.232              0.723        0.198            0.186
    z7             0.0798       0.0275           0.0521             0.0900       0.116            0.0232
    z8             0.390        0.0391           0.279              0.854        0.216            0.220
    α                                                               9.39         1.36
    log L          -747.7541                                        -482.0505

    FIGURE 22.3  Histogram for Model Predictions. (Panel title: Histogram for Variable YC; horizontal axis: YC, 0 to 4; vertical axis: Frequency, 0 to 500.)

    the zero value represents something other than a censoring; it is the outcome of a discrete decision. Thus, for this reason and based on the preceding results, it seems appropriate to turn to a different model for this dependent variable. The Poisson and negative binomial models look like an improvement, but there remains a noteworthy problem. Figure 22.3 shows a histogram of the actual values (solid dark bars) and predicted values from the negative binomial model estimated with the censored data (lighter bars). Predictions from the latter model are the integer values of $E[y \mid x] = \exp(\beta'x)$. As in the actual data, values larger than 4 are censored to 4. Evidently, the negative binomial model predicts the data fairly poorly. In fact, it is not hard to see why. The source of the overdispersion in the data is not the extreme values on the right of the distribution; it is the very large number of zeros on the left.

    There are a large variety of models and permutations that one might turn to at this point. We will conclude with just one of these, Lambert's (1992) zero-inflated Poisson (ZIP) model with a logit "splitting" model, discussed in Section 21.9.6 and Example 21.12. The doubly censored count is the dependent variable in this model. [Mullahy's (1986) hurdle model is an alternative that might be considered. The difference between these two is in the interpretation of the zero observations. In the ZIP formulation, the zero observations would be a mixture of "never" and "not in the last year," whereas the hurdle model assumes two distinct decisions, "whether or not" and "how many, given yes."] The estimates of the parameters of the ZIP model are shown in Table 22.6. The Vuong statistic of 21.64 strongly supports the ZIP model over the Poisson model. (An attempt to combine the ZIP model with the negative binomial was unsuccessful. Since, as expected, the ancillary model for the zeros accounted for the overdispersion in the data, the negative binomial model degenerated to the Poisson form.) Finally,

    TABLE 22.6  Estimates of a Zero-Inflated Poisson Model

                   Poisson Regression            Logit Splitting Model         Marginal Effects
    Variable       Estimate    Standard Error    Estimate    Standard Error    ZIP       Tobit (0)   Tobit (0, 4)
    Constant       1.27        0.439             1.85        0.664             -         -           -
    Age            0.00422     0.0122            0.0397      0.0190            0.0252    0.0420      0.0218
    Years          0.0331      0.0231            0.0981      0.0318            0.0987    0.130       0.0654
    Religion       0.0909      0.0721            0.306       0.0951            0.288     0.394       0.199
    Occupation     0.0205      0.0441            0.0677      0.0607            0.0644    0.0762      0.0399
    Happiness      0.817       0.0666            0.458       0.0949            0.344     0.534       0.271

    the marginal effects, $\partial E[y \mid x]/\partial x$, are shown in Table 22.6 for three models: the ZIP model, Fair's original tobit model, and the tobit model estimated with the doubly censored count. The estimates for the ZIP model are considerably lower than those for Fair's tobit model. When the tobit model is reestimated with the censoring on the right, however, the resulting marginal effects are reasonably close to those from the ZIP model, though uniformly smaller. (This result may be from not building the censoring into the ZIP model, a refinement that would be relatively straightforward.)

    We conclude that the original tobit model provided only a fair approximation to the marginal effects produced by (we contend) the more appropriate specification of the Poisson model. But the approximation became much better when the data were recoded and treated as censored. Figure 22.3 also shows the predictions from the ZIP model (narrow bars). As might be expected, it provides a much better prediction of the dependent variable. (The integer values of the conditional mean function for the tobit model were roughly evenly split between zeros and ones, whereas the doubly censored model always predicted y = 0.) Surprisingly, the treatment of the highest observations does greatly affect the outcome. If the ZIP model is fit to the original uncensored data, then the vector of marginal effects is [0.0586, 0.2446, 0.692, 0.115, 0.787], which is extremely large. Thus, perhaps more analysis is called for (the ZIP model can be further improved, and one might reconsider the hurdle model), but we have tortured Fair's data enough. Further exploration is left for the reader.
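    For readers who want to see how the zero-inflated structure produces the excess zeros, the following is a minimal sketch of ZIP probabilities with a logit splitting probability. It assumes the common parameterization in which the logit gives the probability of the "always zero" regime; the formulation in Section 21.9.6 may use a different sign or index convention, and the function names and inputs here are hypothetical.

```python
import math


def zip_pmf(j, lam, logit_index):
    """Prob[y = j] in a zero-inflated Poisson model.

    lam         : Poisson mean in the count regime (lam > 0)
    logit_index : index of the logit splitting model; the probability of
                  the always-zero regime is q = 1 / (1 + exp(-logit_index))
    """
    q = 1.0 / (1.0 + math.exp(-logit_index))
    poisson = math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))
    if j == 0:
        return q + (1.0 - q) * poisson
    return (1.0 - q) * poisson


def zip_mean(lam, logit_index):
    """E[y] = (1 - q) * lam."""
    q = 1.0 / (1.0 + math.exp(-logit_index))
    return (1.0 - q) * lam


# Hypothetical parameter values, for illustration only.
lam, idx = 2.0, 1.0
print(zip_pmf(0, lam, idx), math.exp(-lam), zip_mean(lam, idx))
```

    The probability of a zero exceeds the plain Poisson value $e^{-\lambda}$ whenever the splitting probability is positive, which is the feature that lets the model absorb a preponderance of zeros without resorting to an overdispersion parameter.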

    22.4 THE SAMPLE SELECTION MODEL

    The topic of sample selection, or incidental truncation, has been the subject of an enormous recent literature, both theoretical and applied.18 This analysis combines both of the previous topics.

    18 A large proportion of the analysis in this framework has been in the area of labor economics. The results, however, have been applied in many other fields, including, for example, long series of stock market returns by financial economists ("survivorship bias") and medical treatment and response in long-term studies by clinical researchers ("attrition bias"). The four surveys noted in the introduction to this chapter provide fairly extensive, although far from exhaustive, lists of the studies. Some studies that comment on methodological issues are Heckman (1990), Manski (1989, 1990, 1992), and Newey, Powell, and Walker (1990).

    Example 22.6 Incidental Truncation

    In the high-income survey discussed in Example 22.2, respondents were also included in the survey if their net worth, not including their homes, was at least $500,000. Suppose that the survey of incomes was based only on people whose net worth was at least $500,000. This selection is a form of truncation, but not quite the same as in Section 22.2. This selection criterion does not necessarily exclude individuals whose incomes at the time might be quite low. Still, one would expect that, on average, individuals with a high net worth would have a high income as well. Thus, the average income in this subpopulation would in all likelihood also be misleading as an indication of the income of the typical American. The data in such a survey would be nonrandomly selected or incidentally truncated.

    Econometric studies of nonrandom sampling have analyzed the deleterious effects of sample selection on the properties of conventional estimators such as least squares; have produced a variety of alternative estimation techniques; and, in the process, have yielded a rich crop of empirical models. In some cases, the analysis has led to a reinterpretation of earlier results.
    22.4.1 INCIDENTAL TRUNCATION IN A BIVARIATE DISTRIBUTION

    Suppose that y and z have a bivariate distribution with correlation $\rho$. We are interested in the distribution of y given that z exceeds a particular value. Intuition suggests that if y and z are positively correlated, then the truncation of z should push the distribution of y to the right. As before, we are interested in (1) the form of the incidentally truncated distribution and (2) the mean and variance of the incidentally truncated random variable. Since it has dominated the empirical literature, we will focus first on the bivariate normal distribution.19

    The truncated joint density of y and z is
    $$f(y, z \mid z > a) = \frac{f(y, z)}{\text{Prob}(z > a)}.$$
    To obtain the incidentally truncated marginal density for y, we would then integrate z out of this expression. The moments of the incidentally truncated normal distribution are given in Theorem 22.5.20

    THEOREM 22.5 Moments of the Incidentally Truncated Bivariate Normal Distribution
    If y and z have a bivariate normal distribution with means $\mu_y$ and $\mu_z$, standard deviations $\sigma_y$ and $\sigma_z$, and correlation $\rho$, then
    $$E[y \mid z > a] = \mu_y + \rho\sigma_y\lambda(\alpha_z),$$
    $$\text{Var}[y \mid z > a] = \sigma_y^2[1 - \rho^2\delta(\alpha_z)], \qquad (22\text{-}19)$$
    where
    $$\alpha_z = (a - \mu_z)/\sigma_z, \quad \lambda(\alpha_z) = \phi(\alpha_z)/[1 - \Phi(\alpha_z)], \quad \text{and} \quad \delta(\alpha_z) = \lambda(\alpha_z)[\lambda(\alpha_z) - \alpha_z].$$

    19 We will reconsider the issue of the normality assumption in Section 22.4.5.
    20 Much more general forms of the result that apply to multivariate distributions are given in Johnson and Kotz (1974). See also Maddala (1983, pp. 266-267).

    Note that the expressions involving z are analogous to the moments of the truncated distribution of x given in Theorem 22.2. If the truncation is $z < a$, then we make the replacement $\lambda(\alpha_z) = -\phi(\alpha_z)/\Phi(\alpha_z)$. As expected, the truncated mean is pushed in the direction of the correlation if the truncation is from below and in the opposite direction if it is from above. In addition, the incidental truncation reduces the variance, because both $\delta(\alpha)$ and $\rho^2$ are between zero and one.
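    The two moments in Theorem 22.5 are easy to evaluate numerically. The sketch below covers truncation from below ($z > a$) only; the function name and the parameter values in the usage lines are hypothetical.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def incidental_truncation_moments(mu_y, sigma_y, mu_z, sigma_z, rho, a):
    """E[y | z > a] and Var[y | z > a] for bivariate normal (y, z)."""
    alpha = (a - mu_z) / sigma_z
    lam = phi(alpha) / (1.0 - Phi(alpha))      # inverse Mills ratio
    delta = lam * (lam - alpha)                # lies between 0 and 1
    mean = mu_y + rho * sigma_y * lam
    var = sigma_y ** 2 * (1.0 - rho ** 2 * delta)
    return mean, var


# Hypothetical values: positive correlation, truncation at the mean of z.
mean, var = incidental_truncation_moments(10.0, 2.0, 0.0, 1.0, 0.5, 0.0)
print(mean, var)
```

    With a positive correlation the conditional mean lies above $\mu_y$ and the conditional variance lies below $\sigma_y^2$; with $\rho = 0$ the selection on z leaves both moments of y unchanged.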
    22.4.2 REGRESSION IN A MODEL OF SELECTION

    To motivate a regression model that corresponds to the results in Theorem 22.5, we consider two examples.

    Example 22.7 A Model of Labor Supply

    A simple model of female labor supply that has been examined in many studies consists of two equations: [21]

    1. Wage equation. The difference between a person's market wage (what she could command in the labor market) and her reservation wage (the wage rate necessary to make her choose to participate in the labor market) is a function of characteristics such as age and education as well as, for example, number of children and where a person lives.
    2. Hours equation. The desired number of labor hours supplied depends on the wage, home characteristics such as whether there are small children present, marital status, and so on.

    The problem of truncation surfaces when we consider that the second equation describes desired hours, but an actual figure is observed only if the individual is working. (In most such studies, only a participation equation, that is, whether hours are positive or zero, is observable.) We infer from this that the market wage exceeds the reservation wage. Thus, the hours variable in the second equation is incidentally truncated.

    To put the preceding examples in a general framework, let the equation that determines the sample selection be

    $$z_i^* = \mathbf{w}_i'\boldsymbol{\gamma} + u_i,$$

    and let the equation of primary interest be

    $$y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i.$$

    The sample rule is that $y_i$ is observed only when $z_i^*$ is greater than zero. Suppose as well that $\varepsilon_i$ and $u_i$ have a bivariate normal distribution with zero means and correlation $\rho$. Then we may insert these in Theorem 22.5 to obtain the model that applies to the observations in our sample:

    $$\begin{aligned}
    E[y_i \mid y_i \text{ is observed}] &= E[y_i \mid z_i^* > 0] \\
    &= E[y_i \mid u_i > -\mathbf{w}_i'\boldsymbol{\gamma}] \\
    &= \mathbf{x}_i'\boldsymbol{\beta} + E[\varepsilon_i \mid u_i > -\mathbf{w}_i'\boldsymbol{\gamma}] \\
    &= \mathbf{x}_i'\boldsymbol{\beta} + \rho\sigma_\varepsilon\lambda_i(\alpha_u) \\
    &= \mathbf{x}_i'\boldsymbol{\beta} + \beta_\lambda\lambda_i(\alpha_u),
    \end{aligned}$$

    where $\alpha_u = -\mathbf{w}_i'\boldsymbol{\gamma}/\sigma_u$ and $\lambda(\alpha_u) = \phi(\mathbf{w}_i'\boldsymbol{\gamma}/\sigma_u)/\Phi(\mathbf{w}_i'\boldsymbol{\gamma}/\sigma_u)$. So

    $$y_i \mid z_i^* > 0 = E[y_i \mid z_i^* > 0] + v_i = \mathbf{x}_i'\boldsymbol{\beta} + \beta_\lambda\lambda_i(\alpha_u) + v_i.$$

    Least squares regression using the observed data (for instance, OLS regression of hours on its determinants, using only data for women who are working) produces inconsistent estimates of $\boldsymbol{\beta}$. Once again, we can view the problem as an omitted variable. Least squares regression of $y$ on $\mathbf{x}$ and $\lambda$ would be a consistent estimator, but if $\lambda$ is omitted, then the specification error of an omitted variable is committed. Finally, note that the second part of Theorem 22.5 implies that even if $\lambda_i$ were observed, then least squares would be inefficient. The disturbance $v_i$ is heteroscedastic.

    The marginal effect of the regressors on $y_i$ in the observed sample consists of two components. There is the direct effect on the mean of $y_i$, which is $\boldsymbol{\beta}$. In addition, for a particular independent variable, if it appears in the probability that $z_i^*$ is positive, then it will influence $y_i$ through its presence in $\lambda_i$. The full effect of changes in a regressor that appears in both $\mathbf{x}_i$ and $\mathbf{w}_i$ on $y$ is

    $$\frac{\partial E[y_i \mid z_i^* > 0]}{\partial x_{ik}} = \beta_k - \gamma_k\left(\frac{\rho\sigma_\varepsilon}{\sigma_u}\right)\delta_i(\alpha_u),$$

    where $\delta_i = \lambda_i^2 - \alpha_i\lambda_i$. [22] Suppose that $\rho$ is positive and $E[y_i]$ is greater when $z_i^*$ is positive than when it is negative. Since $0 < \delta_i < 1$, the additional term serves to reduce the marginal effect. The change in the probability affects the mean of $y_i$ in that the mean in the group $z_i^* > 0$ is higher. The second term in the derivative compensates for this effect, leaving only the marginal effect of a change given that $z_i^* > 0$ to begin with.

    Consider Example 22.9, and suppose that education affects both the probability of migration and the income in either state. If we suppose that the income of migrants is higher than that of otherwise identical people who do not migrate, then the marginal effect of education has two parts, one due to its influence in increasing the probability of the individual's entering a higher-income group and one due to its influence on income within the group. As such, the coefficient on education in the regression overstates the marginal effect of the education of migrants and understates it for nonmigrants. The sizes of the various parts depend on the setting. It is quite possible that the magnitude, sign, and statistical significance of the effect might all be different from those of the estimate of $\boldsymbol{\beta}$, a point that appears frequently to be overlooked in empirical studies.

    In most cases, the selection variable $z^*$ is not observed. Rather, we observe only its sign. To consider our two examples, we typically observe only whether a woman is working or not working or whether an individual migrated or not. We can infer the sign of $z^*$, but not its magnitude, from such information. Since there is no information on the scale of $z^*$, the disturbance variance in the selection equation cannot be estimated. (We encountered this problem in Chapter 21 in connection with the probit model.)

    [21] See, for example, Heckman (1976). This strand of literature begins with an exchange by Gronau (1974) and Lewis (1974).

    [22] We have reversed the sign of $\alpha_u$ in (22-19) since $a = 0$, and $\alpha = \mathbf{w}'\boldsymbol{\gamma}/\sigma_u$ is somewhat more convenient. Also, as such, $\partial\lambda/\partial\alpha = -\delta$.


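    Thus, we reformulate the model as follows:

    selection mechanism:
    $$z_i^* = \mathbf{w}_i'\boldsymbol{\gamma} + u_i, \quad z_i = 1 \text{ if } z_i^* > 0 \text{ and } 0 \text{ otherwise};$$
    $$\operatorname{Prob}(z_i = 1 \mid \mathbf{w}_i) = \Phi(\mathbf{w}_i'\boldsymbol{\gamma}) \quad \text{and} \quad \operatorname{Prob}(z_i = 0 \mid \mathbf{w}_i) = 1 - \Phi(\mathbf{w}_i'\boldsymbol{\gamma}). \qquad (22\text{-}20)$$

    regression model:
    $$y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i \text{ observed only if } z_i = 1,$$
    $$(u_i, \varepsilon_i) \sim \text{bivariate normal}[0, 0, 1, \sigma_\varepsilon, \rho].$$

    Suppose that, as in many of these studies, $z_i$ and $\mathbf{w}_i$ are observed for a random sample of individuals but $y_i$ is observed only when $z_i = 1$. This model is precisely the one we examined earlier, with

    $$E[y_i \mid z_i = 1, \mathbf{x}_i, \mathbf{w}_i] = \mathbf{x}_i'\boldsymbol{\beta} + \rho\sigma_\varepsilon\lambda_i, \qquad \lambda_i = \frac{\phi(\mathbf{w}_i'\boldsymbol{\gamma})}{\Phi(\mathbf{w}_i'\boldsymbol{\gamma})}.$$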
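    22.4.3 ESTIMATION

    The parameters of the sample selection model can be estimated by maximum likelihood. [23] However, Heckman's (1979) two-step estimation procedure is usually used instead. Heckman's method is as follows: [24]

    1. Estimate the probit equation by maximum likelihood to obtain estimates of $\boldsymbol{\gamma}$. For each observation in the selected sample, compute $\hat\lambda_i = \phi(\mathbf{w}_i'\hat{\boldsymbol{\gamma}})/\Phi(\mathbf{w}_i'\hat{\boldsymbol{\gamma}})$ and $\hat\delta_i = \hat\lambda_i(\hat\lambda_i + \mathbf{w}_i'\hat{\boldsymbol{\gamma}})$.
    2. Estimate $\boldsymbol{\beta}$ and $\beta_\lambda = \rho\sigma_\varepsilon$ by least squares regression of $y$ on $\mathbf{x}$ and $\hat\lambda$.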
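    The two steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`solve`, `ols`, `probit_mle`, `heckman_two_step`, `simulate`) are hypothetical, the probit step uses a basic Newton iteration started at zero, and the data are simulated with assumed values $\boldsymbol{\gamma} = (0.5, 1, 1)$, $\boldsymbol{\beta} = (1, 2)$, $\rho = 0.7$, and $\sigma_\varepsilon = 2$. Standard errors are not computed here.

```python
import math
import random


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Standard normal CDF via erfc, which stays accurate in the tails."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def ols(X, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


def probit_mle(W, z, iterations=15):
    """Step 1: probit coefficients by Newton's method."""
    k = len(W[0])
    gamma = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        neg_hess = [[0.0] * k for _ in range(k)]
        for w, zi in zip(W, z):
            index = sum(wj * gj for wj, gj in zip(w, gamma))
            lam = phi(index) / Phi(index) if zi == 1 else -phi(index) / Phi(-index)
            for i in range(k):
                grad[i] += lam * w[i]
                for j in range(k):
                    neg_hess[i][j] += lam * (lam + index) * w[i] * w[j]
        gamma = [g + s for g, s in zip(gamma, solve(neg_hess, grad))]
    return gamma


def heckman_two_step(W, z, X, y):
    """Return (gamma_hat, beta_hat, b_lambda); y is used only where z == 1."""
    gamma = probit_mle(W, z)
    X_star, y_sel = [], []
    for w, zi, x, yi in zip(W, z, X, y):
        if zi == 1:
            index = sum(wj * gj for wj, gj in zip(w, gamma))
            X_star.append(list(x) + [phi(index) / Phi(index)])  # append lambda_hat
            y_sel.append(yi)
    b = ols(X_star, y_sel)  # Step 2: regress y on x and lambda_hat
    return gamma, b[:-1], b[-1]


def simulate(n=4000, seed=1, rho=0.7, sigma_eps=2.0):
    """Hypothetical data with gamma = (0.5, 1, 1) and beta = (1, 2)."""
    rng = random.Random(seed)
    W, z, X, y = [], [], [], []
    for _ in range(n):
        x1, w2, u = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        eps = sigma_eps * (rho * u + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1))
        W.append([1.0, x1, w2])
        z.append(1 if 0.5 + x1 + w2 + u > 0 else 0)
        X.append([1.0, x1])
        y.append(1.0 + 2.0 * x1 + eps)  # in practice observed only when z == 1
    return W, z, X, y


W, z, X, y = simulate()
gamma_hat, beta_hat, b_lambda = heckman_two_step(W, z, X, y)
print("gamma_hat:", gamma_hat)
print("beta_hat: ", beta_hat, " b_lambda:", b_lambda)
```

    In this design the coefficient on $\hat\lambda$ should be close to $\rho\sigma_\varepsilon = 1.4$. The variable `w2` appears only in the selection equation, a choice made here so that $\hat\lambda_i$ is not nearly collinear with $\mathbf{x}_i$.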

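    It is possible also to construct consistent estimators of the individual parameters $\rho$ and $\sigma_\varepsilon$. At each observation, the true conditional variance of the disturbance would be

    $$\sigma_i^2 = \sigma_\varepsilon^2(1 - \rho^2\delta_i).$$

    The average conditional variance for the sample would converge to

    $$\operatorname{plim} \frac{1}{n}\sum_{i=1}^{n}\sigma_i^2 = \sigma_\varepsilon^2(1 - \rho^2\bar\delta),$$

    which is what is estimated by the least squares residual variance $\mathbf{e}'\mathbf{e}/n$. For the square of the coefficient on $\lambda$, we have

    $$\operatorname{plim} b_\lambda^2 = \rho^2\sigma_\varepsilon^2,$$

    whereas based on the probit results we have

    $$\operatorname{plim} \frac{1}{n}\sum_{i=1}^{n}\hat\delta_i = \bar\delta.$$

    We can then obtain a consistent estimator of $\sigma_\varepsilon^2$ using

    $$\hat\sigma_\varepsilon^2 = \frac{1}{n}\mathbf{e}'\mathbf{e} + \hat{\bar\delta}\,b_\lambda^2.$$

    Finally, an estimator of $\rho^2$ is

    $$\hat\rho^2 = \frac{b_\lambda^2}{\hat\sigma_\varepsilon^2},$$

    which provides a complete set of estimators of the model's parameters. [25]

    To test hypotheses, an estimate of the asymptotic covariance matrix of $[\mathbf{b}', b_\lambda]$ is needed. We have two problems to contend with. First, we can see in Theorem 22.5 that the disturbance term in

    $$(y_i \mid z_i = 1, \mathbf{x}_i, \mathbf{w}_i) = \mathbf{x}_i'\boldsymbol{\beta} + \rho\sigma_\varepsilon\lambda_i + v_i \qquad (22\text{-}21)$$

    is heteroscedastic;

    $$\operatorname{Var}[v_i \mid z_i = 1, \mathbf{x}_i, \mathbf{w}_i] = \sigma_\varepsilon^2(1 - \rho^2\delta_i).$$

    Second, there are unknown parameters in $\lambda_i$. Suppose that we assume for the moment that $\lambda_i$ and $\delta_i$ are known (i.e., we do not have to estimate $\boldsymbol{\gamma}$). For convenience, let $\mathbf{x}_i^* = [\mathbf{x}_i, \lambda_i]$, and let $\mathbf{b}^*$ be the least squares coefficient vector in the regression of $y$ on $\mathbf{x}^*$ in the selected data. Then, using the appropriate form of the variance of ordinary least squares in a heteroscedastic model from Chapter 11, we would have to estimate

    $$\operatorname{Var}[\mathbf{b}^*] = \sigma_\varepsilon^2[\mathbf{X}_*'\mathbf{X}_*]^{-1}\left[\sum_{i=1}^{n}(1 - \rho^2\delta_i)\mathbf{x}_i^*\mathbf{x}_i^{*\prime}\right][\mathbf{X}_*'\mathbf{X}_*]^{-1} = \sigma_\varepsilon^2[\mathbf{X}_*'\mathbf{X}_*]^{-1}[\mathbf{X}_*'(\mathbf{I} - \rho^2\boldsymbol{\Delta})\mathbf{X}_*][\mathbf{X}_*'\mathbf{X}_*]^{-1},$$

    where $\mathbf{I} - \rho^2\boldsymbol{\Delta}$ is a diagonal matrix with $(1 - \rho^2\delta_i)$ on the diagonal. Without any other complications, this result could be computed fairly easily using $\mathbf{X}_*$, the sample estimates of $\sigma_\varepsilon^2$ and $\rho^2$, and the assumed known values of $\lambda_i$ and $\delta_i$.

    The parameters in $\boldsymbol{\gamma}$ do have to be estimated using the probit equation. Rewrite (22-21) as

    $$(y_i \mid z_i = 1, \mathbf{x}_i, \mathbf{w}_i) = \mathbf{x}_i'\boldsymbol{\beta} + \beta_\lambda\hat\lambda_i + v_i - \beta_\lambda(\hat\lambda_i - \lambda_i).$$

    In this form, we see that in the preceding expression we have ignored both an additional source of variation in the compound disturbance and correlation across observations; the same estimate of $\boldsymbol{\gamma}$ is used to compute $\hat\lambda_i$ for every observation. Heckman has shown that the earlier covariance matrix can be appropriately corrected by adding a term inside the brackets,

    $$\mathbf{Q} = \hat\rho^2(\mathbf{X}_*'\hat{\boldsymbol{\Delta}}\mathbf{W})\,\text{Est.Asy.Var}[\hat{\boldsymbol{\gamma}}]\,(\mathbf{W}'\hat{\boldsymbol{\Delta}}\mathbf{X}_*) = \hat\rho^2\hat{\mathbf{F}}\hat{\mathbf{V}}\hat{\mathbf{F}}',$$

    where $\hat{\mathbf{V}} = \text{Est.Asy.Var}[\hat{\boldsymbol{\gamma}}]$, the estimator of the asymptotic covariance of the probit coefficients. Any of the estimators in (21-22) to (21-24) may be used to compute $\hat{\mathbf{V}}$. The complete expression is

    $$\text{Est.Asy.Var}[\mathbf{b}, b_\lambda] = \hat\sigma_\varepsilon^2[\mathbf{X}_*'\mathbf{X}_*]^{-1}[\mathbf{X}_*'(\mathbf{I} - \hat\rho^2\hat{\boldsymbol{\Delta}})\mathbf{X}_* + \mathbf{Q}][\mathbf{X}_*'\mathbf{X}_*]^{-1}. \quad [26]$$

    [23] See Greene (1995a).

    [24] Perhaps in a mimicry of the "tobit" estimator described earlier, this procedure has come to be known as the "Heckit" estimator.

    [25] Note that $\hat\rho^2$ is not a sample correlation and, as such, is not limited to [0, 1]. See Greene (1981) for discussion.

    [26] This matrix formulation is derived in Greene (1981). Note that the Murphy and Topel (1985) results for two-step estimators given in Theorem 10.3 would apply here as well. Asymptotically, this method would give the same answer. The Heckman formulation has become standard in the literature.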
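    The method of moments step described in this section, which recovers $\hat\sigma_\varepsilon^2$ and $\hat\rho^2$ from the second-step residuals, the $\hat\delta_i$, and $b_\lambda$, amounts to a few lines of arithmetic. The sketch below is illustrative only: the function name and the toy inputs are assumptions, and no covariance matrix is computed.

```python
def selection_moment_estimates(residuals, deltas, b_lambda):
    """Moment-based estimates of sigma_eps^2 and rho^2 after the two-step fit.

    residuals: least squares residuals e_i from the regression of y on x and lambda_hat
    deltas:    delta_hat_i = lambda_hat_i * (lambda_hat_i + w_i'gamma_hat)
    b_lambda:  least squares coefficient on lambda_hat
    """
    n = len(residuals)
    resid_var = sum(e * e for e in residuals) / n    # e'e / n
    delta_bar = sum(deltas) / len(deltas)            # sample mean of delta_hat_i
    sigma2_hat = resid_var + delta_bar * b_lambda ** 2
    rho2_hat = b_lambda ** 2 / sigma2_hat            # not forced into [0, 1]
    return sigma2_hat, rho2_hat


# Toy (assumed) inputs, chosen only to make the arithmetic easy to follow.
sigma2_hat, rho2_hat = selection_moment_estimates(
    residuals=[1.0, -1.0, 2.0, -2.0],
    deltas=[0.5, 0.5, 0.5, 0.5],
    b_lambda=1.0,
)
print(sigma2_hat, rho2_hat)
```

    The function deliberately does not clamp $\hat\rho^2$, since, as footnote 25 points out, the estimate is not a sample correlation.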


    TABLE 22.7 Estimated Selection Corrected Wage Equation

                        Two-Step                  Maximum Likelihood         Least Squares
                        Estimate    Std. Err.     Estimate    Std. Err.      Estimate    Std. Err.
    $\beta_1$           0.971       2.06          0.632       1.063          2.56        0.929
    $\beta_2$           0.021       0.0625        0.00897     0.000678       0.0325      0.0616
    $\beta_3$           0.000137    0.00188       0.334d-4    0.782d-7       0.000260    0.00184
    $\beta_4$           0.417       0.100         0.147       0.0142         0.481       0.0669
    $\beta_5$           0.444       0.316         0.144       0.0614         0.449       0.449
    $(\rho\sigma)$      1.100       0.127
    $\rho$              0.340                     0.131       0.218          0.000
    $\sigma$            3.200                     0.321       0.00866        3.111

    Example 22.8 Female Labor Supply

    Examples 21.1 and 21.4 proposed a labor force participation model for 753 married women in a sample analyzed by Mroz (1987). The data set contains wage and hours information for the 428 women who participated in the formal market (LFP = 1). Following Mroz, we suppose that for these 428 individuals, the offered wage exceeded the reservation wage and, moreover, the unobserved effects in the two wage equations are correlated. As such, a wage equation based on the market data should account for the sample selection problem. We specify a simple wage model:

    $$\text{wage} = \beta_1 + \beta_2\,\text{Exper} + \beta_3\,\text{Exper}^2 + \beta_4\,\text{Education} + \beta_5\,\text{City} + \varepsilon,$$

    where Exper is labor market experience and City is a dummy variable indicating that the individual lived in a large urban area. Maximum likelihood, Heckman two-step, and ordinary least squares estimates of the wage equation are shown in Table 22.7. The maximum likelihood estimates are FIML estimates; the labor force participation equation is reestimated at the same time. Only the parameters of the wage equation are shown in the table. Note as well that the two-step estimator estimates the single coefficient on $\lambda_i$, and the structural parameters $\sigma$ and $\rho$ are deduced by the method of moments. The maximum likelihood estimator computes estimates of these parameters directly. [Details on maximum likelihood estimation may be found in Maddala (1983).]

    The differences between the two-step and maximum likelihood estimates in Table 22.7 are surprisingly large. The difference is even more striking in the marginal effects. The effect for education is estimated as 0.417 + 0.0641 for the two-step estimators and 0.149 in total for the maximum likelihood estimates. For the kids variable, the marginal effect is 0.293 for the two-step estimates and only 0.0113 for the MLEs. Surprisingly, the direct test for a selection effect in the maximum likelihood estimates (a nonzero $\rho$) fails to reject the hypothesis that $\rho$ equals zero.

    In some settings, the selection process is a nonrandom sorting of individuals into two or more groups. The mover-stayer model in the next example is a familiar case.

    Example 22.9 A Mover-Stayer Model for Migration

    The model of migration analyzed by Nakosteen and Zimmer (1980) fits into the framework described above. The equations of the model are

    net benefit of moving: $M_i^* = \mathbf{w}_i'\boldsymbol{\gamma} + u_i$,
    income if moves: $I_{i1} = \mathbf{x}_{i1}'\boldsymbol{\beta}_1 + \varepsilon_{i1}$,
    income if stays: $I_{i0} = \mathbf{x}_{i0}'\boldsymbol{\beta}_0 + \varepsilon_{i0}$.

    One component of the net benefit is the market wage individuals could achieve if they move, compared with what they could obtain if they stay. Therefore, among the determinants of the net benefit are factors that also affect the income received in either place. An analysis of income in a sample of migrants must account for the incidental truncation of the mover's income on a positive net benefit. Likewise, the income of the stayer is incidentally truncated on a nonpositive net benefit. The model implies an income after moving for all observations, but we observe it only for those who actually do move. Nakosteen and Zimmer (1980) applied the selectivity model to a sample of 9,223 individuals with data for 2 years (1971 and 1973) sampled from the Social Security Administration's Continuous Work History Sample. Over the period, 1,078 individuals migrated and the remaining 8,145 did not. The independent variables in the migration equation were as follows:

    SE = self-employment dummy variable; 1 if yes,
    EMP = rate of growth of state employment,
    PCI = growth of state per capita income,
    x = age, race (nonwhite = 1), sex (female = 1),
    SIC = 1 if individual changes industry.

    The earnings equations included SIC and SE. The authors reported the results given in Table 22.8. The figures in parentheses are asymptotic t ratios.

    TABLE 22.8 Estimated Earnings Equations

                  Migration          Migrant Earnings    Nonmigrant Earnings
    Constant      1.509              9.041               8.593
    SE            0.708 (5.72)       4.104 (9.54)        4.161 (57.71)
    EMP           1.488 (2.60)
    PCI           1.455 (3.14)
    Age           0.008 (5.29)
    Race          0.065 (1.17)
    Sex           0.082 (2.14)
    SIC           0.948 (24.15)      0.790 (2.24)        0.927 (9.35)
    $\lambda$                        0.212 (0.50)        0.863 (2.84)
    22.4.4 TREATMENT EFFECTS

    The basic model of selectivity outlined earlier has been extended in an impressive variety of directions. [27] An interesting application that has found wide use is the measurement of treatment effects and program effectiveness. [28] An earnings equation that accounts for the value of a college education is

    $$\text{earnings}_i = \mathbf{x}_i'\boldsymbol{\beta} + \delta C_i + \varepsilon_i,$$

    where $C_i$ is a dummy variable indicating whether or not the individual attended college. The same format has been used in any number of other analyses of programs, experiments, and treatments. The question is: Does $\delta$ measure the value of a college education (assuming that the rest of the regression model is correctly specified)? The answer is no if the typical individual who chooses to go to college would have relatively high earnings whether or not he or she went to college. The problem is one of self-selection. If our observation is correct, then least squares estimates of $\delta$ will actually overestimate the treatment effect. The same observation applies to estimates of the treatment effects in other settings in which the individuals themselves decide whether or not they will receive the treatment.

    To put this in a more familiar context, suppose that we model program participation (e.g., whether or not the individual goes to college) as

    $$C_i^* = \mathbf{w}_i'\boldsymbol{\gamma} + u_i, \quad C_i = 1 \text{ if } C_i^* > 0,\ 0 \text{ otherwise}.$$

    We also suppose that, consistent with our previous conjecture, $u_i$ and $\varepsilon_i$ are correlated. Coupled with our earnings equation, we find that

    $$E[y_i \mid C_i = 1, \mathbf{x}_i, \mathbf{z}_i] = \mathbf{x}_i'\boldsymbol{\beta} + \delta + E[\varepsilon_i \mid C_i = 1, \mathbf{x}_i, \mathbf{z}_i] = \mathbf{x}_i'\boldsymbol{\beta} + \delta + \rho\sigma_\varepsilon\left[\frac{\phi(\mathbf{w}_i'\boldsymbol{\gamma})}{\Phi(\mathbf{w}_i'\boldsymbol{\gamma})}\right] \qquad (22\text{-}22)$$

    once again. [See (22-19).] Evidently, a viable strategy for estimating this model is to use the two-step estimator discussed earlier. The net result will be a different estimate of $\delta$ that will account for the self-selected nature of program participation. For nonparticipants, the counterpart to (22-22) is

    $$E[y_i \mid C_i = 0, \mathbf{x}_i, \mathbf{z}_i] = \mathbf{x}_i'\boldsymbol{\beta} + \rho\sigma_\varepsilon\left[\frac{-\phi(\mathbf{w}_i'\boldsymbol{\gamma})}{1 - \Phi(\mathbf{w}_i'\boldsymbol{\gamma})}\right].$$

    The difference in expected earnings between participants and nonparticipants is, then,

    $$E[y_i \mid C_i = 1, \mathbf{x}_i, \mathbf{z}_i] - E[y_i \mid C_i = 0, \mathbf{x}_i, \mathbf{z}_i] = \delta + \rho\sigma_\varepsilon\left[\frac{\phi_i}{\Phi_i(1 - \Phi_i)}\right].$$

    If the selectivity correction $\lambda_i$ is omitted from the least squares regression, then this difference is what is estimated by the least squares coefficient on the treatment dummy variable. But since (by assumption) all terms are positive, we see that least squares overestimates the treatment effect. Note, finally, that simply estimating separate equations for participants and nonparticipants does not solve the problem. In fact, doing so would be equivalent to estimating the two regressions of Example 22.9 by least squares, which, as we have seen, would lead to inconsistent estimates of both sets of parameters.

    There are many variations of this model in the empirical literature. They have been applied to the analysis of education, [29] the Head Start program, [30] and a host of other settings. [31] This strand of literature is particularly important because the use of dummy variable models to analyze treatment effects and program participation has a long history in empirical economics. This analysis has called into question the interpretation of a number of received studies.

    [27] For a survey, see Maddala (1983).

    [28] This is one of the fundamental applications of this body of techniques, and is also the setting for the most longstanding and contentious debate on the subject. A Journal of Business and Economic Statistics symposium [Angrist et al. (2001)] raised many of the important questions on whether and how it is possible to measure treatment effects.

    [29] Willis and Rosen (1979).

    [30] Goldberger (1972).

    [31] A useful summary of the issues is Barnow, Cain, and Goldberger (1981). See also Maddala (1983) for a long list of applications. A related application is the switching regression model. See, for example, Quandt (1982, 1988).
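    The size of the selection term in the participant and nonparticipant comparison can be computed directly from the two conditional means given in this section. The sketch below is a small numerical illustration; the function names and the values $\delta = 1$, $\mathbf{w}_i'\boldsymbol{\gamma} = 0$, $\rho = 0.5$, $\sigma_\varepsilon = 2$ are assumptions chosen for convenience.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def Phi(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def eps_mean_participant(index, rho, sigma_eps):
    """E[eps | C = 1] = rho * sigma_eps * phi(w'gamma) / Phi(w'gamma)."""
    return rho * sigma_eps * phi(index) / Phi(index)


def eps_mean_nonparticipant(index, rho, sigma_eps):
    """E[eps | C = 0] = -rho * sigma_eps * phi(w'gamma) / [1 - Phi(w'gamma)]."""
    return -rho * sigma_eps * phi(index) / (1.0 - Phi(index))


def naive_gap(delta, index, rho, sigma_eps):
    """Participant minus nonparticipant expected earnings at a common x."""
    return (delta
            + eps_mean_participant(index, rho, sigma_eps)
            - eps_mean_nonparticipant(index, rho, sigma_eps))


# Assumed values: true effect delta = 1, index w'gamma = 0, rho = 0.5, sigma_eps = 2.
gap = naive_gap(1.0, 0.0, 0.5, 2.0)
print("true effect: 1.0, naive gap:", gap)
```

    With $\rho > 0$ the gap exceeds $\delta$, which is the overestimate described in the text. Setting $\rho = 0$ removes the selection term and the gap reduces to $\delta$.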
    22.4.5 THE NORMALITY ASSUMPTION

    Some research has cast some skepticism on the selection model based on the normal distribution. [See Goldberger (1983) for an early salvo in this literature.] Among the findings are that the parameter estimates are surprisingly sensitive to the distributional assumption that underlies the model. Of course, this fact in itself does not invalidate the normality assumption, but it does call its generality into question. On the other hand, the received evidence is convincing that sample selection, in the abstract, raises serious problems, distributional questions aside. The literature [for example, Duncan (1986b), Manski (1989, 1990), and Heckman (1990)] has suggested some promising approaches based on robust and nonparametric estimators. These approaches obviously have the virtue of greater generality. Unfortunately, the cost is that they generally are quite limited in the breadth of the models they can accommodate. That is, one might gain the robustness of a nonparametric estimator at the cost of being unable to make use of the rich set of accompanying variables usually present in the panels to which selectivity models are often applied. For example, the nonparametric bounds approach of Manski (1990) is defined for two regressors. Other methods [e.g., Duncan (1986b)] allow more elaborate specification.

    Recent research includes specific attempts to move away from the normality assumption. [32] An example is Martins (2001), building on Newey (1991), which takes the core specification as given in (22-20) as the platform, but constructs an alternative to the assumption of bivariate normality. Martins's specification modifies the Heckman model by employing an equation of the form

    $$E[y_i \mid z_i = 1, \mathbf{x}_i, \mathbf{w}_i] = \mathbf{x}_i'\boldsymbol{\beta} + \mu(\mathbf{w}_i'\boldsymbol{\gamma}),$$

    where the latter "selectivity correction" is not the inverse Mills ratio, but some other result from a different model. The correction term is estimated using the Klein and Spady model discussed in Section 21.5.4. This is labeled a "semiparametric" approach. Whether the conditional mean in the selected sample should even remain a linear index function remains to be settled. Not surprisingly, Martins's results, based on two-step least squares, differ only slightly from the conventional results based on normality. This approach is arguably only a fairly small step away from the tight parameterization of the Heckman model. Other non- and semiparametric specifications, e.g., Honore and Kyriazidou (1999, 2000), represent more substantial departures from the normal model, but are much less operational. [33] The upshot is that the issue remains unsettled. For better or worse, the empirical literature on the subject continues to be dominated by Heckman's original model built around the joint normal distribution.

    [32] Again, Angrist et al. (2001) is an important contribution to this literature.

    [33] This particular work considers selection in a "panel" (mainly two periods). But the panel data setting for sample selection models is more involved than a cross-section analysis. In a panel data set, the "selection" is likely to be a decision at the beginning of Period 1 to be in the data set for all subsequent periods. As such, something more intricate than the model we have considered here is called for.


    22.4.6 SELECTION IN QUALITATIVE RESPONSE MODELS

    The problem of sample selection has been modeled in other settings besides the linear regression model. In Section 21.6.4, we saw, for example, an application of what amounts to a model of sample selection in a bivariate probit model; a binary response variable $y_i = 1$ if an individual defaults on a loan is observed only if a related variable $z_i$ equals one (the individual is granted a loan). Greene's (1992) application to credit card applications and defaults is similar. A current strand of literature has developed several models of sample selection for count data models. [34] Terza (1995) models the phenomenon as a form of heterogeneity in the Poisson model. We write

    $$y_i \mid \varepsilon_i \sim \text{Poisson}(\lambda_i), \qquad \ln\lambda_i \mid \varepsilon_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i. \qquad (22\text{-}23)$$

    Then the sample selection is similar to that discussed in the previous sections, with

    $$z_i^* = \mathbf{w}_i'\boldsymbol{\gamma} + u_i, \quad z_i = 1 \text{ if } z_i^* > 0,\ 0 \text{ otherwise},$$

    and $[\varepsilon_i, u_i]$ have a bivariate normal distribution with the same specification as in our earlier model. As before, we assume that $[y_i, \mathbf{x}_i]$ are only observed when $z_i = 1$. Thus, the effect of the selection is to affect the mean (and variance) of $y_i$, although the effect on the distribution is unclear. In the observed data, $y_i$ no longer has a Poisson distribution. Terza (1998), Terza and Kenkel (2001), and Greene (1997a) suggested a maximum likelihood approach for estimation.

    22.5 MODELS FOR DURATION DATA [35]

    Intuition might suggest that the longer a strike persists, the more likely it is that it will end within, say, the next week. Or is it? It seems equally plausible to suggest that the longer a strike has lasted, the more difficult must be the problems that led to it in the first place, and hence the less likely it is that it will end in the next short time interval. A similar kind of reasoning could be applied to spells of unemployment or the interval between conceptions. In each of these cases, it is not only the duration of the event, per se, that is interesting, but also the likelihood that the event will end in "the next period" given that it has lasted as long as it has. Analysis of the length of time until failure has interested engineers for decades. For example, the models discussed in this section were applied to the durability of electric and electronic components long before economists discovered their usefulness.

    [34] See, for example, Bockstael et al. (1990), Smith (1988), Brannas (1995), Greene (1994, 1995c, 1997a), Weiss (1995), and Terza (1995, 1998), and Winkelmann (1997).

    [35] There are a large number of highly technical articles on this topic but relatively few accessible sources for the uninitiated. A particularly useful introductory survey is Kiefer (1988), upon which we have drawn heavily for this section. Other useful sources are Kalbfleisch and Prentice (1980), Heckman and Singer (1984a), Lancaster (1990), and Florens, Fougere, and Mouchart (1996).


    Likewise, the analysis of survival times, for example, the length of survival after the onset of a disease or after an operation such as a heart transplant, has long been a staple of biomedical research. Social scientists have recently applied the same body of techniques to strike duration, length of unemployment spells, intervals between conception, time until business failure, length of time between arrests, length of time from purchase until a warranty claim is made, intervals between purchases, and so on.

    This section will give a brief introduction to the econometric analysis of duration data. As usual, we will restrict our attention to a few straightforward, relatively uncomplicated techniques and applications, primarily to introduce terms and concepts. The reader can then wade into the literature to find the extensions and variations. We will concentrate primarily on what are known as parametric models. These apply familiar inference techniques and provide a convenient departure point. Alternative approaches are considered at the end of the discussion.
    22.5.1 DURATION DATA

    The variable of interest in the analysis of duration is the length of time that elapses from the beginning of some event either until its end or until the measurement is taken, which may precede termination. Observations will typically consist of a cross section of durations, $t_1, t_2, \ldots, t_n$. The process being observed may have begun at different points in calendar time for the different individuals in the sample. For example, the strike duration data examined in Example 22.10 are drawn from nine different years.

    Censoring is a pervasive and usually unavoidable problem in the analysis of duration data. The common cause is that the measurement is made while the process is ongoing. An obvious example can be drawn from medical research. Consider analyzing the survival times of heart transplant patients. Although the beginning times may be known with precision, at the time of the measurement, observations on any individuals who are still alive are necessarily censored. Likewise, samples of spells of unemployment drawn from surveys will probably include some individuals who are still unemployed at the time the survey is taken. For these individuals, duration, or survival, is at least the observed $t_i$, but not equal to it. Estimation must account for the censored nature of the data for the same reasons as considered in Section 22.3. The consequences of ignoring censoring in duration data are similar to those that arise in regression analysis.

    In a conventional regression model that characterizes the conditional mean and variance of a distribution, the regressors can be taken as fixed characteristics at the point in time or for the individual for which the measurement is taken. When measuring duration, the observation is implicitly on a process that has been under way for an interval of time from zero to $t$. If the analysis is conditioned on a set of covariates (the counterparts to regressors) $\mathbf{x}_t$, then the duration is implicitly a function of the entire time path of the variable $\mathbf{x}(t)$, $t = (0, t)$, which may have changed during the interval. For example, the observed duration of employment in a job may be a function of the individual's rank in the firm. But their rank may have changed several times between the time they were hired and when the observation was made. As such, observed rank at the end of the job tenure is not necessarily a complete description of the individual's rank while they were employed. Likewise, marital status, family size, and amount of education are all variables that can change during the duration of unemployment and that one would like to account for in the duration model. The treatment of time-varying covariates is a considerable complication. [36]
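    The effect of ignoring censoring, described earlier in this section, can be illustrated with a small simulation. The sketch below rests on assumptions that are not part of the text: durations are drawn from an exponential distribution with rate 0.5 (mean duration 2), every spell still in progress at time 3 is censored, and the censoring-adjusted estimator used is the standard closed form for the exponential model (number of completed spells divided by total observed time). All function names are hypothetical.

```python
import random


def simulate_censored(n, rate, censor_time, seed=0):
    """Return (observed_duration, completed) pairs with censoring at censor_time."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        t = rng.expovariate(rate)  # latent (true) duration
        data.append((min(t, censor_time), t <= censor_time))
    return data


def naive_mean_duration(data):
    """Treats censored spells as if they had ended when measured."""
    return sum(t for t, _ in data) / len(data)


def censored_exponential_rate(data):
    """Exponential MLE of the rate that accounts for censoring."""
    completed = sum(1 for _, done in data if done)
    total_time = sum(t for t, _ in data)
    return completed / total_time


data = simulate_censored(n=20_000, rate=0.5, censor_time=3.0)
rate_hat = censored_exponential_rate(data)
print("naive mean duration:    ", naive_mean_duration(data))
print("adjusted mean duration: ", 1.0 / rate_hat)
```

    Under these assumptions the naive average understates the true mean duration of 2, because every censored spell is recorded as shorter than it really is, while the adjusted estimate should be close to 2.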
    22.5.2 A REGRESSION-LIKE APPROACH: PARAMETRIC MODELS OF DURATION

    We will use the term spell as a catchall for the different duration variables we might measure. Spell length is represented by the random variable $T$. A simple approach to duration analysis would be to apply regression analysis to the sample of observed spells. By this device, we could characterize the expected duration, perhaps conditioned on a set of covariates whose values were measured at the end of the period. We could also assume that conditioned on an $\mathbf{x}$ that has remained fixed from $T = 0$ to $T = t$, $t$ has a normal distribution, as we commonly do in regression. We could then characterize the probability distribution of observed duration times. But normality turns out not to be particularly attractive in this setting for a number of reasons, not least of which is that duration is positive by construction, while a normally distributed variable can take negative values. (Lognormality turns out to be a palatable alternative, but it is only one among a long list of candidates.)
    22.5.2.a Theoretical Background

    Suppose that the random variable $T$ has a continuous probability distribution $f(t)$, where $t$ is a realization of $T$. The cumulative probability is

    $$F(t) = \int_0^t f(s)\,ds = \operatorname{Prob}(T \le t).$$

    We will usually be more interested in the probability that the spell is of length at least $t$, which is given by the survival function,

    $$S(t) = 1 - F(t) = \operatorname{Prob}(T \ge t).$$

    Consider the question raised in the introduction: Given that the spell has lasted until time $t$, what is the probability that it will end in the next short interval of time, say $\Delta t$? It is

    $$l(t, \Delta t) = \operatorname{Prob}(t \le T \le t + \Delta t \mid T \ge t).$$

    A useful function for characterizing this aspect of the distribution is the hazard rate,

    $$\lambda(t) = \lim_{\Delta t \to 0}\frac{\operatorname{Prob}(t \le T \le t + \Delta t \mid T \ge t)}{\Delta t} = \lim_{\Delta t \to 0}\frac{F(t + \Delta t) - F(t)}{\Delta t\,S(t)} = \frac{f(t)}{S(t)}.$$

    Roughly, the hazard rate is the rate at which spells are completed after duration $t$, given that they last at least until $t$. As such, the hazard function gives an answer to our original question.

    The hazard function, the density, the CDF, and the survival function are all related. The hazard function is

    $$\lambda(t) = \frac{-d\ln S(t)}{dt},$$

    so

    $$f(t) = S(t)\lambda(t).$$

    Another useful function is the integrated hazard function

    $$\Lambda(t) = \int_0^t \lambda(s)\,ds,$$

    for which

    $$S(t) = e^{-\Lambda(t)},$$

    so

    $$\Lambda(t) = -\ln S(t).$$

    The integrated hazard function is a generalized residual in this setting. [See Chesher and Irish (1987) and Example 22.10.]

    [36] See Petersen (1986) for one approach to this problem.
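    The relationships among the hazard, the integrated hazard, the survival function, and the density can be checked numerically. The sketch below integrates a user-supplied hazard with the trapezoidal rule and recovers $S(t)$ and $f(t)$ from it. The function names and the two example hazards (a constant hazard of 0.25 and the hypothetical form $0.1 + 0.2t$) are illustrative assumptions.

```python
import math


def integrated_hazard(hazard, t, steps=10_000):
    """Lambda(t): integral of hazard(s) over [0, t] by the trapezoidal rule."""
    h = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for i in range(1, steps):
        total += hazard(i * h)
    return total * h


def survival(hazard, t):
    """S(t) = exp(-Lambda(t))."""
    return math.exp(-integrated_hazard(hazard, t))


def density(hazard, t):
    """f(t) = S(t) * lambda(t)."""
    return hazard(t) * survival(hazard, t)


def hazard_from_survival(hazard, t, h=1e-4):
    """Recover lambda(t) as -d ln S(t)/dt by a central difference."""
    return (math.log(survival(hazard, t - h)) - math.log(survival(hazard, t + h))) / (2 * h)


def constant(s):
    return 0.25            # assumed constant hazard


def rising(s):
    return 0.1 + 0.2 * s   # assumed upward-sloping hazard


print("S(2), constant hazard:", survival(constant, 2.0), "vs", math.exp(-0.5))
print("Lambda(3), rising hazard:", integrated_hazard(rising, 3.0))
print("hazard recovered at t = 3:", hazard_from_survival(rising, 3.0), "vs", rising(3.0))
```

    Passing the hazard as a function keeps the sketch independent of any particular distribution, so that other hazard specifications can be substituted without changing the helpers.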
    22 5 2 b Models of the Hazard Function
    t



    For present purposes the hazard function is more interesting than the survival rate or the density Based on the previous results one might consider modeling the hazard function itself rather than say modeling the survival function then obtaining the density and the hazard For example the base case for many analyses is a hazard rate that does not vary over time That is t is a constant This is characteristic of a process that has no memory the conditional probability of failure in a given short interval is the same regardless of when the observation is made Thus t From the earlier de nition we obtain the simple differential equation d ln S t dt The solution is ln S t k t or S t Ke t where K is the constant of integration The terminal condition that S 0 1 implies that K 1 and the solution is S t e t This solution is the exponential distribution which has been used to model the time until failure of electronic components Estimation of is simple since with an exponential distribution E t 1 The maximum likelihood estimator of would be the reciprocal of the sample mean A natural extension might be to model the hazard rate as a linear function t t Then t t 1 t 2 and f t t S t t exp t To avoid a nega2 tive hazard function one might depart from t exp g t where is a vector of parameters to be estimated With an observed sample of durations estimation of and

    Greene 50240

    book

    June 28 2002

    17 5

    794

    CHAPTER 22 Limited Dependent Variable and Duration Models

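    TABLE 22.9 Survival Distributions

    Distribution   Hazard Function, $\lambda(t)$                                          Survival Function, $S(t)$
    Exponential    $\lambda(t) = \lambda$                                                 $S(t) = e^{-\lambda t}$
    Weibull        $\lambda(t) = \lambda p (\lambda t)^{p-1}$                             $S(t) = e^{-(\lambda t)^p}$
    Lognormal      $f(t) = (p/t)\,\phi[p \ln(\lambda t)]$                                 $S(t) = \Phi[-p \ln(\lambda t)]$
                   [$\ln t$ is normally distributed with mean $-\ln\lambda$ and standard deviation $1/p$.]
    Loglogistic    $\lambda(t) = \lambda p (\lambda t)^{p-1} / [1 + (\lambda t)^p]$       $S(t) = 1 / [1 + (\lambda t)^p]$
                   [$\ln t$ has a logistic distribution with mean $-\ln\lambda$ and variance $\pi^2/(3p^2)$.]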

    is, at least in principle, a straightforward problem in maximum likelihood. [Kennan (1985) used a similar approach.]

    A distribution whose hazard function slopes upward is said to have positive duration dependence. For such distributions, the likelihood of failure at time $t$, conditional upon duration up to time $t$, is increasing in $t$. The opposite case is that of decreasing hazard or negative duration dependence. Our question in the introduction about whether the strike is more or less likely to end at time $t$ given that it has lasted until time $t$ can be framed in terms of positive or negative duration dependence. The assumed distribution has a considerable bearing on the answer. If one is unsure at the outset of the analysis whether the data can be characterized by positive or negative duration dependence, then it is counterproductive to assume a distribution that displays one characteristic or the other over the entire range of $t$. Thus, the exponential distribution and our suggested extension could be problematic. The literature contains a cornucopia of choices for duration models: normal, inverse normal [inverse Gaussian; see Lancaster (1990)], lognormal, F, gamma, Weibull (which is a popular choice), and many others. [37] To illustrate the differences, we will examine a few of the simpler ones. Table 22.9 lists the hazard functions and survival functions for four commonly used distributions. Each involves two parameters, a location parameter, $\lambda$, and a scale parameter, $p$. [Note that in the benchmark case of the exponential distribution, $\lambda$ is the hazard function. In all other cases, the hazard function is a function of $\lambda$, $p$, and, where there is duration dependence, $t$ as well. Different authors, e.g., Kiefer (1988), use different parameterizations of these models. We follow the convention of Kalbfleisch and Prentice (1980).]

    All these are distributions for a nonnegative random variable. Their hazard functions display very different behaviors, as can be seen in Figure 22.4. The hazard function for the exponential distribution is constant, that for the Weibull is monotonically increasing or decreasing depending on $p$, and the hazards for the lognormal and loglogistic distributions first increase and then decrease. Which among these or the many alternatives is likely to be best in any application is uncertain.
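    The shapes just described can be checked numerically. The following is a minimal sketch, not part of the original text, that evaluates the hazard functions of Table 22.9 on a small grid of days. The function name `hazard`, the grid, and the use of the point estimates from Table 22.10 as illustrative parameter values are assumptions made for this example.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def hazard(dist, t, lam, p=1.0):
    """Hazard rate lambda(t) for the four models of Table 22.9."""
    if dist == "exponential":
        return lam
    if dist == "weibull":
        return lam * p * (lam * t) ** (p - 1.0)
    if dist == "loglogistic":
        return lam * p * (lam * t) ** (p - 1.0) / (1.0 + (lam * t) ** p)
    if dist == "lognormal":
        z = p * math.log(lam * t)
        density = (p / t) * normal_pdf(z)   # f(t)
        survival = normal_cdf(-z)           # S(t)
        return density / survival           # lambda(t) = f(t) / S(t)
    raise ValueError(f"unknown distribution: {dist}")

# Illustrative (lambda, p) pairs: the point estimates reported in Table 22.10.
params = {
    "exponential": (0.02344, 1.0),
    "weibull": (0.02439, 0.92083),
    "loglogistic": (0.04153, 1.33148),
    "lognormal": (0.04514, 0.77206),
}
days = [1, 5, 10, 20, 40, 80]
curves = {name: [hazard(name, t, lam, p) for t in days]
          for name, (lam, p) in params.items()}
for name, values in curves.items():
    print(name, [round(v, 4) for v in values])
```

    With these values the exponential hazard is flat, the Weibull hazard falls throughout (since p < 1), and the loglogistic and lognormal hazards rise and then fall, which is the pattern the text attributes to Figure 22.4.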
    22.5.2.c Maximum Likelihood Estimation

    The parameters $\lambda$ and $p$ of these models can be estimated by maximum likelihood. For observed duration data, $t_1, t_2, \ldots, t_n$, the log-likelihood function can be formulated and maximized in the ways we have become familiar with in earlier chapters. Censored observations can be incorporated as in Section 22.3 for the tobit model. [See (22-13).]

    [37] Three sources that contain numerous specifications are Kalbfleisch and Prentice (1980), Cox and Oakes (1985), and Lancaster (1990).


    FIGURE 22.4 Parametric Hazard Functions. [Figure: hazard function (vertical axis, 0 to 0.040) plotted against days (horizontal axis, 0 to 100) for the lognormal, loglogistic, exponential, and Weibull models.]

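    As such,

    $$\ln L(\theta) = \sum_{\text{uncensored observations}} \ln f(t \mid \theta) + \sum_{\text{censored observations}} \ln S(t \mid \theta),$$

    where $\theta = (\lambda, p)$. For some distributions, it is convenient to formulate the log-likelihood function in terms of $f(t) = \lambda(t) S(t)$ so that

    $$\ln L = \sum_{\text{uncensored observations}} \ln \lambda(t \mid \theta) + \sum_{\text{all observations}} \ln S(t \mid \theta).$$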
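    As an illustration of the second form of the log-likelihood, here is a minimal sketch for the Weibull model with right censoring. The data and the function names are hypothetical. For the exponential case (p = 1) the log-likelihood is d ln λ − λ Σ t, where d is the number of completed spells, so the maximizer has the closed form d / Σ t; the sketch uses that fact as a check.

```python
import math

def weibull_loglik(lam, p, durations, completed):
    """ln L = sum over uncensored of ln lambda(t) + sum over all of ln S(t)."""
    total = 0.0
    for t, done in zip(durations, completed):
        if done:  # an uncensored spell contributes the log hazard
            total += math.log(lam * p) + (p - 1.0) * math.log(lam * t)
        total += -((lam * t) ** p)  # every spell contributes ln S(t)
    return total

def exponential_mle(durations, completed):
    """Closed-form maximizer when p = 1: completed spells / total exposure."""
    return sum(completed) / sum(durations)

# Hypothetical data: durations in days, with two spells censored at 80 days.
durations = [7.0, 12.0, 25.0, 33.0, 52.0, 80.0, 80.0]
completed = [1, 1, 1, 1, 1, 0, 0]

lam_hat = exponential_mle(durations, completed)
ll_hat = weibull_loglik(lam_hat, 1.0, durations, completed)
print(lam_hat, ll_hat)
```

    For p different from one there is no closed form, and the same function would be handed to a numerical optimizer.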

    Inference about the parameters can be done in the usual way. Either the BHHH estimator or actual second derivatives can be used to estimate asymptotic standard errors for the estimates. The transformation $w = p(\ln t + \ln \lambda)$ for these distributions greatly facilitates maximum likelihood estimation. For example, for the Weibull model, by defining $w = p(\ln t + \ln \lambda)$, we obtain the very simple density $f(w) = \exp[w - \exp(w)]$ and survival function $S(w) = \exp(-\exp(w))$. [38] Therefore, by using $\ln t$ instead of $t$, we greatly simplify the log-likelihood function. Details for these and several other distributions may be found in Kalbfleisch and Prentice (1980, pp. 56-60). The Weibull distribution is examined in detail in the next section.
    [38] The transformation is $\exp(w) = (\lambda t)^p$, so $t = (1/\lambda)[\exp(w)]^{1/p}$. The Jacobian of the transformation is $dt/dw = [\exp(w)]^{1/p}/(\lambda p)$. The density implied by Table 22.9 is $\lambda p [\exp(w)]^{1-(1/p)} [\exp(-\exp(w))]$. Multiplying by the Jacobian produces the result, $f(w) = \exp[w - \exp(w)]$. The survival function is the antiderivative, $\exp(-\exp(w))$.
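    The change of variable can be verified numerically. The sketch below (an illustration with arbitrary parameter values, not estimates) compares the extreme value form f(w) with the Weibull density in t multiplied by the Jacobian dt/dw, which equals t/p because t = exp(w/p)/λ.

```python
import math

def weibull_density_t(t, lam, p):
    """f(t) = lambda(t) S(t), using the Weibull row of Table 22.9."""
    return lam * p * (lam * t) ** (p - 1.0) * math.exp(-((lam * t) ** p))

def density_w(w):
    """Extreme value form f(w) = exp(w - exp(w))."""
    return math.exp(w - math.exp(w))

lam, p = 0.025, 1.3  # illustrative values only
pairs = []
for t in (5.0, 20.0, 60.0):
    w = p * (math.log(t) + math.log(lam))
    jacobian = t / p  # dt/dw
    pairs.append((density_w(w), weibull_density_t(t, lam, p) * jacobian))
    print(t, pairs[-1])
```

    The two columns agree to rounding error, as the footnote's derivation implies.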


    22.5.2.d Exogenous Variables

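    One limitation of the models given above is that external factors are not given a role in the survival distribution. The addition of "covariates" to duration models is fairly straightforward, although the interpretation of the coefficients in the model is less so. Consider, for example, the Weibull model. (The extension to other distributions will be similar.) Let

    $$\lambda_i = e^{-\mathbf{x}_i'\boldsymbol{\beta}},$$

    where $\mathbf{x}_i$ is a constant term and a set of variables that are assumed not to change from time $T = 0$ until the "failure time," $T = t_i$. Making $\lambda_i$ a function of a set of regressors is equivalent to changing the units of measurement on the time axis. For this reason, these models are sometimes called accelerated failure time models. Note as well that in all the models listed (and generally), the regressors do not bear on the question of duration dependence, which is a function of $p$.

    Let $\sigma = 1/p$ and let $\delta_i = 1$ if the spell is completed and $\delta_i = 0$ if it is censored. As before, let

    $$w_i = p \ln(\lambda_i t_i) = \frac{\ln t_i - \mathbf{x}_i'\boldsymbol{\beta}}{\sigma}$$

    and denote the density and survival functions $f(w_i)$ and $S(w_i)$. The observed random variable is

    $$\ln t_i = \sigma w_i + \mathbf{x}_i'\boldsymbol{\beta}.$$

    The Jacobian of the transformation from $w_i$ to $\ln t_i$ is $dw_i / d \ln t_i = 1/\sigma$, so the density and survival functions for $\ln t_i$ are

    $$f(\ln t_i \mid \mathbf{x}_i, \boldsymbol{\beta}, \sigma) = \frac{1}{\sigma} f\!\left(\frac{\ln t_i - \mathbf{x}_i'\boldsymbol{\beta}}{\sigma}\right) \quad \text{and} \quad S(\ln t_i \mid \mathbf{x}_i, \boldsymbol{\beta}, \sigma) = S\!\left(\frac{\ln t_i - \mathbf{x}_i'\boldsymbol{\beta}}{\sigma}\right).$$

    The log-likelihood for the observed data is

    $$\ln L(\boldsymbol{\beta}, \sigma \mid \text{data}) = \sum_{i=1}^{n} \left[\delta_i \ln f(\ln t_i \mid \mathbf{x}_i, \boldsymbol{\beta}, \sigma) + (1 - \delta_i) \ln S(\ln t_i \mid \mathbf{x}_i, \boldsymbol{\beta}, \sigma)\right].$$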

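    For the Weibull model, for example (see footnote 38), $f(w_i) = \exp(w_i - e^{w_i})$ and $S(w_i) = \exp(-e^{w_i})$. Making the transformation to $\ln t_i$ and collecting terms reduces the log-likelihood to

    $$\ln L(\boldsymbol{\beta}, \sigma \mid \text{data}) = \sum_i \left[\delta_i \left(\frac{\ln t_i - \mathbf{x}_i'\boldsymbol{\beta}}{\sigma} - \ln \sigma\right) - \exp\!\left(\frac{\ln t_i - \mathbf{x}_i'\boldsymbol{\beta}}{\sigma}\right)\right].$$

    (Many other distributions, including the others in Table 22.9, simplify in the same way. The exponential model is obtained by setting $\sigma$ to one.) The derivatives can be equated to zero using the methods described in Appendix E. The individual terms can also be used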


    to form the BHHH estimator of the asymptotic covariance matrix for the estimator. [39] (The Hessian is also simple to derive, so Newton's method could be used instead. [40])

    Note that the hazard function generally depends on $t$, $p$, and $\mathbf{x}$. The sign of an estimated coefficient suggests the direction of the effect of the variable on the hazard function when the hazard is monotonic. But in those cases, such as the loglogistic, in which the hazard is nonmonotonic, even this may be ambiguous. The magnitudes of the effects may also be difficult to interpret in terms of the hazard function. In a few cases, we do get a regression-like interpretation. In the Weibull and exponential models, $E[t \mid \mathbf{x}_i] = \exp(\mathbf{x}_i'\boldsymbol{\beta})\,\Gamma[(1/p) + 1]$, whereas for the lognormal and loglogistic models, $E[\ln t \mid \mathbf{x}_i] = \mathbf{x}_i'\boldsymbol{\beta}$. In these cases, $\beta_k$ is the derivative (or a multiple of the derivative) of this conditional mean. For some other distributions, the conditional median of $t$ is easily obtained. Numerous cases are discussed by Kiefer (1988), Kalbfleisch and Prentice (1980), and Lancaster (1990).
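    The Weibull log-likelihood with covariates is short enough to write out directly. The following is a minimal sketch under stated assumptions: the data are hypothetical, the only regressor is a constant, and the function names are invented for the example. With σ = 1 (the exponential case) and no censoring, setting the derivative to zero gives exp(β) equal to the sample mean of t, which the sketch uses as a check, together with E[t | x] = exp(x'β) Γ(σ + 1).

```python
import math

def weibull_aft_loglik(beta, sigma, log_t, X, delta):
    """ln L = sum_i [ delta_i (w_i - ln sigma) - exp(w_i) ],
    with w_i = (ln t_i - x_i'beta) / sigma."""
    total = 0.0
    for lt, x, d in zip(log_t, X, delta):
        xb = sum(xk * bk for xk, bk in zip(x, beta))
        w = (lt - xb) / sigma
        total += d * (w - math.log(sigma)) - math.exp(w)
    return total

def conditional_mean(x, beta, sigma):
    """E[t | x] = exp(x'beta) * Gamma(sigma + 1), since sigma = 1/p."""
    xb = sum(xk * bk for xk, bk in zip(x, beta))
    return math.exp(xb) * math.gamma(sigma + 1.0)

# Hypothetical completed spells; the regressor vector is a constant term only.
t = [4.0, 9.0, 15.0, 22.0, 50.0]
log_t = [math.log(v) for v in t]
X = [[1.0] for _ in t]
delta = [1] * len(t)

# With sigma = 1 and no censoring, the maximizer is ln(mean of t).
b_hat = math.log(sum(t) / len(t))
ll_hat = weibull_aft_loglik([b_hat], 1.0, log_t, X, delta)
print(b_hat, ll_hat, conditional_mean([1.0], [b_hat], 1.0))
```

    In an application, both β and σ would be found by a numerical optimizer, and the per-observation terms of the sum would supply the BHHH covariance estimate.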
    22.5.2.e Heterogeneity

    The problem of heterogeneity in duration models can be viewed essentially as the result of an incomplete specification. Individual specific covariates are intended to incorporate observation specific effects. But if the model specification is incomplete and if systematic individual differences in the distribution remain after the observed effects are accounted for, then inference based on the improperly specified model is likely to be problematic. We have already encountered several settings in which the possibility of heterogeneity mandated a change in the model specification; the fixed and random effects regression, logit, and probit models all incorporate observation specific effects. Indeed, all the failures of the linear regression model discussed in the preceding chapters can be interpreted as a consequence of heterogeneity arising from an incomplete specification.

    There are a number of ways of extending duration models to account for heterogeneity. The strictly nonparametric approach of the Kaplan-Meier estimator (see Section 22.5.3) is largely immune to the problem, but it is also rather limited in how much information can be culled from it. One direct approach is to model heterogeneity in the parametric model. Suppose that we posit a survival function conditioned on the individual specific effect $v_i$. We treat the survival function as $S(t_i \mid v_i)$. Then add to that a model for the unobserved heterogeneity $f(v_i)$. (Note that this is a counterpart to the incorporation of a disturbance in a regression model and follows the same procedures that we used in the Poisson model with random effects.) Then

    $$S(t) = E_v[S(t \mid v)] = \int_v S(t \mid v) f(v)\, dv.$$

    The gamma distribution is frequently used for this purpose. [41] Consider, for example, using this device to incorporate heterogeneity into the Weibull model we used earlier. As is typical, we assume that $v$ has a gamma distribution with mean 1 and variance
    [39] Note that the log-likelihood function has the same form as that for the tobit model in Section 22.3. By just reinterpreting the nonlimit observations in a tobit setting, we can, therefore, use this framework to apply a wide range of distributions to the tobit model. [See Greene (1995a) and references given therein.]

    [40] See Kalbfleisch and Prentice (1980) for numerous other examples.

    [41] See, for example, Hausman, Hall, and Griliches (1984), who use it to incorporate heterogeneity in the Poisson regression model. The application is developed in Section 21.9.5.


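    $\theta = 1/k$. Then

    $$f(v) = \frac{k^k}{\Gamma(k)}\, e^{-kv} v^{k-1} \quad \text{and} \quad S(t \mid v) = e^{-v(\lambda t)^p}.$$

    After a bit of manipulation, we obtain the unconditional distribution,

    $$S(t) = \int_0^\infty S(t \mid v) f(v)\, dv = [1 + \theta(\lambda t)^p]^{-1/\theta}.$$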

    The limiting value, with $\theta = 0$, is the Weibull survival model, so $\theta = 0$ corresponds to $\mathrm{Var}[v] = 0$, or no heterogeneity. [42] The hazard function for this model is

    $$\lambda(t) = \lambda p (\lambda t)^{p-1} [S(t)]^{\theta},$$

    which shows the relationship to the Weibull model.

    This approach is common in parametric modeling of heterogeneity. In an important paper on this subject, Heckman and Singer (1984b) argued that this approach tends to overparameterize the survival distribution and can lead to rather serious errors in inference. They gave some dramatic examples to make the point. They also expressed some concern that researchers tend to choose the distribution of heterogeneity more on the basis of mathematical convenience than on any sensible economic basis.
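    The "bit of manipulation" can be checked numerically. The sketch below is an illustration only: it assumes the conditional survival function takes the form exp[−v(λt)^p], uses arbitrary parameter values, and integrates against the gamma density with a simple midpoint rule. It then confirms that the closed form approaches the Weibull survival function as θ goes to zero.

```python
import math

def gamma_density(v, k):
    """Gamma density with mean 1 and variance 1/k."""
    return math.exp(k * math.log(k) - math.lgamma(k) - k * v
                    + (k - 1.0) * math.log(v))

def mixed_survival_numeric(t, lam, p, theta, upper=60.0, steps=120000):
    """Midpoint-rule approximation of the integral of S(t|v) f(v) over v."""
    k = 1.0 / theta
    c = (lam * t) ** p
    h = upper / steps
    total = 0.0
    for j in range(steps):
        v = (j + 0.5) * h
        total += math.exp(-v * c) * gamma_density(v, k)
    return total * h

def mixed_survival_closed(t, lam, p, theta):
    """S(t) = [1 + theta (lam t)^p]^(-1/theta)."""
    return (1.0 + theta * (lam * t) ** p) ** (-1.0 / theta)

lam, p, theta, t = 0.025, 0.9, 0.5, 30.0  # illustrative values only
numeric = mixed_survival_numeric(t, lam, p, theta)
closed = mixed_survival_closed(t, lam, p, theta)
weibull = math.exp(-((lam * t) ** p))
near_zero = mixed_survival_closed(t, lam, p, 1e-8)
print(numeric, closed, weibull, near_zero)
```

    The first two numbers agree to the accuracy of the quadrature, and the last two agree because the heterogeneity variance is essentially zero.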
    22.5.3 OTHER APPROACHES

    The parametric models are attractive for their simplicity. But by imposing as much structure on the data as they do, the models may distort the estimated hazard rates. It may be that a more accurate representation can be obtained by imposing fewer restrictions.

    The Kaplan-Meier (1958) product limit estimator is a strictly empirical, nonparametric approach to survival and hazard function estimation. Assume that the observations on duration are sorted in ascending order so that $t_1 \le t_2$ and so on and, for now, that no observations are censored. Suppose as well that there are $K$ distinct survival times in the data, denoted $T_k$; $K$ will equal $n$ unless there are ties. Let $n_k$ denote the number of individuals whose observed duration is at least $T_k$. The set of individuals whose duration is at least $T_k$ is called the risk set at this duration. (We borrow, once again, from biostatistics, where the risk set is those individuals still "at risk" at time $T_k$.) Thus, $n_k$ is the size of the risk set at time $T_k$. Let $h_k$ denote the number of observed spells completed at time $T_k$. A strictly empirical estimate of the survivor function would be

    $$\hat{S}(T_k) = \prod_{i=1}^{k} \frac{n_i - h_i}{n_i} = \frac{n_k - h_k}{n_1}.$$

    [42] For the strike data analyzed earlier, the maximum likelihood estimate of $\theta$ is 0.0004, which suggests that at least in the context of the Weibull model, heterogeneity does not appear to be a problem.


    The estimator of the hazard rate is

    $$\hat{\lambda}(T_k) = \frac{h_k}{n_k}. \qquad \text{(22-24)}$$
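    The product limit calculations for uncensored data amount to a few lines of counting. The following is a minimal sketch with hypothetical durations (including one tie); the function name and the tuple layout are assumptions made for the example, and no censoring correction is attempted.

```python
def product_limit(durations):
    """Product limit estimates for uncensored durations.

    Returns a list of (T_k, n_k, h_k, S_hat, hazard_hat) tuples."""
    times = sorted(set(durations))
    rows = []
    survival = 1.0
    for T in times:
        n_k = sum(1 for t in durations if t >= T)  # size of the risk set
        h_k = sum(1 for t in durations if t == T)  # spells completed at T
        survival *= (n_k - h_k) / n_k
        rows.append((T, n_k, h_k, survival, h_k / n_k))
    return rows

# Hypothetical durations with one tie, so K = 3 distinct times for n = 4.
rows = product_limit([2, 3, 3, 5])
for row in rows:
    print(row)
```

    Without censoring the running product telescopes, so each survival estimate is simply the share of the original sample that lasts beyond the given time.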

    Corrections are necessary for observations that are censored. Lawless (1982), Kalbfleisch and Prentice (1980), Kiefer (1988), and Greene (1995a) give details. Susin (2001) points out a fundamental ambiguity in this calculation (one which he argues appears in the 1958 source). The estimator in (22-24) is not a "rate" as such, as the width of the time window is undefined, and could be very different at different points in the chain of calculations. Since many intervals, particularly those late in the observation period, might have zeros, the failure to acknowledge these intervals should impart an upward bias to the estimator. His proposed alternative computes the counterpart to (22-24) over a mesh of defined intervals as follows:

    $$\hat{\lambda}(I_a^b) = \frac{\sum_{j=a}^{b} h_j}{\sum_{j=a}^{b} n_j b_j},$$

    where the interval is from $t = a$ to $t = b$, $h_j$ is the number of failures in each period in this interval, $n_j$ is the number of individuals at risk in that period, and $b_j$ is the width of the period. Thus, an interval $[a, b)$ is likely to include several "periods."

    Cox's (1972) approach to the proportional hazard model is another popular, semiparametric method of analyzing the effect of covariates on the hazard rate. The model specifies that

    $$\lambda(t_i) = \exp(\mathbf{x}_i'\boldsymbol{\beta})\, \lambda_0(t_i).$$

    The function $\lambda_0$ is the "baseline" hazard, which is the individual heterogeneity. In principle, this hazard is a parameter for each observation that must be estimated. Cox's partial likelihood estimator provides a method of estimating $\boldsymbol{\beta}$ without requiring estimation of $\lambda_0$. The estimator is somewhat similar to Chamberlain's estimator for the logit model with panel data in that a conditioning operation is used to remove the heterogeneity. (See Section 21.5.1.b.) Suppose that the sample contains $K$ distinct exit times, $T_1, \ldots, T_K$. For any time $T_k$, the risk set, denoted $R_k$, is all individuals whose exit time is at least $T_k$. The risk set is defined with respect to any moment in time $T$ as the set of individuals who have not yet exited just prior to that time. For every individual $i$ in risk set $R_k$, $t_i \ge T_k$. The probability that an individual exits at time $T_k$ given that exactly one individual exits at this time (which is the counterpart to the conditioning in the binary logit model in Chapter 21) is

    $$\mathrm{Prob}[t_i = T_k \mid \text{risk set}_k] = \frac{e^{\mathbf{x}_i'\boldsymbol{\beta}}}{\sum_{j \in R_k} e^{\mathbf{x}_j'\boldsymbol{\beta}}}.$$

    Thus, the conditioning sweeps out the baseline hazard functions. For the simplest case in which exactly one individual exits at each distinct exit time and there are no censored observations, the partial log-likelihood is

    $$\ln L = \sum_{k=1}^{K} \left[\mathbf{x}_k'\boldsymbol{\beta} - \ln \sum_{j \in R_k} e^{\mathbf{x}_j'\boldsymbol{\beta}}\right].$$
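    The partial log-likelihood for this simplest case can be sketched as follows. The data are hypothetical, there is a single covariate, and the function name is invented for the example; ties and censoring are deliberately left out. At β = 0 every member of a risk set is equally likely to exit, so the value reduces to minus the log of n factorial, which gives a simple check.

```python
import math

def partial_loglik(beta, exit_times, x):
    """Partial log-likelihood for one covariate, no ties, no censoring."""
    total = 0.0
    for k, T_k in enumerate(exit_times):
        # Risk set R_k: everyone whose exit time is at least T_k.
        risk_set = [j for j, t in enumerate(exit_times) if t >= T_k]
        denom = sum(math.exp(beta * x[j]) for j in risk_set)
        total += beta * x[k] - math.log(denom)
    return total

# Hypothetical sample: four individuals with distinct exit times.
exit_times = [3.0, 8.0, 12.0, 20.0]
x = [0.9, 0.1, 0.4, -0.5]

ll_zero = partial_loglik(0.0, exit_times, x)
ll_one = partial_loglik(1.0, exit_times, x)
print(ll_zero, ll_one)
```

    In this made-up sample the individuals with larger x tend to exit earlier, so a positive β fits better than β = 0. Note that the baseline hazard never enters the calculation.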


    TABLE 22.10 Estimated Duration Models (Estimated Standard Errors in Parentheses)

                   $\lambda$            $p$                  Median Duration
    Exponential    0.02344 (0.00298)    1.00000 (0.00000)    29.571 (3.522)
    Weibull        0.02439 (0.00354)    0.92083 (0.11086)    27.543 (3.997)
    Loglogistic    0.04153 (0.00707)    1.33148 (0.17201)    24.079 (4.102)
    Lognormal      0.04514 (0.00806)    0.77206 (0.08865)    22.152 (3.954)

    If $m_k$ individuals exit at time $T_k$, then the contribution to the log-likelihood is the sum of the terms for each of these individuals.

    The proportional hazard model is a common choice for modeling durations because it is a reasonable compromise between the Kaplan-Meier estimator and the possibly excessively structured parametric models. Hausman and Han (1990) and Meyer (1988), among others, have devised other, "semiparametric" specifications for hazard models.

    Example 22.10 Survival Models for Strike Duration

    The strike duration data given in Kennan (1985, pp. 14-16) have become a familiar standard for the demonstration of hazard models. Appendix Table F22.1 lists the durations in days of 62 strikes that commenced in June of the years 1968 to 1976. Each involved at least 1,000 workers and began at the expiration or reopening of a contract. Kennan reported the actual duration. In his survey, Kiefer, using the same observations, censored the data at 80 days to demonstrate the effects of censoring. We have kept the data in their original form; the interested reader is referred to Kiefer for further analysis of the censoring problem. [43]

    Parameter estimates for the four duration models are given in Table 22.10. The estimate of the median of the survival distribution is obtained by solving the equation $S(t) = 0.5$. For example, for the Weibull model, $S(M) = 0.5 = \exp[-(\lambda M)^p]$, or $M = (\ln 2)^{1/p}/\lambda$. For the exponential model, $p = 1$. For the lognormal and loglogistic models, $M = 1/\lambda$. The delta method is then used to estimate the standard error of this function of the parameter estimates. (See Section 5.2.4.) All these distributions are skewed to the right. As such, $E[t]$ is greater than the median. For the exponential and Weibull models, $E[t] = (1/\lambda)\,\Gamma[(1/p) + 1]$; for the lognormal, $E[t] = (1/\lambda)[\exp(1/p^2)]^{1/2}$. The implied hazard functions are shown in Figure 22.4.

    The variable $x$ reported with the strike duration data is a measure of unanticipated aggregate industrial production net of seasonal and trend components. It is computed as the residual in a regression of the log of industrial production in manufacturing on time, time squared, and monthly dummy variables. With the industrial production variable included as a covariate, the estimated Weibull model is

    $$\ln \lambda = -3.7772 + 9.3515x, \quad p = 1.00288,$$

    with estimated standard errors of 0.1394, 2.973, and 0.1217, respectively; median strike length = 27.35 (3.667) days, $E[t] = 39.83$ days.

    Note that the Weibull model is now almost identical to the exponential model ($p = 1$). Since the hazard conditioned on $x$ is approximately equal to $\lambda_i$, it follows that the hazard function is increasing in ("unexpected") industrial production. A one percent increase in $x$ leads to a 9.35 percent increase in $\lambda$, which, since $p \approx 1$, translates into a 9.35 percent decrease in the median strike length, or about 2.6 days. (Note that $M = \ln 2/\lambda$.)
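    The median durations in Table 22.10 can be reproduced from the reported point estimates. The sketch below is an illustration, not the author's computation: the function names are invented here, and the standard errors (which require the delta method and the estimated covariance matrix, not reported in the table) are not attempted.

```python
import math

def median_duration(dist, lam, p):
    """Solve S(M) = 0.5 for the models in Table 22.10."""
    if dist in ("exponential", "weibull"):
        return math.log(2.0) ** (1.0 / p) / lam
    if dist in ("loglogistic", "lognormal"):
        return 1.0 / lam
    raise ValueError(f"unknown distribution: {dist}")

def weibull_mean(lam, p):
    """E[t] = (1/lambda) * Gamma(1/p + 1)."""
    return math.gamma(1.0 / p + 1.0) / lam

# Point estimates of (lambda, p) as reported in Table 22.10.
estimates = {
    "exponential": (0.02344, 1.0),
    "weibull": (0.02439, 0.92083),
    "loglogistic": (0.04153, 1.33148),
    "lognormal": (0.04514, 0.77206),
}
medians = {name: median_duration(name, lam, p)
           for name, (lam, p) in estimates.items()}
for name, m in medians.items():
    print(name, round(m, 3))
print("Weibull mean:", round(weibull_mean(*estimates["weibull"]), 3))
```

    The computed medians match the tabulated values up to the rounding of the published estimates, and the Weibull mean exceeds the Weibull median, as the right skewness implies.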
    [43] Our statistical results are nearly the same as Kiefer's despite the censoring.


    The proportional hazard model does not have a constant term. (The baseline hazard is an individual specific constant.) The estimate of $\beta$ is 9.0726, with an estimated standard error of 3.225. This is very similar to the estimate obtained for the Weibull model.

    22.6 SUMMARY AND CONCLUSIONS

    This chapter has examined three settings in which, in principle, the linear regression model of Chapter 2 would apply, but the data generating mechanism produces a nonlinear form. In the truncated regression model, the range of the dependent variable is restricted substantively. Certainly all economic data are restricted in this way; aggregate income data cannot be negative, for example. But when data are truncated so that plausible values of the dependent variable are precluded, for example, when zero values for expenditure are discarded, the data that remain are analyzed with models that explicitly account for the truncation. When data are censored, values of the dependent variable that could in principle be observed are masked. Ranges of values of the true variable being studied are observed as a single value. The basic problem this presents for model building is that in such a case, we observe variation of the independent variables without the corresponding variation in the dependent variable that might be expected. Finally, the issue of sample selection arises when the observed data are not drawn randomly from the population of interest. Failure to account for this nonrandom sampling produces a model that describes only the nonrandom subsample, not the larger population. In each case, we examined the model specification and estimation techniques which are appropriate for these variations of the regression model. Maximum likelihood is usually the method of choice, but for the third case, a two-step estimator has become more common. In the final section, we examined an application, models of duration, which describe variables with limited (nonnegative) ranges of variation and which are often observed subject to censoring.

    Key Terms and Concepts

    • Accelerated failure time
    • Attenuation
    • Censored regression
    • Censored variable
    • Censoring
    • Conditional moment test
    • Count data
    • Degree of truncation
    • Delta method
    • Duration dependence
    • Duration model
    • Generalized residual
    • Hazard function
    • Hazard rate
    • Heterogeneity
    • Heteroscedasticity
    • Incidental truncation
    • Integrated hazard function
    • Inverse Mills ratio
    • Lagrange multiplier test
    • Marginal effects
    • Negative duration dependence
    • Olsen's reparameterization
    • Parametric model
    • Partial likelihood
    • Positive duration dependence
    • Product limit
    • Proportional hazard
    • Risk set
    • Sample selection
    • Semiparametric model
    • Specification error
    • Survival function
    • Time varying covariate
    • Tobit model
    • Treatment effect
    • Truncated bivariate normal distribution
    • Truncated distribution
    • Truncated mean
    • Truncated random variable
    • Truncated variance
    • Two step estimation
    • Weibull model


    Exercises

    1. The following 20 observations are drawn from a censored normal distribution:

        3.8396   5.7971   0.00000   1.2526   7.2040
        7.0828   8.6801   5.6016    0.00000  0.00000
        5.4571   0.00000  0.80260   0.00000  4.4132
        13.0670  8.1021   8.0230    4.3211   0.00000

       The applicable model is

       $$y_i^* = \mu + \varepsilon_i, \qquad y_i = y_i^* \text{ if } \mu + \varepsilon_i > 0, \; 0 \text{ otherwise}, \qquad \varepsilon_i \sim N[0, \sigma^2].$$

       Exercises 1 through 4 in this section are based on the preceding information. The OLS estimator of $\mu$ in the context of this tobit model is simply the sample mean. Compute the mean of all 20 observations. Would you expect this estimator to over- or underestimate $\mu$? If we consider only the nonzero observations, then the truncated regression model applies. The sample mean of the nonlimit observations is the least squares estimator in this context. Compute it and then comment on whether this sample mean should be an overestimate or an underestimate of the true mean.

    2. We now consider the tobit model that applies to the full data set.
       a. Formulate the log-likelihood for this very simple tobit model.
       b. Reformulate the log-likelihood in terms of $\theta = 1/\sigma$ and $\gamma = \mu/\sigma$. Then derive the necessary conditions for maximizing the log-likelihood with respect to $\theta$ and $\gamma$.
       c. Discuss how you would obtain the values of $\theta$ and $\gamma$ to solve the problem in Part b.
       d. Compute the maximum likelihood estimates of $\mu$ and $\sigma$.

    3. Using only the nonlimit observations, repeat Exercise 2 in the context of the truncated regression model. Estimate $\mu$ and $\sigma$ by using the method of moments estimator outlined in Example 22.2. Compare your results with those in the previous exercises.

    4. Continuing to use the data in Exercise 1, consider once again only the nonzero observations. Suppose that the sampling mechanism is as follows: $y^*$ and another normally distributed random variable $z$ have population correlation 0.7. The two variables, $y^*$ and $z$, are sampled jointly. When $z$ is greater than zero, $y^*$ is reported. When $z$ is less than zero, both $z$ and $y^*$ are discarded. Exactly 35 draws were required to obtain the preceding sample. Estimate $\mu$ and $\sigma$. [Hint: Use Theorem 22.5.]

    5. Derive the marginal effects for the tobit model with heteroscedasticity that is described in Section 22.3.4.a.

    6. Prove that the Hessian for the tobit model in (22-14) is negative definite after Olsen's transformation is applied to the parameters.


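    APPENDIX A

    MATRIX ALGEBRA

    A.1 TERMINOLOGY

    A matrix is a rectangular array of numbers, denoted

    $$\mathbf{A} = [a_{ik}] = [\mathbf{A}]_{ik} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1K} \\ a_{21} & a_{22} & \cdots & a_{2K} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nK} \end{bmatrix}. \qquad \text{(A-1)}$$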

    The typical element is used to denote the matrix. A subscripted element of a matrix is always read as $a_{\text{row},\text{column}}$. An example is given in Table A.1. In these data, the rows are identified with years and the columns with particular variables.

    A vector is an ordered set of numbers arranged either in a row or a column. In view of the preceding, a row vector is also a matrix with one row, whereas a column vector is a matrix with one column. Thus, in Table A.1, the five variables observed for 1972 (including the date) constitute a row vector, whereas the time series of nine values for consumption is a column vector.

    A matrix can also be viewed as a set of column vectors or as a set of row vectors. [1] The dimensions of a matrix are the numbers of rows and columns it contains. "$\mathbf{A}$ is an $n \times K$ matrix" (read "$n$ by $K$") will always mean that $\mathbf{A}$ has $n$ rows and $K$ columns. If $n$ equals $K$, then $\mathbf{A}$ is a square matrix. Several particular types of square matrices occur frequently in econometrics.

    • A symmetric matrix is one in which $a_{ik} = a_{ki}$ for all $i$ and $k$.
    • A diagonal matrix is a square matrix whose only nonzero elements appear on the main diagonal, that is, moving from upper left to lower right.
    • A scalar matrix is a diagonal matrix with the same value in all diagonal elements.
    • An identity matrix is a scalar matrix with ones on the diagonal. This matrix is always denoted $\mathbf{I}$. A subscript is sometimes included to indicate its size, or order. For example, $\mathbf{I}_4$ indicates a $4 \times 4$ identity matrix.
    • A triangular matrix is one that has only zeros either above or below the main diagonal. If the zeros are above the diagonal, the matrix is lower triangular.

    A.2 ALGEBRAIC MANIPULATION OF MATRICES

    A.2.1 EQUALITY OF MATRICES

    Matrices (or vectors) $\mathbf{A}$ and $\mathbf{B}$ are equal if and only if they have the same dimensions and each element of $\mathbf{A}$ equals the corresponding element of $\mathbf{B}$. That is,

    $$\mathbf{A} = \mathbf{B} \text{ if and only if } a_{ik} = b_{ik} \text{ for all } i \text{ and } k. \qquad \text{(A-2)}$$

    [1] Henceforth, we shall denote a matrix by a boldfaced capital letter, as is $\mathbf{A}$ in (A-1), and a vector as a boldfaced lowercase letter, as in $\mathbf{a}$. Unless otherwise noted, a vector will always be assumed to be a column vector.


    TABLE A.1 Matrix of Macroeconomic Data

                                        Column
           1       2 Consumption           3 GNP                   4 GNP       5 Discount Rate
    Row    Year    (billions of dollars)   (billions of dollars)   Deflator    (N.Y. Fed., avg.)
    1      1972     737.1                  1185.9                  1.0000       4.50
    2      1973     812.0                  1326.4                  1.0575       6.44
    3      1974     808.1                  1434.2                  1.1508       7.83
    4      1975     976.4                  1549.2                  1.2579       6.25
    5      1976    1084.3                  1718.0                  1.3234       5.50
    6      1977    1204.4                  1918.3                  1.4005       5.46
    7      1978    1346.5                  2163.9                  1.5042       7.46
    8      1979    1507.2                  2417.8                  1.6342      10.28
    9      1980    1667.2                  2633.1                  1.7864      11.77

    Source: Data from the Economic Report of the President (Washington, D.C.: U.S. Government Printing Office, 1983).

    A.2.2 TRANSPOSITION

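    The transpose of a matrix $\mathbf{A}$, denoted $\mathbf{A}'$, is obtained by creating the matrix whose $k$th row is the $k$th column of the original matrix. Thus, if $\mathbf{B} = \mathbf{A}'$, then each column of $\mathbf{A}$ will appear as the corresponding row of $\mathbf{B}$. If $\mathbf{A}$ is $n \times K$, then $\mathbf{A}'$ is $K \times n$. An equivalent definition of the transpose of a matrix is

    $$\mathbf{B} = \mathbf{A}' \Leftrightarrow b_{ik} = a_{ki} \text{ for all } i \text{ and } k. \qquad \text{(A-3)}$$

    The definition of a symmetric matrix implies that

    $$\text{if (and only if) } \mathbf{A} \text{ is symmetric, then } \mathbf{A} = \mathbf{A}'. \qquad \text{(A-4)}$$

    It also follows from the definition that for any $\mathbf{A}$,

    $$(\mathbf{A}')' = \mathbf{A}. \qquad \text{(A-5)}$$

    Finally, the transpose of a column vector, $\mathbf{a}$, is a row vector:

    $$\mathbf{a}' = [a_1 \; a_2 \; \cdots \; a_n].$$

    A.2.3 MATRIX ADDITION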

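    The operations of addition and subtraction are extended to matrices by defining

    $$\mathbf{C} = \mathbf{A} + \mathbf{B} = [a_{ik} + b_{ik}], \qquad \text{(A-6)}$$

    $$\mathbf{A} - \mathbf{B} = [a_{ik} - b_{ik}]. \qquad \text{(A-7)}$$

    Matrices cannot be added unless they have the same dimensions, in which case they are said to be conformable for addition. A zero matrix or null matrix is one whose elements are all zero. In the addition of matrices, the zero matrix plays the same role as the scalar 0 in scalar addition; that is,

    $$\mathbf{A} + \mathbf{0} = \mathbf{A}. \qquad \text{(A-8)}$$

    It follows from (A-6) that matrix addition is commutative,

    $$\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}, \qquad \text{(A-9)}$$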


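    and associative,

    $$(\mathbf{A} + \mathbf{B}) + \mathbf{C} = \mathbf{A} + (\mathbf{B} + \mathbf{C}), \qquad \text{(A-10)}$$

    and that

    $$(\mathbf{A} + \mathbf{B})' = \mathbf{A}' + \mathbf{B}'. \qquad \text{(A-11)}$$

    A.2.4 VECTOR MULTIPLICATION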

    Matrices are multiplied by using the inner product. The inner product, or dot product, of two vectors, $\mathbf{a}$ and $\mathbf{b}$, is a scalar and is written

    $$\mathbf{a}'\mathbf{b} = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n. \qquad \text{(A-12)}$$

    Note that the inner product is written as the transpose of vector $\mathbf{a}$ times vector $\mathbf{b}$, a row vector times a column vector. In (A-12), each term $a_j b_j$ equals $b_j a_j$; hence

    $$\mathbf{a}'\mathbf{b} = \mathbf{b}'\mathbf{a}. \qquad \text{(A-13)}$$

    A.2.5 A NOTATION FOR ROWS AND COLUMNS OF A MATRIX

    We need a notation for the $i$th row of a matrix. Throughout this book, an untransposed vector will always be a column vector. However, we will often require a notation for the column vector that is the transpose of a row of a matrix. This has the potential to create some ambiguity, but the following convention based on the subscripts will suffice for our work throughout this text:

    • $\mathbf{a}_k$, or $\mathbf{a}_l$ or $\mathbf{a}_m$, will denote column $k$, $l$, or $m$ of the matrix $\mathbf{A}$;
    • $\mathbf{a}_i$, or $\mathbf{a}_j$ or $\mathbf{a}_t$ or $\mathbf{a}_s$, will denote the column vector formed by the transpose of row $i$, $j$, $t$, or $s$ of matrix $\mathbf{A}$. Thus, $\mathbf{a}_i'$ is row $i$ of $\mathbf{A}$. (A-14)

    For example, from the data in Table A.1 it might be convenient to speak of $\mathbf{x}_i = 1972$ as the $5 \times 1$ vector containing the five variables measured for the year 1972, that is, the transpose of the 1972 row of the matrix. In our applications, the common association of subscripts "$i$" and "$j$" with individual $i$ or $j$, and "$t$" and "$s$" with time periods $t$ and $s$, will be natural.

    A.2.6 MATRIX MULTIPLICATION AND SCALAR MULTIPLICATION

    For an $n \times K$ matrix $\mathbf{A}$ and a $K \times M$ matrix $\mathbf{B}$, the product matrix, $\mathbf{C} = \mathbf{AB}$, is an $n \times M$ matrix whose $ik$th element is the inner product of row $i$ of $\mathbf{A}$ and column $k$ of $\mathbf{B}$. Thus, the product matrix $\mathbf{C}$ is

    $$\mathbf{C} = \mathbf{AB} \Rightarrow c_{ik} = \mathbf{a}_i'\mathbf{b}_k. \qquad \text{(A-15)}$$

    [Note our use of (A-14) in (A-15).] To multiply two matrices, the number of columns in the first must be the same as the number of rows in the second, in which case they are conformable for multiplication. [2] Multiplication of matrices is generally not commutative. In some cases, $\mathbf{AB}$ may exist, but $\mathbf{BA}$ may be undefined or, if it does exist, may have different dimensions. In general, however, even if $\mathbf{AB}$ and $\mathbf{BA}$ do have the same dimensions, they will not be equal. In view of this, we define premultiplication and postmultiplication of matrices. In the product $\mathbf{AB}$, $\mathbf{B}$ is premultiplied by $\mathbf{A}$, whereas $\mathbf{A}$ is postmultiplied by $\mathbf{B}$.

    [2] A simple way to check the conformability of two matrices for multiplication is to write down the dimensions of the operation, for example, $(n \times K)$ times $(K \times M)$. The inner dimensions must be equal; the result has dimensions equal to the outer values.


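    Scalar multiplication of a matrix is the operation of multiplying every element of the matrix by a given scalar. For scalar $c$ and matrix $\mathbf{A}$,

    $$c\mathbf{A} = [c\,a_{ik}]. \qquad \text{(A-16)}$$

    The product of a matrix and a vector is written $\mathbf{c} = \mathbf{Ab}$. The number of elements in $\mathbf{b}$ must equal the number of columns in $\mathbf{A}$; the result is a vector with a number of elements equal to the number of rows in $\mathbf{A}$. For example,

    $$\begin{bmatrix} 5 \\ 4 \\ 1 \end{bmatrix} = \begin{bmatrix} 4 & 2 & 1 \\ 2 & 6 & 1 \\ 1 & 1 & 0 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix}.$$

    We can interpret this in two ways. First, it is a compact way of writing the three equations

    $$5 = 4a + 2b + 1c, \qquad 4 = 2a + 6b + 1c, \qquad 1 = 1a + 1b + 0c.$$

    Second, by writing the set of equations as

    $$\begin{bmatrix} 5 \\ 4 \\ 1 \end{bmatrix} = a \begin{bmatrix} 4 \\ 2 \\ 1 \end{bmatrix} + b \begin{bmatrix} 2 \\ 6 \\ 1 \end{bmatrix} + c \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix},$$

    we see that the right-hand side is a linear combination of the columns of the matrix where the coefficients are the elements of the vector. For the general case,

    $$\mathbf{c} = \mathbf{Ab} = b_1 \mathbf{a}_1 + b_2 \mathbf{a}_2 + \cdots + b_K \mathbf{a}_K. \qquad \text{(A-17)}$$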
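    The two readings of a matrix-vector product can be compared directly. The following minimal sketch uses the 3 × 3 matrix from the example above with hypothetical values for the elements a, b, and c of the vector; the function names are invented for the illustration, and plain Python lists stand in for matrices.

```python
def matvec(A, b):
    """Row-by-row view: element i is the inner product of row i of A with b."""
    return [sum(a_ik * b_k for a_ik, b_k in zip(row, b)) for row in A]

def column_combination(A, b):
    """Column view, as in (A-17): b_1 a_1 + b_2 a_2 + ... + b_K a_K."""
    n = len(A)
    result = [0] * n
    for k, b_k in enumerate(b):
        column = [A[i][k] for i in range(n)]
        result = [r + b_k * a for r, a in zip(result, column)]
    return result

A = [[4, 2, 1],
     [2, 6, 1],
     [1, 1, 0]]
b = [1, 2, 3]  # hypothetical values for a, b, and c

print(matvec(A, b))
print(column_combination(A, b))
```

    Both functions return the same vector, which is the content of (A-17).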







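    In the calculation of a matrix product $\mathbf{C} = \mathbf{AB}$, each column of $\mathbf{C}$ is a linear combination of the columns of $\mathbf{A}$, where the coefficients are the elements in the corresponding column of $\mathbf{B}$. That is,
    $$\mathbf{C} = \mathbf{AB} \iff \mathbf{c}_k = \mathbf{A}\mathbf{b}_k. \tag{A-18}$$

    Let $\mathbf{e}_k$ be a column vector that has zeros everywhere except for a one in the $k$th position. Then $\mathbf{A}\mathbf{e}_k$ is a linear combination of the columns of $\mathbf{A}$ in which the coefficient on every column but the $k$th is zero, whereas that on the $k$th is one. The result is
    $$\mathbf{a}_k = \mathbf{A}\mathbf{e}_k. \tag{A-19}$$
    Combining this result with (A-17) produces
    $$[\mathbf{a}_1 \;\; \mathbf{a}_2 \;\; \cdots \;\; \mathbf{a}_n] = \mathbf{A}[\mathbf{e}_1 \;\; \mathbf{e}_2 \;\; \cdots \;\; \mathbf{e}_n] = \mathbf{AI} = \mathbf{A}. \tag{A-20}$$

    In matrix multiplication, the identity matrix is analogous to the scalar 1. For any matrix or vector $\mathbf{A}$, $\mathbf{AI} = \mathbf{A}$. In addition, $\mathbf{IA} = \mathbf{A}$, although if $\mathbf{A}$ is not a square matrix, the two identity matrices are of different orders. A conformable matrix of zeros produces the expected result: $\mathbf{A0} = \mathbf{0}$. Some general rules for matrix multiplication are as follows: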



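    Associative law: $(\mathbf{AB})\mathbf{C} = \mathbf{A}(\mathbf{BC})$. (A-21)
    Distributive law: $\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{AB} + \mathbf{AC}$. (A-22)
    Transpose of a product: $(\mathbf{AB})' = \mathbf{B}'\mathbf{A}'$. (A-23)
    Transpose of an extended product: $(\mathbf{ABC})' = \mathbf{C}'\mathbf{B}'\mathbf{A}'$. (A-24)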
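    A.2.7 SUMS OF VALUES

    Denote by $\mathbf{i}$ a vector that contains a column of ones. Then,
    $$\sum_{i=1}^{n} x_i = x_1 + x_2 + \cdots + x_n = \mathbf{i}'\mathbf{x}. \tag{A-25}$$
    If all elements in $\mathbf{x}$ are equal to the same constant $a$, then $\mathbf{x} = a\mathbf{i}$ and
    $$\sum_{i=1}^{n} x_i = \mathbf{i}'(a\mathbf{i}) = a(\mathbf{i}'\mathbf{i}) = na. \tag{A-26}$$
    For any constant $a$ and vector $\mathbf{x}$,
    $$\sum_{i=1}^{n} a x_i = a \sum_{i=1}^{n} x_i = a\,\mathbf{i}'\mathbf{x}. \tag{A-27}$$
    If $a = 1/n$, then we obtain the arithmetic mean,
    $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\mathbf{i}'\mathbf{x}, \tag{A-28}$$
    from which it follows that
    $$\sum_{i=1}^{n} x_i = \mathbf{i}'\mathbf{x} = n\bar{x}.$$
    The sum of squares of the elements in a vector $\mathbf{x}$ is
    $$\sum_{i=1}^{n} x_i^2 = \mathbf{x}'\mathbf{x}, \tag{A-29}$$
    while the sum of the products of the $n$ elements in vectors $\mathbf{x}$ and $\mathbf{y}$ is
    $$\sum_{i=1}^{n} x_i y_i = \mathbf{x}'\mathbf{y}. \tag{A-30}$$
    By the definition of matrix multiplication,
    $$[\mathbf{X}'\mathbf{X}]_{kl} = [\mathbf{x}_k'\mathbf{x}_l] \tag{A-31}$$
    is the inner product of the $k$th and $l$th columns of $\mathbf{X}$. For example, for the data set given in Table A.1, if we define $\mathbf{X}$ as the $9 \times 3$ matrix containing (year, consumption, GNP), then
    $$[\mathbf{X}'\mathbf{X}]_{23} = \sum_{t=1972}^{1980} \text{consumption}_t\,\text{GNP}_t = 737.1(1185.9) + \cdots + 1667.2(2633.1) = 19{,}743{,}711.34.$$
    If $\mathbf{X}$ is $n \times K$, then [again using (A-14)]
    $$\mathbf{X}'\mathbf{X} = \sum_{i=1}^{n} \mathbf{x}_i\mathbf{x}_i'.$$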


    This form shows that the $K \times K$ matrix $\mathbf{X}'\mathbf{X}$ is the sum of $n$ $K \times K$ matrices, each formed from a single row (year) of $\mathbf{X}$. For the example given earlier, this sum is of nine $3 \times 3$ matrices, each formed from one row (year) of the original data matrix.
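    A short Python sketch may help fix the two ways of forming $\mathbf{X}'\mathbf{X}$. The data below are hypothetical (they are not the Table A.1 values), and the helper names are assumptions made for the illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def outer(u, v):
    return [[a * b for b in v] for a in u]

def add(P, Q):
    return [[p + q for p, q in zip(pr, qr)] for pr, qr in zip(P, Q)]

# Sums written as inner products with a vector of ones, as in (A-25) to (A-29).
x = [1, 2, 3, 4]                 # hypothetical data
ones = [1] * len(x)
total = dot(ones, x)             # i'x
mean = total / len(x)            # (1/n) i'x
sum_sq = dot(x, x)               # x'x

# X'X computed two ways for a hypothetical 3 x 2 data matrix.
X = [[1, 2], [3, 4], [5, 6]]
K = len(X[0])
columns = [[row[k] for row in X] for k in range(K)]
XtX_inner = [[dot(columns[k], columns[l]) for l in range(K)] for k in range(K)]

XtX_outer = [[0] * K for _ in range(K)]
for row in X:                    # one K x K matrix per row of X
    XtX_outer = add(XtX_outer, outer(row, row))

print(XtX_inner == XtX_outer)
```

    The first construction follows (A-31), element by element; the second accumulates one outer product per observation, which is the form that the updating results later in the appendix rely on.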
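    A.2.8 A USEFUL IDEMPOTENT MATRIX

    A fundamental matrix in statistics is the one that is used to transform data to deviations from their mean. First,
    $$\mathbf{i}\bar{x} = \begin{bmatrix} \bar{x} \\ \bar{x} \\ \vdots \\ \bar{x} \end{bmatrix} = \mathbf{i}\,\frac{1}{n}\,\mathbf{i}'\mathbf{x} = \frac{1}{n}\mathbf{i}\mathbf{i}'\mathbf{x}. \tag{A-32}$$
    The matrix $(1/n)\mathbf{i}\mathbf{i}'$ is an $n \times n$ matrix with every element equal to $1/n$. The set of values in deviations form is
    $$\begin{bmatrix} x_1 - \bar{x} \\ x_2 - \bar{x} \\ \vdots \\ x_n - \bar{x} \end{bmatrix} = [\mathbf{x} - \mathbf{i}\bar{x}] = \left[\mathbf{x} - \frac{1}{n}\mathbf{i}\mathbf{i}'\mathbf{x}\right]. \tag{A-33}$$
    Since $\mathbf{x} = \mathbf{Ix}$,
    $$\left[\mathbf{x} - \frac{1}{n}\mathbf{i}\mathbf{i}'\mathbf{x}\right] = \left[\mathbf{Ix} - \frac{1}{n}\mathbf{i}\mathbf{i}'\mathbf{x}\right] = \left[\mathbf{I} - \frac{1}{n}\mathbf{i}\mathbf{i}'\right]\mathbf{x} = \mathbf{M}^0\mathbf{x}. \tag{A-34}$$

    Henceforth, the symbol $\mathbf{M}^0$ will be used only for this matrix. Its diagonal elements are all $(1 - 1/n)$, and its off-diagonal elements are $-1/n$. The matrix $\mathbf{M}^0$ is primarily useful in computing sums of squared deviations. Some computations are simplified by the result
    $$\mathbf{M}^0\mathbf{i} = \left[\mathbf{I} - \frac{1}{n}\mathbf{i}\mathbf{i}'\right]\mathbf{i} = \mathbf{i} - \frac{1}{n}\mathbf{i}(\mathbf{i}'\mathbf{i}) = \mathbf{0},$$
    which implies that $\mathbf{i}'\mathbf{M}^0 = \mathbf{0}'$. The sum of deviations about the mean is then
    $$\sum_{i=1}^{n} (x_i - \bar{x}) = \mathbf{i}'[\mathbf{M}^0\mathbf{x}] = \mathbf{0}'\mathbf{x} = 0. \tag{A-35}$$
    For a single variable $x$, the sum of squared deviations about the mean is
    $$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \left(\sum_{i=1}^{n} x_i^2\right) - n\bar{x}^2. \tag{A-36}$$
    In matrix terms,
    $$\sum_{i=1}^{n} (x_i - \bar{x})^2 = (\mathbf{x} - \bar{x}\mathbf{i})'(\mathbf{x} - \bar{x}\mathbf{i}) = (\mathbf{M}^0\mathbf{x})'(\mathbf{M}^0\mathbf{x}) = \mathbf{x}'\mathbf{M}^{0\prime}\mathbf{M}^0\mathbf{x}.$$

    Two properties of $\mathbf{M}^0$ are useful at this point. First, since all off-diagonal elements of $\mathbf{M}^0$ equal $-1/n$, $\mathbf{M}^0$ is symmetric. Second, as can easily be verified by multiplication, $\mathbf{M}^0$ is equal to its square; $\mathbf{M}^0\mathbf{M}^0 = \mathbf{M}^0$.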


    DEFINITION A.1 Idempotent Matrix
    An idempotent matrix, $\mathbf{M}$, is one that is equal to its square, that is, $\mathbf{M}^2 = \mathbf{MM} = \mathbf{M}$. If $\mathbf{M}$ is a symmetric idempotent matrix (all of the idempotent matrices we shall encounter are symmetric), then $\mathbf{M}'\mathbf{M} = \mathbf{M}$.

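    Thus, $\mathbf{M}^0$ is a symmetric idempotent matrix. Combining results, we obtain
    $$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \mathbf{x}'\mathbf{M}^0\mathbf{x}. \tag{A-37}$$

    Consider constructing a matrix of sums of squares and cross products in deviations from the column means. For two vectors $\mathbf{x}$ and $\mathbf{y}$,
    $$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = (\mathbf{M}^0\mathbf{x})'(\mathbf{M}^0\mathbf{y}), \tag{A-38}$$
    so
    $$\begin{bmatrix} \sum_{i=1}^{n} (x_i - \bar{x})^2 & \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\ \sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x}) & \sum_{i=1}^{n} (y_i - \bar{y})^2 \end{bmatrix} = \begin{bmatrix} \mathbf{x}'\mathbf{M}^0\mathbf{x} & \mathbf{x}'\mathbf{M}^0\mathbf{y} \\ \mathbf{y}'\mathbf{M}^0\mathbf{x} & \mathbf{y}'\mathbf{M}^0\mathbf{y} \end{bmatrix}. \tag{A-39}$$

    If we put the two column vectors $\mathbf{x}$ and $\mathbf{y}$ in an $n \times 2$ matrix $\mathbf{Z} = [\mathbf{x}, \mathbf{y}]$, then $\mathbf{M}^0\mathbf{Z}$ is the $n \times 2$ matrix in which the two columns of data are in mean deviation form. Then
    $$(\mathbf{M}^0\mathbf{Z})'(\mathbf{M}^0\mathbf{Z}) = \mathbf{Z}'\mathbf{M}^0\mathbf{M}^0\mathbf{Z} = \mathbf{Z}'\mathbf{M}^0\mathbf{Z}.$$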
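    The properties of $\mathbf{M}^0$ are easy to check in exact arithmetic. The following is a minimal sketch using Python's `fractions` module; the data vector and the helper names are hypothetical choices for the illustration.

```python
from fractions import Fraction

def centering_matrix(n):
    """M0 = I - (1/n) i i': diagonal elements 1 - 1/n, off-diagonal -1/n."""
    return [[(1 if r == c else 0) - Fraction(1, n) for c in range(n)]
            for r in range(n)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(P, Q):
    return [[sum(P[i][j] * Q[j][l] for j in range(len(Q)))
             for l in range(len(Q[0]))] for i in range(len(P))]

x = [1, 2, 3, 6]                                  # hypothetical data, mean 3
M0 = centering_matrix(len(x))

deviations = matvec(M0, x)                        # M0 x
ssd = sum(a * b for a, b in zip(x, deviations))   # x'M0 x, as in (A-37)
M0_times_ones = matvec(M0, [1] * len(x))          # M0 i
is_idempotent = matmul(M0, M0) == M0
is_symmetric = all(M0[r][c] == M0[c][r]
                   for r in range(len(x)) for c in range(len(x)))

print(deviations, ssd, is_idempotent, is_symmetric)
```

    Exact rational arithmetic is used so that the checks $\mathbf{M}^0\mathbf{M}^0 = \mathbf{M}^0$ and $\mathbf{M}^0\mathbf{i} = \mathbf{0}$ hold exactly rather than up to rounding error.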

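    A.3 GEOMETRY OF MATRICES

    A.3.1 VECTOR SPACES

    The $K$ elements of a column vector
    $$\mathbf{a} = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_K \end{bmatrix}$$
    can be viewed as the coordinates of a point in a $K$-dimensional space, as shown in Figure A.1 for two dimensions, or as the definition of the line segment connecting the origin and the point defined by $\mathbf{a}$. Two basic arithmetic operations are defined for vectors, scalar multiplication and addition. A scalar multiple of a vector, $\mathbf{a}$, is another vector, say $\mathbf{a}^*$, whose coordinates are the scalar multiple of $\mathbf{a}$'s coordinates. Thus, in Figure A.1,
    $$\mathbf{a} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \quad \mathbf{a}^* = 2\mathbf{a} = \begin{bmatrix} 2 \\ 4 \end{bmatrix}, \quad \mathbf{a}^{**} = -\tfrac{1}{2}\mathbf{a} = \begin{bmatrix} -\tfrac{1}{2} \\ -1 \end{bmatrix}.$$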





    FIGURE A.1 Vector Space. (Horizontal axis: first coordinate; vertical axis: second coordinate.)

    The set of all possible scalar multiples of $\mathbf{a}$ is the line through the origin, $\mathbf{0}$, and $\mathbf{a}$. Any scalar multiple of $\mathbf{a}$ is a segment of this line. The sum of two vectors $\mathbf{a}$ and $\mathbf{b}$ is a third vector whose coordinates are the sums of the corresponding coordinates of $\mathbf{a}$ and $\mathbf{b}$. For example,
    $$\mathbf{c} = \mathbf{a} + \mathbf{b} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} + \begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ 3 \end{bmatrix}.$$

    Geometrically, $\mathbf{c}$ is obtained by moving in the distance and direction defined by $\mathbf{b}$ from the tip of $\mathbf{a}$ or, because addition is commutative, from the tip of $\mathbf{b}$ in the distance and direction of $\mathbf{a}$. The two-dimensional plane is the set of all vectors with two real-valued coordinates. We label this set $\mathbb{R}^2$ ("R two," not "R squared"). It has two important properties.

    $\mathbb{R}^2$ is closed under scalar multiplication; every scalar multiple of a vector in $\mathbb{R}^2$ is also in $\mathbb{R}^2$.
    $\mathbb{R}^2$ is closed under addition; the sum of any two vectors in the plane is always a vector in $\mathbb{R}^2$.

    DEFINITION A.2 Vector Space
    A vector space is any set of vectors that is closed under scalar multiplication and addition.

    Another example is the set of all real numbers, that is, $\mathbb{R}^1$, the set of vectors with one real element. In general, the set of $K$-element vectors all of whose elements are real numbers is a $K$-dimensional vector space, denoted $\mathbb{R}^K$. The preceding examples are drawn in $\mathbb{R}^2$.

    A.3.2 LINEAR COMBINATIONS OF VECTORS AND BASIS VECTORS

    In Figure A.1, $\mathbf{c} = \mathbf{a} + \mathbf{b}$ and $\mathbf{d} = \mathbf{a}^* + \mathbf{b}$. But since $\mathbf{a}^* = 2\mathbf{a}$, $\mathbf{d} = 2\mathbf{a} + \mathbf{b}$. Also, $\mathbf{e} = \mathbf{a} + 2\mathbf{b}$ and $\mathbf{f} = \mathbf{b} + (-\mathbf{a}) = \mathbf{b} - \mathbf{a}$. As this exercise suggests, any vector in $\mathbb{R}^2$ could be obtained as a linear combination of $\mathbf{a}$ and $\mathbf{b}$.

    DEFINITION A.3 Basis Vectors
    A set of vectors in a vector space is a basis for that vector space if any vector in the vector space can be written as a linear combination of that set of vectors.

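    As is suggested by Figure A.1, any pair of two-element vectors, including $\mathbf{a}$ and $\mathbf{b}$, that point in different directions will form a basis for $\mathbb{R}^2$. Consider an arbitrary set of vectors in $\mathbb{R}^2$, $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{c}$. If $\mathbf{a}$ and $\mathbf{b}$ are a basis, then we can find numbers $\alpha_1$ and $\alpha_2$ such that $\mathbf{c} = \alpha_1\mathbf{a} + \alpha_2\mathbf{b}$. Let
    $$\mathbf{a} = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}, \quad \mathbf{c} = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}.$$
    Then
    $$c_1 = \alpha_1 a_1 + \alpha_2 b_1, \qquad c_2 = \alpha_1 a_2 + \alpha_2 b_2. \tag{A-40}$$
    The solutions to this pair of equations are
    $$\alpha_1 = \frac{b_2 c_1 - b_1 c_2}{a_1 b_2 - b_1 a_2}, \qquad \alpha_2 = \frac{a_1 c_2 - a_2 c_1}{a_1 b_2 - b_1 a_2}. \tag{A-41}$$

    This result gives a unique solution unless $(a_1 b_2 - b_1 a_2) = 0$. If $(a_1 b_2 - b_1 a_2) = 0$, then $a_1/a_2 = b_1/b_2$, which means that $\mathbf{b}$ is just a multiple of $\mathbf{a}$. This returns us to our original condition, that $\mathbf{a}$ and $\mathbf{b}$ must point in different directions. The implication is that if $\mathbf{a}$ and $\mathbf{b}$ are any pair of vectors for which the denominator in (A-41) is not zero, then any other vector $\mathbf{c}$ can be formed as a unique linear combination of $\mathbf{a}$ and $\mathbf{b}$. The basis of a vector space is not unique, since any set of vectors that satisfies the definition will do. But for any particular basis, only one linear combination of them will produce another particular vector in the vector space.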
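    The solution in (A-41) translates directly into code. The sketch below is illustrative only; the function name `basis_coefficients` is an assumption, and the vectors are the ones used in the discussion of Figure A.1.

```python
from fractions import Fraction

def basis_coefficients(a, b, c):
    """Return (alpha1, alpha2) with c = alpha1*a + alpha2*b, following (A-41)."""
    denom = a[0] * b[1] - b[0] * a[1]
    if denom == 0:
        # a and b point in the same direction, so they are not a basis for R2
        raise ValueError("a and b are linearly dependent")
    alpha1 = Fraction(b[1] * c[0] - b[0] * c[1], denom)
    alpha2 = Fraction(a[0] * c[1] - a[1] * c[0], denom)
    return alpha1, alpha2

a, b = (1, 2), (2, 1)
coef_c = basis_coefficients(a, b, (3, 3))   # c = a + b
coef_d = basis_coefficients(a, b, (4, 5))   # d = 2a + b
print(coef_c, coef_d)
```

    Calling the function with $\mathbf{a}$ and $\mathbf{a}^* = 2\mathbf{a}$ as the candidate basis raises an error, because the denominator in (A-41) is zero in that case.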
    A.3.3 LINEAR DEPENDENCE

    As the preceding should suggest, $K$ vectors are required to form a basis for $\mathbb{R}^K$. Although the basis for a vector space is not unique, not every set of $K$ vectors will suffice. In Figure A.2, $\mathbf{a}$ and $\mathbf{b}$ form a basis for $\mathbb{R}^2$, but $\mathbf{a}$ and $\mathbf{a}^*$ do not. The difference between these two pairs is that $\mathbf{a}$ and $\mathbf{b}$ are linearly independent, whereas $\mathbf{a}$ and $\mathbf{a}^*$ are linearly dependent.

    DEFINITION A.4 Linear Dependence
    A set of vectors is linearly dependent if any one of the vectors in the set can be written as a linear combination of the others.

    FIGURE A.2 Linear Combinations of Vectors. (Horizontal axis: first coordinate; vertical axis: second coordinate.)

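    Since $\mathbf{a}^*$ is a multiple of $\mathbf{a}$, $\mathbf{a}$ and $\mathbf{a}^*$ are linearly dependent. For another example, if
    $$\mathbf{a} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} 3 \\ 3 \end{bmatrix}, \quad \text{and} \quad \mathbf{c} = \begin{bmatrix} 10 \\ 14 \end{bmatrix},$$
    then
    $$2\mathbf{a} + \mathbf{b} - \tfrac{1}{2}\mathbf{c} = \mathbf{0},$$
    so $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{c}$ are linearly dependent. Any of the three possible pairs of them, however, are linearly independent.

    DEFINITION A.5 Linear Independence
    A set of vectors is linearly independent if and only if the only solution to $\alpha_1\mathbf{a}_1 + \alpha_2\mathbf{a}_2 + \cdots + \alpha_K\mathbf{a}_K = \mathbf{0}$ is $\alpha_1 = \alpha_2 = \cdots = \alpha_K = 0$.

    The preceding implies the following equivalent definition of a basis.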


    DEFINITION A.6 Basis for a Vector Space
    A basis for a vector space of $K$ dimensions is any set of $K$ linearly independent vectors in that vector space.

    Since any $(K + 1)$st vector can be written as a linear combination of the $K$ basis vectors, it follows that any set of more than $K$ vectors in $\mathbb{R}^K$ must be linearly dependent.

    A.3.4 SUBSPACES

    DEFINITION A.7 Spanning Vectors
    The set of all linear combinations of a set of vectors is the vector space that is spanned by those vectors.

    For example, by definition, the space spanned by a basis for $\mathbb{R}^K$ is $\mathbb{R}^K$. An implication of this is that if $\mathbf{a}$ and $\mathbf{b}$ are a basis for $\mathbb{R}^2$ and $\mathbf{c}$ is another vector in $\mathbb{R}^2$, the space spanned by $[\mathbf{a}, \mathbf{b}, \mathbf{c}]$ is, again, $\mathbb{R}^2$. Of course, $\mathbf{c}$ is superfluous. Nonetheless, any vector in $\mathbb{R}^2$ can be expressed as a linear combination of $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{c}$. (The linear combination will not be unique. Suppose, for example, that $\mathbf{a}$ and $\mathbf{c}$ are also a basis for $\mathbb{R}^2$.) Consider the set of three-coordinate vectors whose third element is zero. In particular,
    $$\mathbf{a}' = [a_1 \;\; a_2 \;\; 0] \quad \text{and} \quad \mathbf{b}' = [b_1 \;\; b_2 \;\; 0].$$

    Vectors $\mathbf{a}$ and $\mathbf{b}$ do not span the three-dimensional space $\mathbb{R}^3$. Every linear combination of $\mathbf{a}$ and $\mathbf{b}$ has a third coordinate equal to zero; thus, for instance, $\mathbf{c}' = [1 \;\; 2 \;\; 3]$ could not be written as a linear combination of $\mathbf{a}$ and $\mathbf{b}$. If $(a_1 b_2 - a_2 b_1)$ is not equal to zero [see (A-41)], however, then any vector whose third element is zero can be expressed as a linear combination of $\mathbf{a}$ and $\mathbf{b}$. So, although $\mathbf{a}$ and $\mathbf{b}$ do not span $\mathbb{R}^3$, they do span something, the set of vectors in $\mathbb{R}^3$ whose third element is zero. This area is a plane (the "floor" of the box in a three-dimensional figure). This plane in $\mathbb{R}^3$ is a subspace, in this instance, a two-dimensional subspace. Note that it is not $\mathbb{R}^2$; it is the set of vectors in $\mathbb{R}^3$ whose third coordinate is 0. Any plane through the origin in $\mathbb{R}^3$, regardless of how it is oriented, forms a two-dimensional subspace. Any two independent vectors that lie in that subspace will span it. But without a third vector that points in some other direction, we cannot span any more of $\mathbb{R}^3$ than this two-dimensional part of it. By the same logic, any line through the origin in $\mathbb{R}^3$ is a one-dimensional subspace, in this case, the set of all vectors in $\mathbb{R}^3$ whose coordinates are multiples of those of the vector that defines the line.

    A subspace is a vector space in all the respects in which we have defined it. We emphasize that it is not a vector space of lower dimension. For example, $\mathbb{R}^2$ is not a subspace of $\mathbb{R}^3$. The essential difference is the number of dimensions in the vectors. The vectors in $\mathbb{R}^3$ that form a two-dimensional subspace are still three-element vectors; they all just happen to lie in the same plane.

    The space spanned by a set of vectors in $\mathbb{R}^K$ has at most $K$ dimensions. If this space has fewer than $K$ dimensions, it is a subspace, or hyperplane. But the important point in the preceding discussion is that every set of vectors spans some space; it may be the entire space in which the vectors reside, or it may be some subspace of it.

    A.3.5 RANK OF A MATRIX

    We view a matrix as a set of column vectors. The number of columns in the matrix equals the number of vectors in the set, and the number of rows equals the number of coordinates in each column vector.

    DEFINITION A.8 Column Space
    The column space of a matrix is the vector space that is spanned by its column vectors.

    If the matrix contains $K$ rows, its column space might have $K$ dimensions. But, as we have seen, it might have fewer dimensions; the column vectors might be linearly dependent, or there might be fewer than $K$ of them. Consider the matrix
    $$\mathbf{A} = \begin{bmatrix} 1 & 5 & 6 \\ 2 & 6 & 8 \\ 7 & 1 & 8 \end{bmatrix}.$$

    It contains three vectors from $\mathbb{R}^3$, but the third is the sum of the first two, so the column space of this matrix cannot have three dimensions. Nor does it have only one, since the three columns are not all scalar multiples of one another. Hence, it has two, and the column space of this matrix is a two-dimensional subspace of $\mathbb{R}^3$.

    DEFINITION A.9 Column Rank
    The column rank of a matrix is the dimension of the vector space that is spanned by its column vectors.

    It follows that the column rank of a matrix is equal to the largest number of linearly independent column vectors it contains. The column rank of $\mathbf{A}$ is 2. For another specific example, consider
    $$\mathbf{B} = \begin{bmatrix} 1 & 2 & 3 \\ 5 & 1 & 5 \\ 6 & 4 & 5 \\ 3 & 1 & 4 \end{bmatrix}.$$

    It can be shown (we shall see how later) that this matrix has a column rank equal to 3. Since each column of $\mathbf{B}$ is a vector in $\mathbb{R}^4$, the column space of $\mathbf{B}$ is a three-dimensional subspace of $\mathbb{R}^4$. Consider, instead, the set of vectors obtained by using the rows of $\mathbf{B}$ instead of the columns. The new matrix would be
    $$\mathbf{C} = \begin{bmatrix} 1 & 5 & 6 & 3 \\ 2 & 1 & 4 & 1 \\ 3 & 5 & 5 & 4 \end{bmatrix}.$$

    This matrix is composed of four column vectors from $\mathbb{R}^3$. (Note that $\mathbf{C}$ is $\mathbf{B}'$.) The column space of $\mathbf{C}$ is at most $\mathbb{R}^3$, since four vectors in $\mathbb{R}^3$ must be linearly dependent. In fact, the column space of $\mathbf{C}$ is $\mathbb{R}^3$. Although this is not the same as the column space of $\mathbf{B}$, it does have the same dimension. Thus, the column rank of $\mathbf{C}$ and the column rank of $\mathbf{B}$ are the same. But the columns of $\mathbf{C}$ are the rows of $\mathbf{B}$. Thus, the column rank of $\mathbf{C}$ equals the row rank of $\mathbf{B}$. That the column and row ranks of $\mathbf{B}$ are the same is not a coincidence. The general results (which are equivalent) are as follows.

    THEOREM A.1 Equality of Row and Column Rank
    The column rank and row rank of a matrix are equal. By the definition of row rank and its counterpart for column rank, we obtain the corollary,
    the row space and column space of a matrix have the same dimension. (A-42)

    Theorem A.1 holds regardless of the actual row and column rank. If the column rank of a matrix happens to equal the number of columns it contains, then the matrix is said to have full column rank. Full row rank is defined likewise. Since the row and column ranks of a matrix are always equal, we can speak unambiguously of the rank of a matrix. For either the row rank or the column rank (and, at this point, we shall drop the distinction),
    $$\operatorname{rank}(\mathbf{A}) = \operatorname{rank}(\mathbf{A}') \le \min(\text{number of rows}, \text{number of columns}). \tag{A-43}$$

    In most contexts, we shall be interested in the columns of the matrices we manipulate. We shall use the term full rank to describe a matrix whose rank is equal to the number of columns it contains. Of particular interest will be the distinction between full rank and short rank matrices. The distinction turns on the solutions to $\mathbf{Ax} = \mathbf{0}$. If a nonzero $\mathbf{x}$ for which $\mathbf{Ax} = \mathbf{0}$ exists, then $\mathbf{A}$ does not have full rank. Equivalently, if the nonzero $\mathbf{x}$ exists, then the columns of $\mathbf{A}$ are linearly dependent and at least one of them can be expressed as a linear combination of the others. For example, a nonzero set of solutions to
    $$\begin{bmatrix} 1 & 3 & 10 \\ 2 & 3 & 14 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
    is any multiple of $\mathbf{x}' = (2, 1, -\tfrac{1}{2})$.

    In a product matrix $\mathbf{C} = \mathbf{AB}$, every column of $\mathbf{C}$ is a linear combination of the columns of $\mathbf{A}$, so each column of $\mathbf{C}$ is in the column space of $\mathbf{A}$. It is possible that the set of columns in $\mathbf{C}$ could span this space, but it is not possible for them to span a higher-dimensional space. At best, they could be a full set of linearly independent vectors in $\mathbf{A}$'s column space. We conclude that the column rank of $\mathbf{C}$ could not be greater than that of $\mathbf{A}$. Now, apply the same logic to the rows of $\mathbf{C}$, which are all linear combinations of the rows of $\mathbf{B}$. For the same reason that the column rank of $\mathbf{C}$ cannot exceed the column rank of $\mathbf{A}$, the row rank of $\mathbf{C}$ cannot exceed the row rank of $\mathbf{B}$. Since row and column ranks are always equal, we conclude that
    $$\operatorname{rank}(\mathbf{AB}) \le \min(\operatorname{rank}(\mathbf{A}), \operatorname{rank}(\mathbf{B})). \tag{A-44}$$
    A useful corollary of (A-44) is:
    $$\text{If } \mathbf{A} \text{ is } M \times n \text{ and } \mathbf{B} \text{ is a square matrix of rank } n, \text{ then } \operatorname{rank}(\mathbf{AB}) = \operatorname{rank}(\mathbf{A}). \tag{A-45}$$
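    The rank results can be checked on the matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ used above. The sketch below computes rank by Gauss-Jordan elimination in exact rational arithmetic; this is one convenient method chosen for the illustration, not a procedure taken from the text, and the helper names are assumptions.

```python
from fractions import Fraction

def rank(M):
    """Rank by Gauss-Jordan elimination in exact rational arithmetic."""
    rows = [[Fraction(v) for v in row] for row in M]
    n_rows, n_cols = len(rows), len(rows[0])
    r = 0                                   # number of pivots found so far
    for col in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if rows[i][col] != 0), None)
        if pivot is None:
            continue                        # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(n_rows):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == n_rows:
            break
    return r

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(P, Q):
    return [[sum(P[i][j] * Q[j][l] for j in range(len(Q)))
             for l in range(len(Q[0]))] for i in range(len(P))]

A = [[1, 5, 6], [2, 6, 8], [7, 1, 8]]
B = [[1, 2, 3], [5, 1, 5], [6, 4, 5], [3, 1, 4]]
C = transpose(B)

print(rank(A), rank(B), rank(C))   # 2 3 3
print(rank(matmul(A, C)))          # cannot exceed min(rank(A), rank(C))
```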


    Another application that plays a central role in the development of regression analysis is, for any matrix $\mathbf{A}$,
    $$\operatorname{rank}(\mathbf{A}) = \operatorname{rank}(\mathbf{A}'\mathbf{A}) = \operatorname{rank}(\mathbf{A}\mathbf{A}'). \tag{A-46}$$

    A.3.6 DETERMINANT OF A MATRIX

    The determinant of a square matrix (determinants are not defined for nonsquare matrices) is a function of the elements of the matrix. There are various definitions, most of which are not useful for our work. Determinants figure into our results in several ways, however, that we can enumerate before we need formally to define the computations.

    PROPOSITION
    The determinant of a matrix is nonzero if and only if it has full rank.

    Full rank and short rank matrices can be distinguished by whether or not their determinants are nonzero. There are some settings in which the value of the determinant is also of interest, so we now consider some algebraic results. It is most convenient to begin with a diagonal matrix
    $$\mathbf{D} = \begin{bmatrix} d_1 & 0 & 0 & \cdots & 0 \\ 0 & d_2 & 0 & \cdots & 0 \\ & & & \ddots & \\ 0 & 0 & 0 & \cdots & d_K \end{bmatrix}.$$

    The column vectors of $\mathbf{D}$ define a "box" in $\mathbb{R}^K$ whose sides are all at right angles to one another (see footnote 3). Its "volume," or determinant, is simply the product of the lengths of the sides, which we denote
    $$|\mathbf{D}| = d_1 d_2 \cdots d_K = \prod_{k=1}^{K} d_k. \tag{A-47}$$

    A special case is the identity matrix, which has, regardless of $K$, $|\mathbf{I}_K| = 1$. Multiplying $\mathbf{D}$ by a scalar $c$ is equivalent to multiplying the length of each side of the box by $c$, which would multiply its volume by $c^K$. Thus,
    $$|c\mathbf{D}| = c^K|\mathbf{D}|. \tag{A-48}$$

    Continuing with this admittedly special case, we suppose that only one column of $\mathbf{D}$ is multiplied by $c$. In two dimensions, this would make the box wider but not higher, or vice versa. Hence, the "volume" (area) would also be multiplied by $c$. Now, suppose that each side of the box were multiplied by a different $c$, the first by $c_1$, the second by $c_2$, and so on. The volume would, by an obvious extension, now be $c_1 c_2 \cdots c_K|\mathbf{D}|$. The matrix with columns defined by $[c_1\mathbf{d}_1 \;\; c_2\mathbf{d}_2 \;\; \cdots]$ is just $\mathbf{DC}$, where $\mathbf{C}$ is a diagonal matrix with $c_i$ as its $i$th diagonal element. The computation just described is, therefore,
    $$|\mathbf{DC}| = |\mathbf{D}| \cdot |\mathbf{C}|. \tag{A-49}$$

    (The determinant of $\mathbf{C}$ is the product of the $c_i$'s since $\mathbf{C}$, like $\mathbf{D}$, is a diagonal matrix.) In particular, note what happens to the whole thing if one of the $c_i$'s is zero.

    Footnote 3: Each column vector defines a segment on one of the axes.


    For $2 \times 2$ matrices, the computation of the determinant is
    $$\begin{vmatrix} a & c \\ b & d \end{vmatrix} = ad - bc. \tag{A-50}$$

    Notice that it is a function of all the elements of the matrix. This statement will be true in general. For more than two dimensions, the determinant can be obtained by using an expansion by cofactors. Using any row, say $i$, we obtain
    $$|\mathbf{A}| = \sum_{k=1}^{K} a_{ik}(-1)^{i+k}|\mathbf{A}_{ik}|, \quad k = 1, \ldots, K, \tag{A-51}$$
    where $\mathbf{A}_{ik}$ is the matrix obtained from $\mathbf{A}$ by deleting row $i$ and column $k$. The determinant of $\mathbf{A}_{ik}$ is called a minor of $\mathbf{A}$ (see footnote 4). When the correct sign, $(-1)^{i+k}$, is added, it becomes a cofactor. This operation can be done using any column as well. For example, a $4 \times 4$ determinant becomes a sum of four $3 \times 3$s, whereas a $5 \times 5$ is a sum of five $4 \times 4$s, each of which is a sum of four $3 \times 3$s, and so on. Obviously, it is a good idea to base (A-51) on a row or column with many zeros in it, if possible. In practice, this rapidly becomes a heavy burden. It is unlikely, though, that you will ever calculate any determinants over $3 \times 3$ without a computer. A $3 \times 3$, however, might be computed on occasion; if so, the following shortcut will prove useful:
    $$\begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{vmatrix} = a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{32}a_{21} - a_{31}a_{22}a_{13} - a_{21}a_{12}a_{33} - a_{11}a_{23}a_{32}.$$
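    The cofactor expansion in (A-51) can be written as a short recursive function. This is a sketch for small matrices only (its cost grows factorially with the order of the matrix), and the function name `det` is an assumption. The test matrices are the $3 \times 3$ matrix $\mathbf{A}$ from Section A.3.5 and the first three rows of $\mathbf{B}$.

```python
def det(A):
    """Determinant by cofactor expansion along the first row, as in (A-51)."""
    if len(A) == 1:
        return A[0][0]
    total = 0
    for k in range(len(A)):
        # Minor: delete the first row and column k.
        minor = [row[:k] + row[k + 1:] for row in A[1:]]
        # With zero-based k the sign (-1)**k matches (-1)**(i+k) for the first row.
        total += A[0][k] * (-1) ** k * det(minor)
    return total

A = [[1, 5, 6], [2, 6, 8], [7, 1, 8]]          # third column = sum of first two
B3 = [[1, 2, 3], [5, 1, 5], [6, 4, 5]]         # first three rows of B
D = [[2, 0, 0], [0, 3, 0], [0, 0, 4]]          # hypothetical diagonal matrix

print(det(A))                                   # 0, so A is short rank
print(det(B3))                                  # 37
print(det(D))                                   # 24, the product of the diagonal
print(det([[2 * v for v in row] for row in D])) # 192 = 2**3 * 24, as in (A-48)
```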

    Although (A-48) and (A-49) were given for diagonal matrices, they hold for general matrices $\mathbf{C}$ and $\mathbf{D}$. One special case of (A-48) to note is that of $c = -1$. Multiplying a matrix by $-1$ does not necessarily change the sign of its determinant. It does so only if the order of the matrix is odd. By using the expansion by cofactors formula, an additional result can be shown:
    $$|\mathbf{A}| = |\mathbf{A}'|. \tag{A-52}$$

    A.3.7 A LEAST SQUARES PROBLEM

    Given a vector $\mathbf{y}$ and a matrix $\mathbf{X}$, we are interested in expressing $\mathbf{y}$ as a linear combination of the columns of $\mathbf{X}$. There are two possibilities. If $\mathbf{y}$ lies in the column space of $\mathbf{X}$, then we shall be able to find a vector $\mathbf{b}$ such that
    $$\mathbf{y} = \mathbf{Xb}. \tag{A-53}$$

    Figure A.3 illustrates such a case for three dimensions in which the two columns of $\mathbf{X}$ both have a third coordinate equal to zero. Only $\mathbf{y}$'s whose third coordinate is zero, such as $\mathbf{y}^0$ in the figure, can be expressed as $\mathbf{Xb}$ for some $\mathbf{b}$. For the general case, assuming that $\mathbf{y}$ is, indeed, in the column space of $\mathbf{X}$, we can find the coefficients $\mathbf{b}$ by solving the set of equations in (A-53). The solution is discussed in the next section. Suppose, however, that $\mathbf{y}$ is not in the column space of $\mathbf{X}$. In the context of this example, suppose that $\mathbf{y}$'s third component is not zero. Then there is no $\mathbf{b}$ such that (A-53) holds. We can, however, write
    $$\mathbf{y} = \mathbf{Xb} + \mathbf{e}, \tag{A-54}$$

    where $\mathbf{e}$ is the difference between $\mathbf{y}$ and $\mathbf{Xb}$. By this construction, we find an $\mathbf{Xb}$ that is in the column space of $\mathbf{X}$, and $\mathbf{e}$ is the difference, or "residual." Figure A.3 shows two examples, $\mathbf{y}$ and $\mathbf{y}^*$. For the present, we consider only $\mathbf{y}$. We are interested in finding the $\mathbf{b}$ such that $\mathbf{y}$ is as close as possible to $\mathbf{Xb}$ in the sense that $\mathbf{e}$ is as short as possible.

    Footnote 4: If $i$ equals $j$, then the determinant is a principal minor.

    FIGURE A.3 Least Squares Projections. (Axes: first coordinate, second coordinate, third coordinate.)

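    DEFINITION A.10 Length of a Vector
    The length, or norm, of a vector $\mathbf{e}$ is
    $$\|\mathbf{e}\| = \sqrt{\mathbf{e}'\mathbf{e}}. \tag{A-55}$$

    The problem is to find the $\mathbf{b}$ for which $\|\mathbf{e}\| = \|\mathbf{y} - \mathbf{Xb}\|$ is as small as possible. The solution is that $\mathbf{b}$ that makes $\mathbf{e}$ perpendicular, or orthogonal, to $\mathbf{Xb}$.

    DEFINITION A.11 Orthogonal Vectors
    Two nonzero vectors $\mathbf{a}$ and $\mathbf{b}$ are orthogonal, written $\mathbf{a} \perp \mathbf{b}$, if and only if $\mathbf{a}'\mathbf{b} = \mathbf{b}'\mathbf{a} = 0$.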

    Returning once again to our fitting problem, we find that the $\mathbf{b}$ we seek is that for which $\mathbf{e} \perp \mathbf{Xb}$. Expanding this set of equations gives the requirement
    $$(\mathbf{Xb})'\mathbf{e} = 0 = \mathbf{b}'\mathbf{X}'\mathbf{y} - \mathbf{b}'\mathbf{X}'\mathbf{Xb} = \mathbf{b}'[\mathbf{X}'\mathbf{y} - \mathbf{X}'\mathbf{Xb}]$$
    or, assuming $\mathbf{b}$ is not $\mathbf{0}$, the set of equations
    $$\mathbf{X}'\mathbf{y} = \mathbf{X}'\mathbf{Xb}.$$
    The means of solving such a set of equations is the subject of Section A.5.

    In Figure A.3, the linear combination $\mathbf{Xb}$ is called the projection of $\mathbf{y}$ into the column space of $\mathbf{X}$. The figure is drawn so that, although $\mathbf{y}$ and $\mathbf{y}^*$ are different, they are similar in that the projection of $\mathbf{y}$ lies on top of that of $\mathbf{y}^*$. The question we wish to pursue here is, Which vector, $\mathbf{y}$ or $\mathbf{y}^*$, is closer to its projection in the column space of $\mathbf{X}$? Superficially, it would appear that $\mathbf{y}$ is closer, since $\mathbf{e}$ is shorter than $\mathbf{e}^*$. Yet $\mathbf{y}^*$ is much more nearly parallel to its projection than $\mathbf{y}$, so the only reason that its residual vector is longer is that $\mathbf{y}^*$ is longer compared with $\mathbf{y}$. A measure of comparison that would be unaffected by the length of the vectors is the angle between the vector and its projection (assuming that angle is not zero). By this measure, $\theta^*$ is considerably smaller than $\theta$, which would reverse the earlier conclusion.

    THEOREM A.2 The Cosine Law
    The angle $\theta$ between two vectors $\mathbf{a}$ and $\mathbf{b}$ satisfies
    $$\cos\theta = \frac{\mathbf{a}'\mathbf{b}}{\|\mathbf{a}\| \cdot \|\mathbf{b}\|}.$$

    The two vectors in the calculation would be $\mathbf{y}$ or $\mathbf{y}^*$ and $\mathbf{Xb}$ or $(\mathbf{Xb})^*$. A zero cosine implies that the vectors are orthogonal. If the cosine is one, then the angle is zero, which means that the vectors are the same. (They would be if $\mathbf{y}$ were in the column space of $\mathbf{X}$.) By dividing by the lengths, we automatically compensate for the length of $\mathbf{y}$. By this measure, we find in Figure A.3 that $\mathbf{y}^*$ is closer to its projection, $(\mathbf{Xb})^*$, than $\mathbf{y}$ is to its projection, $\mathbf{Xb}$.
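    The comparison of residual length with the cosine measure can be reproduced numerically. The sketch below assumes, as in Figure A.3, that the columns of $\mathbf{X}$ span the "floor" of $\mathbb{R}^3$ (the vectors with third coordinate zero), so that the projection simply drops the third coordinate. The two vectors are hypothetical values chosen to mimic the figure; they are not taken from the text.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))          # length of a vector, as in (A-55)

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def project_to_floor(y):
    """Projection onto the plane of vectors whose third coordinate is zero."""
    return [y[0], y[1], 0.0]

y = [1.0, 2.0, 1.0]                      # hypothetical y
y_star = [3.0, 6.0, 2.0]                 # hypothetical y*, longer than y

results = {}
for name, vec in (("y", y), ("y*", y_star)):
    proj = project_to_floor(vec)
    e = [a - b for a, b in zip(vec, proj)]
    # residual length, cosine of the angle to the projection, and e'(Xb)
    results[name] = (norm(e), cosine(vec, proj), dot(e, proj))

for name, (e_len, cos_theta, orth) in results.items():
    print(name, e_len, round(cos_theta, 4), orth)
```

    In this example the residual for $\mathbf{y}$ is shorter, but the cosine for $\mathbf{y}^*$ is larger, so $\mathbf{y}^*$ makes the smaller angle with its projection, which is the reversal described in the text.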

    A.4 SOLUTION OF A SYSTEM OF LINEAR EQUATIONS

    Consider the set of $n$ linear equations
    $$\mathbf{Ax} = \mathbf{b}, \tag{A-56}$$
    in which the $K$ elements of $\mathbf{x}$ constitute the unknowns. $\mathbf{A}$ is a known matrix of coefficients, and $\mathbf{b}$ is a specified vector of values. We are interested in knowing whether a solution exists; if so, then how to obtain it; and finally, if it does exist, then whether it is unique.

    A.4.1 SYSTEMS OF LINEAR EQUATIONS

    For most of our applications, we shall consider only square systems of equations, that is, those in which $\mathbf{A}$ is a square matrix. In what follows, therefore, we take $n$ to equal $K$. Since the number of rows in $\mathbf{A}$ is the number of equations, whereas the number of columns in $\mathbf{A}$ is the number of variables, this case is the familiar one of "$n$ equations in $n$ unknowns." There are two types of systems of equations.


    DEFINITION A.12 Homogeneous Equation System
    A homogeneous system is of the form $\mathbf{Ax} = \mathbf{0}$.

    By definition, a nonzero solution to such a system will exist if and only if $\mathbf{A}$ does not have full rank. If so, then for at least one column of $\mathbf{A}$, we can write the preceding as
    $$\mathbf{a}_k = -\sum_{m \ne k} \frac{x_m}{x_k}\mathbf{a}_m.$$

    This means, as we know, that the columns of $\mathbf{A}$ are linearly dependent and that $|\mathbf{A}| = 0$.

    DEFINITION A.13 Nonhomogeneous Equation System
    A nonhomogeneous system of equations is of the form $\mathbf{Ax} = \mathbf{b}$, where $\mathbf{b}$ is a nonzero vector.

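    The vector $\mathbf{b}$ is chosen arbitrarily and is to be expressed as a linear combination of the columns of $\mathbf{A}$. Since $\mathbf{b}$ has $K$ elements, this situation will be possible only if the columns of $\mathbf{A}$ span the entire $K$-dimensional space, $\mathbb{R}^K$ (see footnote 5). Equivalently, we shall require that the columns of $\mathbf{A}$ be linearly independent or that $|\mathbf{A}|$ not be equal to zero.

    Footnote 5: If $\mathbf{A}$ does not have full rank, then the nonhomogeneous system will have solutions for some vectors $\mathbf{b}$, namely, any $\mathbf{b}$ in the column space of $\mathbf{A}$. But we are interested in the case in which there are solutions for all nonzero vectors $\mathbf{b}$, which requires $\mathbf{A}$ to have full rank.

    A.4.2 INVERSE MATRICES

    To solve the system $\mathbf{Ax} = \mathbf{b}$ for $\mathbf{x}$, something akin to division by a matrix is needed. Suppose that we could find a square matrix $\mathbf{B}$ such that $\mathbf{BA} = \mathbf{I}$. If the equation system is premultiplied by this $\mathbf{B}$, then the following would be obtained:
    $$\mathbf{BAx} = \mathbf{Ix} = \mathbf{x} = \mathbf{Bb}. \tag{A-57}$$
    If the matrix $\mathbf{B}$ exists, then it is the inverse of $\mathbf{A}$, denoted
    $$\mathbf{B} = \mathbf{A}^{-1}.$$
    From the definition,
    $$\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}.$$
    In addition, by premultiplying by $\mathbf{A}$, postmultiplying by $\mathbf{A}^{-1}$, and then canceling terms, we find
    $$\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$$
    as well.

    If the inverse exists, then it must be unique. Suppose that it is not and that $\mathbf{C}$ is a different inverse of $\mathbf{A}$. Then $\mathbf{CAB} = \mathbf{CAB}$, but $(\mathbf{CA})\mathbf{B} = \mathbf{IB} = \mathbf{B}$ and $\mathbf{C}(\mathbf{AB}) = \mathbf{C}$, which would be a contradiction if $\mathbf{C}$ did not equal $\mathbf{B}$. Since, by (A-57), the solution is $\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}$, the solution to the equation system is unique as well.

    We now consider the calculation of the inverse matrix. For a $2 \times 2$ matrix, $\mathbf{AB} = \mathbf{I}$ implies that
    $$\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \quad \text{or} \quad \begin{aligned} a_{11}b_{11} + a_{12}b_{21} &= 1 \\ a_{11}b_{12} + a_{12}b_{22} &= 0 \\ a_{21}b_{11} + a_{22}b_{21} &= 0 \\ a_{21}b_{12} + a_{22}b_{22} &= 1 \end{aligned}$$
    The solutions are
    $$\begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix} = \frac{1}{a_{11}a_{22} - a_{12}a_{21}} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix} = \frac{1}{|\mathbf{A}|} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}. \tag{A-58}$$

    Notice the presence of the reciprocal of $|\mathbf{A}|$ in $\mathbf{A}^{-1}$. This situation is not specific to the $2 \times 2$ case. We infer from it that if the determinant is zero, then the inverse does not exist.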
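    The $2 \times 2$ formula in (A-58) is simple enough to code directly. The following is a minimal sketch in exact rational arithmetic; the function name and the example matrix are hypothetical.

```python
from fractions import Fraction

def inverse_2x2(A):
    """Inverse of a 2 x 2 matrix by (A-58); fails when the determinant is zero."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("determinant is zero, so the inverse does not exist")
    scale = Fraction(1, det)
    return [[scale * a22, -scale * a12],
            [-scale * a21, scale * a11]]

def matmul(P, Q):
    return [[sum(P[i][j] * Q[j][l] for j in range(len(Q)))
             for l in range(len(Q[0]))] for i in range(len(P))]

A = [[4, 7], [2, 6]]                 # hypothetical matrix with |A| = 10
A_inv = inverse_2x2(A)

print(A_inv)
print(matmul(A, A_inv) == [[1, 0], [0, 1]])   # A A^{-1} = I
print(matmul(A_inv, A) == [[1, 0], [0, 1]])   # A^{-1} A = I
```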

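    DEFINITION A.14 Nonsingular Matrix
    A matrix is nonsingular if and only if its inverse exists.

    The simplest inverse matrix to compute is that of a diagonal matrix. If
    $$\mathbf{D} = \begin{bmatrix} d_1 & 0 & 0 & \cdots & 0 \\ 0 & d_2 & 0 & \cdots & 0 \\ & & & \ddots & \\ 0 & 0 & 0 & \cdots & d_K \end{bmatrix}, \quad \text{then} \quad \mathbf{D}^{-1} = \begin{bmatrix} 1/d_1 & 0 & 0 & \cdots & 0 \\ 0 & 1/d_2 & 0 & \cdots & 0 \\ & & & \ddots & \\ 0 & 0 & 0 & \cdots & 1/d_K \end{bmatrix},$$
    which shows, incidentally, that $\mathbf{I}^{-1} = \mathbf{I}$.

    We shall use $a^{ik}$ to indicate the $ik$th element of $\mathbf{A}^{-1}$. The general formula for computing an inverse matrix is
    $$a^{ik} = \frac{|\mathbf{C}_{ki}|}{|\mathbf{A}|}, \tag{A-59}$$
    where $|\mathbf{C}_{ki}|$ is the $ki$th cofactor of $\mathbf{A}$. It follows, therefore, that for $\mathbf{A}$ to be nonsingular, $|\mathbf{A}|$ must be nonzero. Notice the reversal of the subscripts.

    Some computational results involving inverses are
    $$|\mathbf{A}^{-1}| = \frac{1}{|\mathbf{A}|}, \tag{A-60}$$
    $$(\mathbf{A}^{-1})^{-1} = \mathbf{A}, \tag{A-61}$$
    $$(\mathbf{A}^{-1})' = (\mathbf{A}')^{-1}. \tag{A-62}$$
    $$\text{If } \mathbf{A} \text{ is symmetric, then } \mathbf{A}^{-1} \text{ is symmetric.} \tag{A-63}$$
    When both inverse matrices exist,
    $$(\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}. \tag{A-64}$$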


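    Note the condition preceding (A-64). It may be that $\mathbf{AB}$ is a square, nonsingular matrix when neither $\mathbf{A}$ nor $\mathbf{B}$ is even square. (Consider, for example, $\mathbf{A}'\mathbf{A}$.) Extending (A-64), we have
    $$(\mathbf{ABC})^{-1} = \mathbf{C}^{-1}(\mathbf{AB})^{-1} = \mathbf{C}^{-1}\mathbf{B}^{-1}\mathbf{A}^{-1}. \tag{A-65}$$

    Recall that for a data matrix $\mathbf{X}$, $\mathbf{X}'\mathbf{X}$ is the sum of the outer products of the rows of $\mathbf{X}$. Suppose that we have already computed $\mathbf{S} = (\mathbf{X}'\mathbf{X})^{-1}$ for a number of years of data, such as those given at the beginning of this chapter. The following result, which is called an updating formula, shows how to compute the new $\mathbf{S}$ that would result when a new row is added to $\mathbf{X}$:
    $$[\mathbf{A} \pm \mathbf{bb}']^{-1} = \mathbf{A}^{-1} \mp \left[\frac{1}{1 \pm \mathbf{b}'\mathbf{A}^{-1}\mathbf{b}}\right]\mathbf{A}^{-1}\mathbf{bb}'\mathbf{A}^{-1}. \tag{A-66}$$

    Note the reversal of the sign in the inverse. Two more general forms of (A-66) that are occasionally useful are
    $$[\mathbf{A} \pm \mathbf{bc}']^{-1} = \mathbf{A}^{-1} \mp \left[\frac{1}{1 \pm \mathbf{c}'\mathbf{A}^{-1}\mathbf{b}}\right]\mathbf{A}^{-1}\mathbf{bc}'\mathbf{A}^{-1}, \tag{A-66a}$$
    $$[\mathbf{A} \pm \mathbf{BCB}']^{-1} = \mathbf{A}^{-1} \mp \mathbf{A}^{-1}\mathbf{B}[\mathbf{C}^{-1} \pm \mathbf{B}'\mathbf{A}^{-1}\mathbf{B}]^{-1}\mathbf{B}'\mathbf{A}^{-1}. \tag{A-66b}$$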
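    The updating formula (A-66) can be verified on a small example by comparing the updated inverse with the inverse computed from scratch. The sketch below uses the "plus" case and a two-column data matrix so that the $2 \times 2$ formula (A-58) supplies the direct inverses; the data and the helper names are hypothetical.

```python
from fractions import Fraction

def inverse_2x2(A):
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("singular matrix")
    scale = Fraction(1, det)
    return [[scale * a22, -scale * a12], [-scale * a21, scale * a11]]

def cross_product(X):
    """X'X for a data matrix stored as a list of rows."""
    K = len(X[0])
    return [[sum(r[k] * r[l] for r in X) for l in range(K)] for k in range(K)]

def updated_inverse(S, b):
    """(A + bb')^{-1} from S = A^{-1}, following (A-66) with the plus sign."""
    K = len(b)
    Sb = [sum(S[i][j] * b[j] for j in range(K)) for i in range(K)]   # A^{-1} b
    bS = [sum(b[i] * S[i][j] for i in range(K)) for j in range(K)]   # b' A^{-1}
    denom = 1 + sum(bi * sbi for bi, sbi in zip(b, Sb))              # 1 + b'A^{-1}b
    return [[S[i][j] - Sb[i] * bS[j] / denom for j in range(K)]
            for i in range(K)]

X = [[1, 1], [1, 2], [1, 4]]      # hypothetical data: constant and one regressor
new_row = [1, 3]                  # hypothetical new observation

S_old = inverse_2x2(cross_product(X))
S_updated = updated_inverse(S_old, new_row)
S_direct = inverse_2x2(cross_product(X + [new_row]))

print(S_updated == S_direct)
```

    The point of the formula is that the update needs only matrix-vector products with the existing $\mathbf{S}$; no new inversion is required.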
    A.4.3 NONHOMOGENEOUS SYSTEMS OF EQUATIONS

    For the nonhomogeneous system
    $$\mathbf{Ax} = \mathbf{b},$$
    if $\mathbf{A}$ is nonsingular, then the unique solution is
    $$\mathbf{x} = \mathbf{A}^{-1}\mathbf{b}.$$

    A.4.4 SOLVING THE LEAST SQUARES PROBLEM

    We now have the tool needed to solve the least squares problem posed in Section A.3.7. We found the solution vector, $\mathbf{b}$, to be the solution to the nonhomogeneous system $\mathbf{X}'\mathbf{y} = \mathbf{X}'\mathbf{Xb}$. Let $\mathbf{a}$ equal the vector $\mathbf{X}'\mathbf{y}$ and let $\mathbf{A}$ equal the square matrix $\mathbf{X}'\mathbf{X}$. The equation system is then
    $$\mathbf{Ab} = \mathbf{a}.$$
    By the results above, if $\mathbf{A}$ is nonsingular, then
    $$\mathbf{b} = \mathbf{A}^{-1}\mathbf{a} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y},$$
    assuming that the matrix to be inverted is nonsingular. We have reached the irreducible minimum. If the columns of $\mathbf{X}$ are linearly independent, that is, if $\mathbf{X}$ has full rank, then this is the solution to the least squares problem. If the columns of $\mathbf{X}$ are linearly dependent, then this system has no unique solution.
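    For a two-column $\mathbf{X}$, the solution $\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ can be computed with the $2 \times 2$ inverse in (A-58). The sketch below is illustrative only: the data are hypothetical, the function name is an assumption, and the two-column restriction is a simplification made so the example stays self-contained.

```python
from fractions import Fraction

def least_squares_2col(X, y):
    """Solve X'X b = X'y for a two-column X using the 2 x 2 inverse (A-58)."""
    a11 = sum(r[0] * r[0] for r in X)
    a12 = sum(r[0] * r[1] for r in X)
    a22 = sum(r[1] * r[1] for r in X)
    z1 = sum(r[0] * yi for r, yi in zip(X, y))     # first element of X'y
    z2 = sum(r[1] * yi for r, yi in zip(X, y))     # second element of X'y
    det = a11 * a22 - a12 * a12
    if det == 0:
        # columns of X are linearly dependent: no unique solution
        raise ValueError("X'X is singular")
    return [Fraction(a22 * z1 - a12 * z2, det),
            Fraction(a11 * z2 - a12 * z1, det)]

X = [[1, 1], [1, 2], [1, 3]]     # hypothetical data: constant and one regressor
y = [1, 2, 2]

b = least_squares_2col(X, y)
fitted = [r[0] * b[0] + r[1] * b[1] for r in X]            # Xb
e = [yi - fi for yi, fi in zip(y, fitted)]                 # residual y - Xb
X_t_e = [sum(r[k] * ei for r, ei in zip(X, e)) for k in range(2)]   # X'e

print(b, e, X_t_e)
```

    The last quantity, $\mathbf{X}'\mathbf{e}$, is a vector of zeros, which is the orthogonality condition from Section A.3.7 restated for each column of $\mathbf{X}$.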

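    A.5 PARTITIONED MATRICES

    In formulating the elements of a matrix, it is sometimes useful to group some of the elements in submatrices. Let
    $$\mathbf{A} = \begin{bmatrix} 1 & 4 & 5 \\ 2 & 9 & 3 \\ 8 & 9 & 6 \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}.$$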


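    $\mathbf{A}$ is a partitioned matrix. The subscripts of the submatrices are defined in the same fashion as those for the elements of a matrix. A common special case is the block diagonal matrix
    $$\mathbf{A} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{0} \\ \mathbf{0} & \mathbf{A}_{22} \end{bmatrix},$$
    where $\mathbf{A}_{11}$ and $\mathbf{A}_{22}$ are square matrices.

    A.5.1 ADDITION AND MULTIPLICATION OF PARTITIONED MATRICES

    For conformably partitioned matrices $\mathbf{A}$ and $\mathbf{B}$,
    $$\mathbf{A} + \mathbf{B} = \begin{bmatrix} \mathbf{A}_{11} + \mathbf{B}_{11} & \mathbf{A}_{12} + \mathbf{B}_{12} \\ \mathbf{A}_{21} + \mathbf{B}_{21} & \mathbf{A}_{22} + \mathbf{B}_{22} \end{bmatrix} \tag{A-67}$$
    and
    $$\mathbf{AB} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix} \begin{bmatrix} \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{B}_{21} & \mathbf{B}_{22} \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11}\mathbf{B}_{11} + \mathbf{A}_{12}\mathbf{B}_{21} & \mathbf{A}_{11}\mathbf{B}_{12} + \mathbf{A}_{12}\mathbf{B}_{22} \\ \mathbf{A}_{21}\mathbf{B}_{11} + \mathbf{A}_{22}\mathbf{B}_{21} & \mathbf{A}_{21}\mathbf{B}_{12} + \mathbf{A}_{22}\mathbf{B}_{22} \end{bmatrix}. \tag{A-68}$$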

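    In all these, the matrices must be conformable for the operations involved. For addition, the dimensions of $\mathbf{A}_{ik}$ and $\mathbf{B}_{ik}$ must be the same. For multiplication, the number of columns in $\mathbf{A}_{ij}$ must equal the number of rows in $\mathbf{B}_{jl}$ for all pairs $i$ and $j$. That is, all the necessary matrix products of the submatrices must be defined. Two cases frequently encountered are of the form
    $$\begin{bmatrix} \mathbf{A}_1 \\ \mathbf{A}_2 \end{bmatrix}' \begin{bmatrix} \mathbf{A}_1 \\ \mathbf{A}_2 \end{bmatrix} = [\mathbf{A}_1' \;\; \mathbf{A}_2'] \begin{bmatrix} \mathbf{A}_1 \\ \mathbf{A}_2 \end{bmatrix} = [\mathbf{A}_1'\mathbf{A}_1 + \mathbf{A}_2'\mathbf{A}_2] \tag{A-69}$$
    and
    $$\begin{bmatrix} \mathbf{A}_{11} & \mathbf{0} \\ \mathbf{0} & \mathbf{A}_{22} \end{bmatrix}' \begin{bmatrix} \mathbf{A}_{11} & \mathbf{0} \\ \mathbf{0} & \mathbf{A}_{22} \end{bmatrix} = \begin{bmatrix} \mathbf{A}_{11}'\mathbf{A}_{11} & \mathbf{0} \\ \mathbf{0} & \mathbf{A}_{22}'\mathbf{A}_{22} \end{bmatrix}. \tag{A-70}$$

    A.5.2 DETERMINANTS OF PARTITIONED MATRICES

    The determinant of a block diagonal matrix is obtained analogously to that of a diagonal matrix:
    $$\begin{vmatrix} \mathbf{A}_{11} & \mathbf{0} \\ \mathbf{0} & \mathbf{A}_{22} \end{vmatrix} = |\mathbf{A}_{11}| \cdot |\mathbf{A}_{22}|. \tag{A-71}$$
    The determinant of a general $2 \times 2$ partitioned matrix is
    $$\begin{vmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{vmatrix} = |\mathbf{A}_{22}| \cdot |\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}| = |\mathbf{A}_{11}| \cdot |\mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12}|. \tag{A-72}$$

    A.5.3 INVERSES OF PARTITIONED MATRICES

    The inverse of a block diagonal matrix is
    $$\begin{bmatrix} \mathbf{A}_{11} & \mathbf{0} \\ \mathbf{0} & \mathbf{A}_{22} \end{bmatrix}^{-1} = \begin{bmatrix} \mathbf{A}_{11}^{-1} & \mathbf{0} \\ \mathbf{0} & \mathbf{A}_{22}^{-1} \end{bmatrix}, \tag{A-73}$$
    which can be verified by direct multiplication.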


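    For the general $2 \times 2$ partitioned matrix, one form of the partitioned inverse is

    $$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1} = \begin{bmatrix} A_{11}^{-1}\left(I + A_{12}F_2A_{21}A_{11}^{-1}\right) & -A_{11}^{-1}A_{12}F_2 \\ -F_2A_{21}A_{11}^{-1} & F_2 \end{bmatrix}, \tag{A-74}$$

    where $F_2 = \left(A_{22} - A_{21}A_{11}^{-1}A_{12}\right)^{-1}$. The upper left block could also be written as $F_1 = \left(A_{11} - A_{12}A_{22}^{-1}A_{21}\right)^{-1}$.

    A.5.4 DEVIATIONS FROM MEANS

    Suppose that we begin with a column vector of $n$ values $x$ and let

    $$A = \begin{bmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{bmatrix} = \begin{bmatrix} \mathbf{i}'\mathbf{i} & \mathbf{i}'x \\ x'\mathbf{i} & x'x \end{bmatrix},$$

    where $\mathbf{i}$ is a column of ones. We are interested in the lower right-hand element of $A^{-1}$. Upon using the definition of $F_2$ in (A-74), this is

    $$F_2 = \left[x'x - (x'\mathbf{i})(\mathbf{i}'\mathbf{i})^{-1}(\mathbf{i}'x)\right]^{-1} = \left\{x'\left[Ix - \mathbf{i}\left(\tfrac{1}{n}\right)\mathbf{i}'x\right]\right\}^{-1} = \left\{x'\left[I - \left(\tfrac{1}{n}\right)\mathbf{i}\mathbf{i}'\right]x\right\}^{-1} = \left(x'M^0x\right)^{-1}.$$

    Therefore, the lower right-hand value in the inverse matrix is

    $$\left(x'M^0x\right)^{-1} = \frac{1}{\sum_{i=1}^n (x_i - \bar{x})^2} = a^{22}.$$

    Now, suppose that we replace $x$ with $X$, a matrix with several columns. We seek the lower right block of $(Z'Z)^{-1}$, where $Z = [\mathbf{i}, X]$. The analogous result is

    $$(Z'Z)^{22} = \left[X'X - X'\mathbf{i}(\mathbf{i}'\mathbf{i})^{-1}\mathbf{i}'X\right]^{-1} = \left(X'M^0X\right)^{-1},$$

    which implies that the $K \times K$ matrix in the lower right corner of $(Z'Z)^{-1}$ is the inverse of the $K \times K$ matrix whose $jk$th element is $\sum_{i=1}^n (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$. Thus, when a data matrix contains a column of ones, the elements of the inverse of the matrix of sums of squares and cross products will be computed from the original data in the form of deviations from the respective column means.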
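    The single-column result can be checked numerically. The sketch below, with hypothetical function names and made-up data, computes the lower right element of the inverse of the 2 x 2 moment matrix directly and compares it with the reciprocal of the sum of squared deviations from the mean.

```python
def lower_right_of_inverse(x):
    """a^{22}: lower right element of the inverse of [[n, sum x], [sum x, sum x^2]]."""
    n = len(x)
    sx = sum(x)
    sxx = sum(v * v for v in x)
    det = n * sxx - sx * sx
    # For a 2 x 2 matrix [[p, q], [q, r]], the lower right element of the inverse is p / det.
    return n / det


def inverse_sum_of_squared_deviations(x):
    """1 / sum (x_i - xbar)^2, that is, (x'M0x)^{-1}."""
    mean = sum(x) / len(x)
    return 1.0 / sum((v - mean) ** 2 for v in x)


x = [1.0, 2.0, 4.0, 7.0]
direct = lower_right_of_inverse(x)
via_deviations = inverse_sum_of_squared_deviations(x)
```

    For these data both quantities equal 1/21, since the sum of squared deviations is 21.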
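    A.5.5 KRONECKER PRODUCTS

    A calculation that helps to condense the notation when dealing with sets of regression models (see Chapters 13 and 14) is the Kronecker product. For general matrices $A$ and $B$,

    $$A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1K}B \\ a_{21}B & a_{22}B & \cdots & a_{2K}B \\ \vdots & \vdots & & \vdots \\ a_{n1}B & a_{n2}B & \cdots & a_{nK}B \end{bmatrix}. \tag{A-75}$$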


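    Notice that there is no requirement for conformability in this operation. The Kronecker product can be computed for any pair of matrices. If $A$ is $K \times L$ and $B$ is $m \times n$, then $A \otimes B$ is $(Km) \times (Ln)$. For the Kronecker product,

    $$(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}. \tag{A-76}$$

    If $A$ is $M \times M$ and $B$ is $n \times n$, then

    $$|A \otimes B| = |A|^n |B|^M, \qquad (A \otimes B)' = A' \otimes B', \qquad \operatorname{tr}(A \otimes B) = \operatorname{tr}(A)\operatorname{tr}(B).$$

    For $A$, $B$, $C$, and $D$ such that the products are defined, $(A \otimes B)(C \otimes D) = AC \otimes BD$.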
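    A minimal sketch of the Kronecker product on nested lists follows; the helper name `kron` and the example matrices are assumptions chosen for illustration. It shows the dimension rule and the trace rule stated above.

```python
def kron(A, B):
    """Kronecker product of A (K x L) and B (m x n), returned as a (K*m) x (L*n) nested list."""
    # Each row of A combined with each row of B yields one row of the product:
    # a_i1 * (row of B), a_i2 * (row of B), ..., laid side by side.
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def trace(M):
    return sum(M[k][k] for k in range(len(M)))


A = [[1, 2], [3, 4]]
B = [[2, 1], [0, 3]]
AB = kron(A, B)  # 4 x 4

# No conformability is needed: a 2 x 3 matrix times a 1 x 2 matrix gives 2 x 6.
wide = kron([[1, 2, 3], [4, 5, 6]], [[1, 10]])
```

    Here `trace(AB)` equals `trace(A) * trace(B)`, which is 25 for these matrices.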

    A.6 CHARACTERISTIC ROOTS AND VECTORS

    A useful set of results for analyzing a square matrix $A$ arises from the solutions to the set of equations

    $$Ac = \lambda c. \tag{A-77}$$

    The pairs of solutions are the characteristic vectors $c$ and characteristic roots $\lambda$. If $c$ is any solution vector, then $kc$ is also for any value of $k$. To remove the indeterminacy, $c$ is normalized so that $c'c = 1$. The solution then consists of $\lambda$ and the $n - 1$ unknown elements in $c$.

    A.6.1 THE CHARACTERISTIC EQUATION

    Solving (A-77) can, in principle, proceed as follows. First, (A-77) implies that $Ac = \lambda Ic$ or that $(A - \lambda I)c = 0$. This equation is a homogeneous system that has a nonzero solution only if the matrix $(A - \lambda I)$ is singular or has a zero determinant. Therefore, if $\lambda$ is a solution, then

    $$|A - \lambda I| = 0. \tag{A-78}$$

    This polynomial in $\lambda$ is the characteristic equation of $A$. For example, if

    $$A = \begin{bmatrix} 5 & 1 \\ 2 & 4 \end{bmatrix},$$

    then

    $$|A - \lambda I| = \begin{vmatrix} 5 - \lambda & 1 \\ 2 & 4 - \lambda \end{vmatrix} = (5 - \lambda)(4 - \lambda) - 2(1) = \lambda^2 - 9\lambda + 18.$$

    The two solutions are $\lambda = 6$ and $\lambda = 3$.


    In solving the characteristic equation, there is no guarantee that the characteristic roots will be real. In the preceding example, if the 2 in the lower left-hand corner of the matrix were $-2$ instead, then the solution would be a pair of complex values. The same result can emerge in the general $n \times n$ case. The characteristic roots of a symmetric matrix are real, however. [6] This result will be convenient because most of our applications will involve the characteristic roots and vectors of symmetric matrices. For an $n \times n$ matrix, the characteristic equation is an $n$th-order polynomial in $\lambda$. Its solutions may be $n$ distinct values, as in the preceding example, or may contain repeated values of $\lambda$, and may contain some zeros as well.
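    For a 2 x 2 matrix the characteristic equation is a quadratic, so its roots can be sketched with the quadratic formula. The function name below is hypothetical; complex arithmetic from the standard `cmath` module is used so that the real and the complex case are handled by the same code.

```python
import cmath


def characteristic_roots_2x2(A):
    """Roots of |A - lambda*I| = 0 for a 2 x 2 matrix A; the roots may be complex."""
    (a, b), (c, d) = A
    trace = a + d
    det = a * d - b * c
    # The characteristic equation is lambda^2 - trace*lambda + det = 0.
    root_of_disc = cmath.sqrt(trace * trace - 4 * det)
    return (trace + root_of_disc) / 2, (trace - root_of_disc) / 2


# The example from the text: roots 6 and 3.
real_pair = characteristic_roots_2x2([[5, 1], [2, 4]])

# Changing the lower left element to -2 gives lambda^2 - 9*lambda + 22 = 0,
# whose discriminant is negative, so the roots are a complex conjugate pair.
complex_pair = characteristic_roots_2x2([[5, 1], [-2, 4]])
```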
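    A.6.2 CHARACTERISTIC VECTORS

    With $\lambda$ in hand, the characteristic vectors are derived from the original problem,

    $$Ac = \lambda c \quad \text{or} \quad (A - \lambda I)c = 0. \tag{A-79}$$

    For the example in Section A.6.1, $\lambda = 6$ gives the pair of equations $-c_1 + c_2 = 0$ and $2c_1 - 2c_2 = 0$, while $\lambda = 3$ gives $2c_1 + c_2 = 0$ twice. Neither pair determines the values of $c_1$ and $c_2$. But this result was to be expected; it was the reason $c'c = 1$ was specified at the outset. The additional equation $c'c = 1$, however, produces complete solutions for the vectors.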
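    The normalization step can be sketched as follows for a 2 x 2 matrix. The function name is an assumption, and the sketch assumes the upper right element of the matrix is nonzero so that the first row of $(A - \lambda I)c = 0$ pins down the direction of $c$. The sign of each vector remains arbitrary.

```python
import math


def characteristic_vector_2x2(A, lam):
    """Unit-length c with (A - lam*I)c = 0, assuming A[0][1] != 0."""
    (a, b), (_, _) = A
    # The first row of (A - lam*I)c = 0 reads (a - lam)*c1 + b*c2 = 0,
    # which is satisfied by any multiple of (b, lam - a).
    c1, c2 = b, lam - a
    norm = math.hypot(c1, c2)  # impose c'c = 1
    return (c1 / norm, c2 / norm)


A = [[5.0, 1.0], [2.0, 4.0]]
c_for_6 = characteristic_vector_2x2(A, 6.0)  # proportional to (1, 1)
c_for_3 = characteristic_vector_2x2(A, 3.0)  # proportional to (1, -2)
```

    Because this matrix is not symmetric, the two vectors are not orthogonal.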
    A.6.3 GENERAL RESULTS FOR CHARACTERISTIC ROOTS AND VECTORS

    A $K \times K$ symmetric matrix has $K$ distinct characteristic vectors, $c_1, c_2, \ldots, c_K$. The corresponding characteristic roots, $\lambda_1, \lambda_2, \ldots, \lambda_K$, although real, need not be distinct. The characteristic vectors of a symmetric matrix are orthogonal, [7] which implies that for every $i \neq j$, $c_i'c_j = 0$. [8] It is convenient to collect the $K$ characteristic vectors in a $K \times K$ matrix whose $i$th column is the $c_i$ corresponding to $\lambda_i$,

    $$C = [c_1 \; c_2 \; \cdots \; c_K],$$

    and the $K$ characteristic roots in the same order, in a diagonal matrix,

    $$\Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_K \end{bmatrix}.$$

    Then, the full set of equations $Ac_k = \lambda_k c_k$ is contained in

    $$AC = C\Lambda. \tag{A-80}$$

    [6] A proof may be found in Theil (1971).

    [7] For proofs of these propositions, see Strang (1998).

    [8] This statement is not true if the matrix is not symmetric. For instance, it does not hold for the characteristic vectors computed in the first example. For nonsymmetric matrices, there is also a distinction between "right" characteristic vectors, $Ac = \lambda c$, and "left" characteristic vectors, $d'A = \lambda d'$, which may not be equal.


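    Since the vectors are orthogonal and $c_i'c_i = 1$, we have

    $$C'C = \begin{bmatrix} c_1'c_1 & c_1'c_2 & \cdots & c_1'c_K \\ c_2'c_1 & c_2'c_2 & \cdots & c_2'c_K \\ \vdots & \vdots & & \vdots \\ c_K'c_1 & c_K'c_2 & \cdots & c_K'c_K \end{bmatrix} = I. \tag{A-81}$$

    Result (A-81) implies that

    $$C' = C^{-1}. \tag{A-82}$$

    Consequently,

    $$CC' = CC^{-1} = I \tag{A-83}$$

    as well, so the rows as well as the columns of $C$ are orthogonal.

    A.6.4 DIAGONALIZATION AND SPECTRAL DECOMPOSITION OF A MATRIX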

    By premultiplying (A-80) by $C'$ and using (A-81), we can extract the characteristic roots of $A$.

    DEFINITION A.15 Diagonalization of a Matrix
    The diagonalization of a matrix $A$ is

    $$C'AC = C'C\Lambda = I\Lambda = \Lambda. \tag{A-84}$$

    Alternatively, by postmultiplying (A-80) by $C'$ and using (A-83), we obtain a useful representation of $A$.

    DEFINITION A.16 Spectral Decomposition of a Matrix
    The spectral decomposition of $A$ is

    $$A = C\Lambda C' = \sum_{k=1}^K \lambda_k c_k c_k'. \tag{A-85}$$

    In this representation, the $K \times K$ matrix $A$ is written as a sum of $K$ rank one matrices. This sum is also called the eigenvalue (or "own" value) decomposition of $A$. In this connection, the term signature of the matrix is sometimes used to describe the characteristic roots and vectors. Yet another pair of terms for the parts of this decomposition are the latent roots and latent vectors of $A$.
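    The sum of rank one matrices in (A-85) can be sketched for a symmetric 2 x 2 matrix, for which the roots and vectors have a closed form. The helper names `sym_eig_2x2` and `spectral_sum` and the example matrix are assumptions made for illustration; a general-purpose routine would use an iterative eigenvalue algorithm instead.

```python
import math


def sym_eig_2x2(A):
    """Characteristic roots and orthonormal vectors of a symmetric matrix [[a, b], [b, d]]."""
    a, b, d = A[0][0], A[0][1], A[1][1]
    if b == 0:
        # Already diagonal: the roots are the diagonal elements.
        return [a, d], [(1.0, 0.0), (0.0, 1.0)]
    half_gap = math.sqrt(((a - d) / 2) ** 2 + b * b)
    roots = [(a + d) / 2 + half_gap, (a + d) / 2 - half_gap]
    vectors = []
    for lam in roots:
        c1, c2 = b, lam - a  # solves the first row of (A - lam*I)c = 0
        norm = math.hypot(c1, c2)
        vectors.append((c1 / norm, c2 / norm))
    return roots, vectors


def spectral_sum(roots, vectors):
    """Rebuild A as the sum over k of lambda_k * c_k * c_k'."""
    S = [[0.0, 0.0], [0.0, 0.0]]
    for lam, c in zip(roots, vectors):
        for i in range(2):
            for j in range(2):
                S[i][j] += lam * c[i] * c[j]
    return S


A = [[2.0, 1.0], [1.0, 2.0]]
roots, vectors = sym_eig_2x2(A)  # roots 3 and 1
rebuilt = spectral_sum(roots, vectors)
```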
    A.6.5 RANK OF A MATRIX

    The diagonalization result enables us to obtain the rank of a matrix very easily To do so we can use the following result


    THEOREM A.3 Rank of a Product
    For any matrix $A$ and nonsingular matrices $B$ and $C$, the rank of $BAC$ is equal to the rank of $A$. The proof is simple. By (A-45), $\operatorname{rank}(BAC) = \operatorname{rank}[(BA)C] = \operatorname{rank}(BA)$. By (A-43), $\operatorname{rank}(BA) = \operatorname{rank}(A'B')$, and applying (A-45) again, $\operatorname{rank}(A'B') = \operatorname{rank}(A')$ since $B'$ is nonsingular if $B$ is nonsingular (once again, by A-43). Finally, applying (A-43) again to obtain $\operatorname{rank}(A') = \operatorname{rank}(A)$ gives the result.

    Since $C$ and $C'$ are nonsingular, we can use them to apply this result to (A-84). By an obvious substitution,

    $$\operatorname{rank}(A) = \operatorname{rank}(\Lambda). \tag{A-86}$$

    Finding the rank of $\Lambda$ is trivial. Since $\Lambda$ is a diagonal matrix, its rank is just the number of nonzero values on its diagonal. By extending this result, we can prove the following theorems. (Proofs are brief and are left for the reader.)

    THEOREM A.4 Rank of a Symmetric Matrix
    The rank of a symmetric matrix is the number of nonzero characteristic roots it contains.

    Note how this result enters the spectral decomposition given above. If any of the characteristic roots are zero, then the number of rank one matrices in the sum is reduced correspondingly. It would appear that this simple rule will not be useful if $A$ is not square. But recall that

    $$\operatorname{rank}(A) = \operatorname{rank}(A'A). \tag{A-87}$$

    Since $A'A$ is always square, we can use it instead of $A$. Indeed, we can use it even if $A$ is square, which leads to a fully general result.

    THEOREM A.5 Rank of a Matrix
    The rank of any matrix $A$ equals the number of nonzero characteristic roots in $A'A$.

    Since the row rank and column rank of a matrix are equal, we should be able to apply Theorem A.5 to $AA'$ as well. This process, however, requires an additional result.

    THEOREM A.6 Roots of an Outer Product Matrix
    The nonzero characteristic roots of $AA'$ are the same as those of $A'A$.


    The proof is left as an exercise. A useful special case the reader can examine is the characteristic roots of $aa'$ and $a'a$, where $a$ is an $n \times 1$ vector. If a characteristic root of a matrix is zero, then we have $Ac = \lambda c = 0$. Thus, if the matrix has a zero root, it must be singular. Otherwise, no nonzero $c$ would exist. In general, therefore, a matrix is singular, that is, it does not have full rank, if and only if it has at least one zero root.

    A.6.6 CONDITION NUMBER OF A MATRIX

    As the preceding might suggest, there is a discrete difference between full rank and short rank matrices. In analyzing data matrices such as the one in Section A.2, however, we shall often encounter cases in which a matrix is not quite short ranked, because it has all nonzero roots, but it is close. That is, by some measure, we can come very close to being able to write one column as a linear combination of the others. This case is important; we shall examine it at length in our discussion of multicollinearity. Our definitions of rank and determinant will fail to indicate this possibility, but an alternative measure, the condition number, is designed for that purpose. Formally, the condition number for a square matrix $A$ is

    $$\gamma = \left[\frac{\text{maximum root}}{\text{minimum root}}\right]^{1/2}. \tag{A-88}$$

    For nonsquare matrices $X$, such as the data matrix in the example, we use $A = X'X$. As a further refinement, because the characteristic roots are affected by the scaling of the columns of $X$, we scale the columns to have length 1 by dividing each column by its norm [see (A-55)]. For the $X$ in Section A.2, the largest characteristic root of $A$ is 4.9255 and the smallest is 0.0001543. Therefore, the condition number is 178.67, which is extremely large. (Values greater than 20 are large.) That the smallest root is close to zero compared with the largest means that this matrix is nearly singular. Matrices with large condition numbers are difficult to invert accurately.
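    The arithmetic behind the reported condition number is a one-line application of (A-88). The function name below is hypothetical; the two roots are the values quoted in the text.

```python
import math


def condition_number(max_root, min_root):
    """Square root of the ratio of the largest to the smallest characteristic root, as in (A-88)."""
    if min_root <= 0:
        raise ValueError("the smallest root must be positive")
    return math.sqrt(max_root / min_root)


gamma = condition_number(4.9255, 0.0001543)  # approximately 178.67
is_large = gamma > 20  # the rule of thumb quoted in the text
```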
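    A.6.7 TRACE OF A MATRIX

    The trace of a square $K \times K$ matrix is the sum of its diagonal elements:

    $$\operatorname{tr}(A) = \sum_{k=1}^K a_{kk}.$$

    Some easily proven results are

    $$\operatorname{tr}(cA) = c\,\operatorname{tr}(A), \tag{A-89}$$
    $$\operatorname{tr}(A') = \operatorname{tr}(A), \tag{A-90}$$
    $$\operatorname{tr}(A + B) = \operatorname{tr}(A) + \operatorname{tr}(B), \tag{A-91}$$
    $$\operatorname{tr}(I_K) = K, \tag{A-92}$$
    $$\operatorname{tr}(AB) = \operatorname{tr}(BA), \tag{A-93}$$
    $$a'a = \operatorname{tr}(a'a) = \operatorname{tr}(aa'),$$
    $$\operatorname{tr}(A'A) = \sum_{k=1}^K a_k'a_k = \sum_{i=1}^K \sum_{k=1}^K a_{ik}^2.$$

    The permutation rule can be extended to any cyclic permutation in a product:

    $$\operatorname{tr}(ABCD) = \operatorname{tr}(BCDA) = \operatorname{tr}(CDAB) = \operatorname{tr}(DABC). \tag{A-94}$$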
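    The cyclic permutation rule is easy to confirm on small integer matrices. The helpers `matmul` and `trace` and the three example matrices below are illustrative assumptions.

```python
def matmul(P, Q):
    """Product of two conformable matrices stored as nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def trace(M):
    """Sum of the diagonal elements of a square matrix."""
    return sum(M[k][k] for k in range(len(M)))


A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 1]]
C = [[2, 0], [1, 3]]

tr_ABC = trace(matmul(matmul(A, B), C))
tr_BCA = trace(matmul(matmul(B, C), A))
tr_CAB = trace(matmul(matmul(C, A), B))

# tr(A'A) is the sum of the squared elements of A.
A_transpose = [list(col) for col in zip(*A)]
tr_AtA = trace(matmul(A_transpose, A))
```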


    By using (A-84), we obtain

    $$\operatorname{tr}(C'AC) = \operatorname{tr}(ACC') = \operatorname{tr}(AI) = \operatorname{tr}(A) = \operatorname{tr}(\Lambda). \tag{A-95}$$

    Since $\Lambda$ is diagonal with the roots of $A$ on its diagonal, the general result is the following.

    THEOREM A.7 Trace of a Matrix
    The trace of a matrix equals the sum of its characteristic roots. (A-96)

    A.6.8 DETERMINANT OF A MATRIX

    Recalling how tedious the calculation of a determinant promised to be, we find that the following is particularly useful. Since $C'AC = \Lambda$,

    $$|C'AC| = |\Lambda|. \tag{A-97}$$

    Using a number of earlier results, we have, for orthogonal matrix $C$,

    $$|C'AC| = |C'| \cdot |A| \cdot |C| = |C'| \cdot |C| \cdot |A| = |C'C| \cdot |A| = |I| \cdot |A| = 1 \cdot |A| = |A| = |\Lambda|. \tag{A-98}$$

    Since $|\Lambda|$ is just the product of its diagonal elements, the following is implied.

    THEOREM A.8 Determinant of a Matrix
    The determinant of a matrix equals the product of its characteristic roots. (A-99)

    Notice that we get the expected result if any of these roots is zero. Since the determinant is the product of the roots, it follows that a matrix is singular if and only if its determinant is zero and, in turn, if and only if it has at least one zero characteristic root.

    A.6.9 POWERS OF A MATRIX

    We often use expressions involving powers of matrices, such as $AA = A^2$. For positive integer powers, these expressions can be computed by repeated multiplication. But this does not show how to handle a problem such as finding a $B$ such that $B^2 = A$, that is, the square root of a matrix. The characteristic roots and vectors provide a simple solution. Consider first

    $$AA = A^2 = (C\Lambda C')(C\Lambda C') = C\Lambda C'C\Lambda C' = C\Lambda I\Lambda C' = C\Lambda\Lambda C' = C\Lambda^2C'. \tag{A-100}$$

    Two results follow. Since $\Lambda^2$ is a diagonal matrix whose nonzero elements are the squares of those in $\Lambda$, the following is implied.

    For any symmetric matrix, the characteristic roots of $A^2$ are the squares of those of $A$, and the characteristic vectors are the same. (A-101)


    The proof is obtained by observing that the last line in (A-100) is the eigenvalue decomposition of the matrix $B = AA$. Since $A^3 = AA^2$ and so on, (A-101) extends to any positive integer. By convention, for any $A$, $A^0 = I$. Thus, for any symmetric matrix $A$, $A^K = C\Lambda^KC'$, $K = 0, 1, \ldots$. Hence, the characteristic roots of $A^K$ are $\lambda^K$, whereas the characteristic vectors are the same as those of $A$. If $A$ is nonsingular, so that all its roots $\lambda_i$ are nonzero, then this proof can be extended to negative powers as well. If $A^{-1}$ exists, then

    $$A^{-1} = (C\Lambda C')^{-1} = (C')^{-1}\Lambda^{-1}C^{-1} = C\Lambda^{-1}C', \tag{A-102}$$

    where we have used the earlier result, $C' = C^{-1}$. This gives an important result that is useful for analyzing inverse matrices.

    THEOREM A.9 Characteristic Roots of an Inverse Matrix
    If $A^{-1}$ exists, then the characteristic roots of $A^{-1}$ are the reciprocals of those of $A$, and the characteristic vectors are the same.

    By extending the notion of repeated multiplication, we now have a more general result.

    THEOREM A.10 Characteristic Roots of a Matrix Power
    For any nonsingular symmetric matrix $A = C\Lambda C'$, $A^K = C\Lambda^KC'$, $K = \ldots, -2, -1, 0, 1, 2, \ldots$.

    We now turn to the general problem of how to compute the square root of a matrix. In the scalar case, the value would have to be nonnegative. The matrix analog to this requirement is that all the characteristic roots are nonnegative. Consider, then, the candidate

    $$A^{1/2} = C\Lambda^{1/2}C' = C \begin{bmatrix} \sqrt{\lambda_1} & 0 & \cdots & 0 \\ 0 & \sqrt{\lambda_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{\lambda_n} \end{bmatrix} C'. \tag{A-103}$$

    This equation satisfies the requirement for a square root, since

    $$A^{1/2}A^{1/2} = C\Lambda^{1/2}C'C\Lambda^{1/2}C' = C\Lambda C' = A. \tag{A-104}$$

    If we continue in this fashion, we can define the powers of a matrix more generally, still assuming that all the characteristic roots are nonnegative. For example, $A^{1/3} = C\Lambda^{1/3}C'$. If all the roots are strictly positive, we can go one step further and extend the result to any real power. For reasons that will be made clear in the next section, we say that a matrix with positive characteristic roots is positive definite. It is the matrix analog to a positive number.

    DEFINITION A.17 Real Powers of a Positive Definite Matrix
    For a positive definite matrix $A$,

    $$A^r = C\Lambda^rC', \quad \text{for any real number } r. \tag{A-105}$$


    The characteristic roots of $A^r$ are the $r$th power of those of $A$, and the characteristic vectors are the same. If $A$ is only nonnegative definite, that is, has roots that are either zero or positive, then (A-105) holds only for nonnegative $r$.
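    Real powers of a positive definite matrix can be sketched for the 2 x 2 symmetric case, where the roots and vectors are available in closed form. The names `sym_eig_2x2` and `sym_power` are hypothetical, and the sketch assumes all roots are strictly positive; it computes $C\Lambda^rC'$ as the sum of $\lambda_k^r c_k c_k'$.

```python
import math


def sym_eig_2x2(A):
    """Characteristic roots and orthonormal vectors of a symmetric matrix [[a, b], [b, d]]."""
    a, b, d = A[0][0], A[0][1], A[1][1]
    if b == 0:
        return [a, d], [(1.0, 0.0), (0.0, 1.0)]
    half_gap = math.sqrt(((a - d) / 2) ** 2 + b * b)
    roots = [(a + d) / 2 + half_gap, (a + d) / 2 - half_gap]
    vectors = []
    for lam in roots:
        c1, c2 = b, lam - a
        norm = math.hypot(c1, c2)
        vectors.append((c1 / norm, c2 / norm))
    return roots, vectors


def sym_power(A, r):
    """A^r = C Lambda^r C' for a positive definite symmetric 2 x 2 matrix A."""
    roots, vectors = sym_eig_2x2(A)
    P = [[0.0, 0.0], [0.0, 0.0]]
    for lam, c in zip(roots, vectors):
        for i in range(2):
            for j in range(2):
                P[i][j] += (lam ** r) * c[i] * c[j]
    return P


def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


A = [[2.0, 1.0], [1.0, 2.0]]        # roots 3 and 1, so A is positive definite
root_A = sym_power(A, 0.5)          # square root: root_A times root_A gives A
inv_A = sym_power(A, -1)            # inverse: [[2/3, -1/3], [-1/3, 2/3]]
```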
    A.6.10 IDEMPOTENT MATRICES

    Idempotent matrices are equal to their squares [see (A-37) to (A-39)]. In view of their importance in econometrics, we collect a few results related to idempotent matrices at this point. First, (A-101) implies that if $\lambda$ is a characteristic root of an idempotent matrix, then $\lambda = \lambda^K$ for all nonnegative integers $K$. As such, if $A$ is a symmetric idempotent matrix, then all its roots are one or zero. Assume that all the roots of $A$ are one. Then $\Lambda = I$, and $A = C\Lambda C' = CIC' = CC' = I$. If the roots are not all one, then one or more are zero. Consequently, we have the following results for symmetric idempotent matrices: [9]

    The only full rank, symmetric idempotent matrix is the identity matrix $I$. (A-106)

    All symmetric idempotent matrices except the identity matrix are singular. (A-107)

    The final result on idempotent matrices is obtained by observing that the count of the nonzero roots of $A$ is also equal to their sum. By combining Theorems A.5 and A.7 with the result that for an idempotent matrix, the roots are all zero or one, we obtain this result:

    The rank of a symmetric idempotent matrix is equal to its trace. (A-108)

    A.6.11 FACTORING A MATRIX

    In some applications, we shall require a matrix $P$ such that $P'P = A^{-1}$. One choice is $P = \Lambda^{-1/2}C'$, so that

    $$P'P = (C')'(\Lambda^{-1/2})'\Lambda^{-1/2}C' = C\Lambda^{-1}C',$$

    as desired. [10] Thus, the spectral decomposition of $A$, $A = C\Lambda C'$, is a useful result for this kind of computation.

    The Cholesky factorization of a symmetric positive definite matrix is an alternative representation that is useful in regression analysis. Any symmetric positive definite matrix $A$ may be written as the product of a lower triangular matrix $L$ and its transpose (which is an upper triangular matrix) $L' = U$. Thus, $A = LU$. This result is the Cholesky decomposition of $A$. The square roots of the diagonal elements of $L$, $d_i$, are the Cholesky values of $A$. By arraying these in a diagonal matrix $D$, we may also write $A = LD^{-1}D^2D^{-1}U = L^*D^2U^*$, which is similar to the spectral decomposition in (A-85). The usefulness of this formulation arises when the inverse of $A$ is required. Once $L$ is
    [9] Not all idempotent matrices are symmetric. We shall not encounter any asymmetric ones in our work, however.

    [10] We say that this is "one" choice because if $A$ is symmetric, as it will be in all our applications, there are other candidates. The reader can easily verify that $C\Lambda^{-1/2}C' = A^{-1/2}$ works as well.


    computed, finding $A^{-1} = U^{-1}L^{-1}$ is also straightforward as well as extremely fast and accurate. Most recently developed econometric software packages use this technique for inverting positive definite matrices.

    A third type of decomposition of a matrix is useful for numerical analysis when the inverse is difficult to obtain because the columns of $A$ are "nearly" collinear. Any $n \times K$ matrix $A$ for which $n \geq K$ can be written in the form $A = UWV'$, where $U$ is an orthogonal $n \times K$ matrix, that is, $U'U = I_K$; $W$ is a $K \times K$ diagonal matrix such that $w_i \geq 0$; and $V$ is a $K \times K$ matrix such that $V'V = I_K$. This result is called the singular value decomposition (SVD) of $A$, and $w_i$ are the singular values of $A$. [11] (Note that if $A$ is square, then the spectral decomposition is a singular value decomposition.) As with the Cholesky decomposition, the usefulness of the SVD arises in inversion, in this case, of $A'A$. By multiplying it out, we obtain that $(A'A)^{-1}$ is simply $VW^{-2}V'$. Once the SVD of $A$ is computed, the inversion is trivial. The other advantage of this format is its numerical stability, which is discussed at length in Press et al. (1986).

    Press et al. (1986) recommend the SVD approach as the method of choice for solving least squares problems because of its accuracy and numerical stability. A commonly used alternative method similar to the SVD approach is the QR decomposition. Any $n \times K$ matrix $X$ with $n \geq K$ can be written in the form $X = QR$, in which the columns of $Q$ are orthonormal ($Q'Q = I$) and $R$ is an upper triangular matrix. Decomposing $X$ in this fashion allows an extremely accurate solution to the least squares problem that does not involve inversion or direct solution of the normal equations. Press et al. suggest that this method may have problems with rounding errors in problems when $X$ is nearly of short rank, but based on other published results, this concern seems relatively minor. [12]
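    The Cholesky factorization $A = LL'$ described in this section can be sketched with the standard row-by-row recursion. This is a textbook algorithm written under the assumption that the input is symmetric positive definite; the function name and the 3 x 3 example are illustrative, and this is not a claim about how any particular econometrics package implements it.

```python
import math


def cholesky(A):
    """Lower triangular L with L L' = A, for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Inner product of the already computed parts of rows i and j of L.
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                pivot = A[i][i] - s
                if pivot <= 0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(pivot)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = cholesky(A)  # rows (2, 0, 0), (6, 1, 0), (-8, 5, 3)
```

    Because $L$ is triangular, systems involving $A$ can then be solved by forward and back substitution without forming an explicit inverse.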

    A.6.12 THE GENERALIZED INVERSE OF A MATRIX

    Inverse matrices are fundamental in econometrics. Although we shall not require them much in our treatment in this book, there are more general forms of inverse matrices than we have considered thus far. A generalized inverse of a matrix $A$ is another matrix $A^+$ that satisfies the following requirements:

    1. $AA^+A = A$.
    2. $A^+AA^+ = A^+$.
    3. $A^+A$ is symmetric.
    4. $AA^+$ is symmetric.

    A unique $A^+$ can be found for any matrix, whether $A$ is singular or not, or even if $A$ is not square. [13] The unique matrix that satisfies all four requirements is called the Moore-Penrose inverse or pseudoinverse of $A$. If $A$ happens to be square and nonsingular, then the generalized inverse will be the familiar ordinary inverse. But if $A^{-1}$ does not exist, then $A^+$ can still be computed. An important special case is the overdetermined system of equations $Ab = y$,

    [11] Discussion of the singular value decomposition (and listings of computer programs for the computations) may be found in Press et al. (1986).

    [12] The National Institute of Standards and Technology (NIST) has published a suite of benchmark problems that test the accuracy of least squares computations (http://www.nist.gov/itl/div898/strd). Using these problems, which include some extremely difficult, ill-conditioned data sets, we found that the QR method would reproduce all the NIST certified solutions to 15 digits of accuracy, which suggests that the QR method should be satisfactory for all but the worst problems.

    [13] A proof of uniqueness, with several other results, may be found in Theil (1983).


    where $A$ has $n$ rows, $K < n$ columns, and column rank equal to $R \leq K$. Suppose that $R$ equals $K$, so that $(A'A)^{-1}$ exists. Then the Moore-Penrose inverse of $A$ is

    $$A^+ = (A'A)^{-1}A',$$

    which can be verified by multiplication. A "solution" to the system of equations can be written $b = A^+y$. This is the vector that minimizes the length of $Ab - y$. Recall this was the solution to the least squares problem obtained in Section A.4.4. If $y$ lies in the column space of $A$, the vector $Ab - y$ will be zero, but otherwise, it will not.

    Now suppose that $A$ does not have full rank. The previous solution cannot be computed. An alternative solution can be obtained, however. We continue to use the matrix $A'A$. In the spectral decomposition of Section A.6.4, if $A$ has rank $R$, then there are $R$ terms in the summation in (A-85). In (A-102), the spectral decomposition using the reciprocals of the characteristic roots is used to compute the inverse. To compute the Moore-Penrose inverse, we apply this calculation to $A'A$, using only the nonzero roots, then postmultiply the result by $A'$. Let $C_1$ be the $R$ characteristic vectors corresponding to the nonzero roots, which we array in the diagonal matrix $\Lambda_1$. Then the Moore-Penrose inverse is

    $$A^+ = C_1\Lambda_1^{-1}C_1'A',$$

    which is very similar to the previous result. If $A$ is a symmetric matrix with rank $R \leq K$, the Moore-Penrose inverse is computed precisely as in the preceding equation without postmultiplying by $A'$. Thus, for a symmetric matrix $A$,

    $$A^+ = C_1\Lambda_1^{-1}C_1',$$

    where $\Lambda_1^{-1}$ is a diagonal matrix containing the reciprocals of the nonzero roots of $A$.
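    For the full column rank case, the formula $A^+ = (A'A)^{-1}A'$ and the four defining requirements can be checked on a small example. The sketch below assumes a matrix with exactly two columns so that $A'A$ can be inverted in closed form; all function names and data are illustrative.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def pinv_full_column_rank(A):
    """Moore-Penrose inverse (A'A)^{-1} A' of an n x 2 matrix with full column rank."""
    At = transpose(A)
    return matmul(inverse_2x2(matmul(At, A)), At)


A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]   # overdetermined: 3 equations, 2 unknowns
A_plus = pinv_full_column_rank(A)

y = [[1.0], [2.0], [2.0]]
b = matmul(A_plus, y)                       # least squares solution b = A+ y
```

    For these data the least squares coefficients are 7/6 and 1/2.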

    A.7 QUADRATIC FORMS AND DEFINITE MATRICES

    Many optimization problems involve double sums of the form

    $$q = \sum_{i=1}^n \sum_{j=1}^n x_i x_j a_{ij}. \tag{A-109}$$

    This quadratic form can be written $q = x'Ax$, where $A$ is a symmetric matrix. In general, $q$ may be positive, negative, or zero; it depends on $A$ and $x$. There are some matrices, however, for which $q$ will be positive regardless of $x$, and others for which $q$ will always be negative (or nonnegative or nonpositive). For a given matrix $A$,

    1. If $x'Ax > (<)\, 0$ for all nonzero $x$, then $A$ is positive (negative) definite.
    2. If $x'Ax \geq (\leq)\, 0$ for all nonzero $x$, then $A$ is nonnegative definite or positive semidefinite (nonpositive definite).

    It might seem that it would be impossible to check a matrix for definiteness, since $x$ can be chosen arbitrarily. But we have already used the set of results necessary to do so. Recall that a


    symmetric matrix can be decomposed into $A = C\Lambda C'$. Therefore, the quadratic form can be written as $x'Ax = x'C\Lambda C'x$. Let $y = C'x$. Then

    $$x'Ax = y'\Lambda y = \sum_{i=1}^n \lambda_i y_i^2. \tag{A-110}$$

    If $\lambda_i$ is positive for all $i$, then regardless of $y$, that is, regardless of $x$, $q$ will be positive. This case was identified earlier as a positive definite matrix. Continuing this line of reasoning, we obtain the following theorem.

    THEOREM A.11 Definite Matrices
    Let $A$ be a symmetric matrix. If all the characteristic roots of $A$ are positive (negative), then $A$ is positive definite (negative definite). If some of the roots are zero, then $A$ is nonnegative (nonpositive) definite if the remainder are positive (negative). If $A$ has both negative and positive roots, then $A$ is indefinite.

    The preceding statements give, in each case, the "if" parts of the theorem. To establish the "only if" parts, assume that the condition on the roots does not hold. This must lead to a contradiction. For example, if some $\lambda$ can be negative, then $y'\Lambda y$ could be negative for some $y$, so $A$ cannot be positive definite.
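    Theorem A.11 turns the definiteness check into an inspection of the signs of the roots. The sketch below applies it to symmetric 2 x 2 matrices, for which the roots have a closed form; the function names and the tolerance used to treat a root as zero are assumptions.

```python
import math


def sym_roots_2x2(A):
    """Characteristic roots of a symmetric matrix [[a, b], [b, d]]."""
    a, b, d = A[0][0], A[0][1], A[1][1]
    half_gap = math.sqrt(((a - d) / 2) ** 2 + b * b)
    return [(a + d) / 2 + half_gap, (a + d) / 2 - half_gap]


def classify(A, tol=1e-12):
    """Classify a symmetric 2 x 2 matrix by the signs of its characteristic roots."""
    roots = sym_roots_2x2(A)
    positive = [lam > tol for lam in roots]
    negative = [lam < -tol for lam in roots]
    if all(positive):
        return "positive definite"
    if all(negative):
        return "negative definite"
    if any(positive) and any(negative):
        return "indefinite"
    if not any(negative):
        return "nonnegative definite"
    return "nonpositive definite"


labels = [
    classify([[2.0, 1.0], [1.0, 2.0]]),    # roots 3 and 1
    classify([[1.0, 1.0], [1.0, 1.0]]),    # roots 2 and 0
    classify([[1.0, 2.0], [2.0, 1.0]]),    # roots 3 and -1
    classify([[-2.0, 0.0], [0.0, -1.0]]),  # roots -1 and -2
]
```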
    A.7.1 NONNEGATIVE DEFINITE MATRICES

    A case of particular interest is that of nonnegative definite matrices. Theorem A.11 implies a number of related results.

    If $A$ is nonnegative definite, then $|A| \geq 0$. (A-111)
    Proof: The determinant is the product of the roots, which are nonnegative.

    The converse, however, is not true. For example, a $2 \times 2$ matrix with two negative roots is clearly not positive definite, but it does have a positive determinant.

    If $A$ is positive definite, so is $A^{-1}$. (A-112)
    Proof: The roots are the reciprocals of those of $A$, which are, therefore, positive.

    The identity matrix $I$ is positive definite. (A-113)
    Proof: $x'Ix = x'x > 0$ if $x \neq 0$.

    A very important result for regression analysis is

    If $A$ is $n \times K$ with full column rank and $n > K$, then $A'A$ is positive definite and $AA'$ is nonnegative definite. (A-114)
    Proof: By assumption, $Ax \neq 0$. So $x'A'Ax = (Ax)'(Ax) = y'y = \sum_j y_j^2 > 0$.


    A similar proof establishes the nonnegative definiteness of $AA'$. The difference in the latter case is that because $A$ has more rows than columns there is an $x$ such that $A'x = 0$. Thus, in the proof, we only have $y'y \geq 0$. The case in which $A$ does not have full column rank is the same as that of $AA'$.

    If $A$ is positive definite and $B$ is a nonsingular matrix, then $B'AB$ is positive definite. (A-115)
    Proof: $x'B'ABx = y'Ay > 0$, where $y = Bx$. But $y$ cannot be $0$ because $B$ is nonsingular.

    Finally, note that for $A$ to be negative definite, all $A$'s characteristic roots must be negative. But, in this case, $|A|$ is positive if $A$ is of even order and negative if $A$ is of odd order.
    A.7.2 IDEMPOTENT QUADRATIC FORMS

    Quadratic forms in idempotent matrices play an important role in the distributions of many test statistics. As such, we shall encounter them fairly often. Two central results are of interest.

    Every symmetric idempotent matrix is nonnegative definite. (A-116)
    Proof: All roots are one or zero; hence, the matrix is nonnegative definite by definition.

    Combining this with some earlier results yields a result used in determining the sampling distribution of most of the standard test statistics.

    If $A$ is symmetric and idempotent, $n \times n$ with rank $J$, then every quadratic form in $A$ can be written $x'Ax = \sum_{j=1}^J y_j^2$. (A-117)
    Proof: This result is (A-110) with $\lambda =$ one or zero.
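    The matrix $M^0 = I - (1/n)\mathbf{i}\mathbf{i}'$ from Section A.5.4 is a convenient symmetric idempotent matrix on which to try these results. The sketch below, with hypothetical helper names and made-up data, checks that $M^0M^0 = M^0$, that its trace (and hence its rank) is $n - 1$, and that the quadratic form $x'M^0x$ is the sum of squared deviations from the mean.

```python
def centering_matrix(n):
    """M0 = I - (1/n) i i', where i is a column of n ones."""
    return [[(1.0 if r == c else 0.0) - 1.0 / n for c in range(n)] for r in range(n)]


def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def quadratic_form(A, x):
    """x'Ax for a square matrix A and a vector x."""
    n = len(x)
    return sum(x[r] * A[r][c] * x[c] for r in range(n) for c in range(n))


n = 4
M0 = centering_matrix(n)
M0_squared = matmul(M0, M0)
trace_M0 = sum(M0[k][k] for k in range(n))       # n - 1 = 3

x = [1.0, 2.0, 4.0, 7.0]
q = quadratic_form(M0, x)                         # sum of squared deviations
mean = sum(x) / n
sum_sq_dev = sum((v - mean) ** 2 for v in x)      # 21 for these data
```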

    A.7.3 COMPARING MATRICES

    Derivations in econometrics often focus on whether one matrix is "larger" than another. We now consider how to make such a comparison. As a starting point, the two matrices must have the same dimensions. A useful comparison is based on

    $$d = x'Ax - x'Bx = x'(A - B)x.$$

    If $d$ is always positive for any nonzero vector $x$, then by this criterion, we can say that $A$ is larger than $B$. The reverse would apply if $d$ is always negative. It follows from the definition that

    if $d > 0$ for all nonzero $x$, then $A - B$ is positive definite. (A-118)

    If $d$ is only greater than or equal to zero, then $A - B$ is nonnegative definite. The ordering is not complete. For some pairs of matrices, $d$ could have either sign, depending on $x$. In this case, there is no simple comparison.

    A particular case of the general result which we will encounter frequently is:

    If $A$ is positive definite and $B$ is nonnegative definite, then $A + B \geq A$. (A-119)

    Consider, for example, the "updating formula" introduced in (A-66). This uses a matrix

    $$A = B'B + bb' \geq B'B.$$

    Finally, in comparing matrices, it may be more convenient to compare their inverses. The result analogous to a familiar result for scalars is:

    If $A > B$, then $B^{-1} > A^{-1}$. (A-120)


    In order to establish this intuitive result, we would make use of the following, which is proved in Goldberger (1964, Chapter 2):

    THEOREM A.12 Ordering for Positive Definite Matrices
    If $A$ and $B$ are two positive definite matrices with the same dimensions and if every characteristic root of $A$ is larger than (at least as large as) the corresponding characteristic root of $B$ when both sets of roots are ordered from largest to smallest, then $A - B$ is positive (nonnegative) definite.

    The roots of the inverse are the reciprocals of the roots of the original matrix, so the theorem can be applied to the inverse matrices.

    A.8 CALCULUS AND MATRIX ALGEBRA [14]

    A.8.1 DIFFERENTIATION AND THE TAYLOR SERIES

    A variable $y$ is a function of another variable $x$, written

    $$y = f(x), \quad y = g(x), \quad y = y(x),$$

    and so on, if each value of $x$ is associated with a single value of $y$. In this relationship, $y$ and $x$ are sometimes labeled the dependent variable and the independent variable, respectively. Assuming that the function $f(x)$ is continuous and differentiable, we obtain the following derivatives:

    $$f'(x) = \frac{dy}{dx}, \qquad f''(x) = \frac{d^2y}{dx^2},$$

    and so on. A frequent use of the derivatives of $f(x)$ is in the Taylor series approximation. A Taylor series is a polynomial approximation to $f(x)$. Letting $x^0$ be an arbitrarily chosen expansion point,

    $$f(x) \approx f(x^0) + \sum_{i=1}^P \frac{1}{i!} \frac{d^i f(x^0)}{d(x^0)^i} (x - x^0)^i. \tag{A-121}$$

    The choice of the number of terms is arbitrary; the more that are used, the more accurate the approximation will be. The approximation used most frequently in econometrics is the linear approximation,

    $$f(x) \approx \alpha + \beta x, \tag{A-122}$$

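    where, by collecting terms in (A-121), $\alpha = [f(x^0) - f'(x^0)x^0]$ and $\beta = f'(x^0)$. The superscript "0" indicates that the function is evaluated at $x^0$. The quadratic approximation is

    $$f(x) \approx \alpha + \beta x + \gamma x^2, \tag{A-123}$$

    where $\alpha = \left[f^0 - f'^0x^0 + \tfrac{1}{2}f''^0(x^0)^2\right]$, $\beta = \left[f'^0 - f''^0x^0\right]$, and $\gamma = \tfrac{1}{2}f''^0$.

    [14] For a complete exposition, see Magnus and Neudecker (1988).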
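    The coefficients of the linear approximation (A-122) and the quadratic approximation (A-123) can be computed directly from the value and the first two derivatives of the function at the expansion point. The sketch below uses $f(x) = \ln x$ expanded around $x^0 = 1$ as an assumed example; the function name is hypothetical.

```python
import math


def taylor_coefficients(f0, f1, f2, x0):
    """Coefficients of the linear and quadratic approximations around x0.

    f0, f1, f2 are the function value and its first and second derivatives at x0.
    """
    linear = (f0 - f1 * x0, f1)                      # (alpha, beta) in (A-122)
    quadratic = (f0 - f1 * x0 + 0.5 * f2 * x0 ** 2,  # alpha in (A-123)
                 f1 - f2 * x0,                       # beta
                 0.5 * f2)                           # gamma
    return linear, quadratic


x0 = 1.0
# For f(x) = ln x: f'(x) = 1/x and f''(x) = -1/x^2.
linear, quadratic = taylor_coefficients(math.log(x0), 1 / x0, -1 / x0 ** 2, x0)

x = 1.1
linear_value = linear[0] + linear[1] * x
quadratic_value = quadratic[0] + quadratic[1] * x + quadratic[2] * x ** 2
linear_error = abs(linear_value - math.log(x))
quadratic_error = abs(quadratic_value - math.log(x))
```

    At $x = 1.1$ the quadratic approximation is closer to $\ln x$ than the linear one, consistent with the remark that additional terms improve accuracy near the expansion point.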


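    We can regard a function $y = f(x_1, x_2, \ldots, x_n)$ as a scalar-valued function of a vector; that is, $y = f(x)$. The vector of partial derivatives, or gradient vector, or simply gradient, is

    $$\frac{\partial f(x)}{\partial x} = \begin{bmatrix} \partial y/\partial x_1 \\ \partial y/\partial x_2 \\ \vdots \\ \partial y/\partial x_n \end{bmatrix} = \begin{bmatrix} f_1 \\ f_2 \\ \vdots \\ f_n \end{bmatrix}. \tag{A-124}$$

    The vector $g(x)$ or $g$ is used to represent the gradient. Notice that it is a column vector. The shape of the derivative is determined by the denominator of the derivative.

    A second derivatives matrix or Hessian is computed as

    $$H = \begin{bmatrix} \partial^2y/\partial x_1\partial x_1 & \partial^2y/\partial x_1\partial x_2 & \cdots & \partial^2y/\partial x_1\partial x_n \\ \partial^2y/\partial x_2\partial x_1 & \partial^2y/\partial x_2\partial x_2 & \cdots & \partial^2y/\partial x_2\partial x_n \\ \vdots & \vdots & & \vdots \\ \partial^2y/\partial x_n\partial x_1 & \partial^2y/\partial x_n\partial x_2 & \cdots & \partial^2y/\partial x_n\partial x_n \end{bmatrix} = [f_{ij}]. \tag{A-125}$$

    In general, $H$ is a square, symmetric matrix. (The symmetry is obtained for continuous and continuously differentiable functions from Young's theorem.) Each column of $H$ is the derivative of $g$ with respect to the corresponding variable in $x'$. Therefore,

    $$H = \left[\frac{\partial(\partial y/\partial x)}{\partial x_1} \;\; \frac{\partial(\partial y/\partial x)}{\partial x_2} \;\; \cdots \;\; \frac{\partial(\partial y/\partial x)}{\partial x_n}\right] = \frac{\partial(\partial y/\partial x)}{\partial(x_1 \; x_2 \; \cdots \; x_n)} = \frac{\partial(\partial y/\partial x)}{\partial x'} = \frac{\partial^2y}{\partial x\,\partial x'}.$$

    The first-order, or linear Taylor series approximation is

    $$y \approx f(x^0) + \sum_{i=1}^n f_i(x^0)\left(x_i - x_i^0\right). \tag{A-126}$$

    The right-hand side is

    $$f(x^0) + \left[\frac{\partial f(x^0)}{\partial x^0}\right]'(x - x^0) = \left[f(x^0) - g(x^0)'x^0\right] + g(x^0)'x = \left[f^0 - g^{0\prime}x^0\right] + g^{0\prime}x.$$

    This produces the linear approximation, $y \approx \alpha + \beta'x$. The second-order, or quadratic, approximation adds the second-order terms in the expansion,

    $$\frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n f_{ij}^0\left(x_i - x_i^0\right)\left(x_j - x_j^0\right) = \frac{1}{2}(x - x^0)'H^0(x - x^0),$$

    to the preceding one. Collecting terms in the same manner as in (A-126), we have

    $$y \approx \alpha + \beta'x + \frac{1}{2}x'\Gamma x, \tag{A-127}$$

    where $\alpha = f^0 - g^{0\prime}x^0 + \tfrac{1}{2}x^{0\prime}H^0x^0$, $\beta = g^0 - H^0x^0$, and $\Gamma = H^0$.

    A linear function can be written

    $$y = a'x = x'a = \sum_{i=1}^n a_i x_i,$$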


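so

$$\frac{\partial(\mathbf{a}'\mathbf{x})}{\partial \mathbf{x}} = \mathbf{a}. \tag{A-128}$$

Note, in particular, that $\partial(\mathbf{a}'\mathbf{x})/\partial\mathbf{x} = \mathbf{a}$, not $\mathbf{a}'$. In a set of linear functions $\mathbf{y} = \mathbf{A}\mathbf{x}$, each element $y_i$ of $\mathbf{y}$ is $y_i = \mathbf{a}_i'\mathbf{x}$, where $\mathbf{a}_i'$ is the $i$th row of $\mathbf{A}$ [see (A-14)]. Therefore,

$$\frac{\partial y_i}{\partial \mathbf{x}} = \mathbf{a}_i = \text{transpose of } i\text{th row of } \mathbf{A},$$

and

$$\begin{bmatrix} \partial y_1/\partial \mathbf{x}' \\ \partial y_2/\partial \mathbf{x}' \\ \vdots \\ \partial y_n/\partial \mathbf{x}' \end{bmatrix} = \begin{bmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_n' \end{bmatrix}.$$

Collecting all terms, we find that $\partial \mathbf{A}\mathbf{x}/\partial \mathbf{x}' = \mathbf{A}$, whereas the more familiar form will be

$$\frac{\partial \mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = \mathbf{A}'. \tag{A-129}$$

A quadratic form is written

$$\mathbf{x}'\mathbf{A}\mathbf{x} = \sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j a_{ij}. \tag{A-130}$$

For example,

$$\mathbf{A} = \begin{bmatrix} 1 & 3 \\ 3 & 4 \end{bmatrix},$$

so that

$$\mathbf{x}'\mathbf{A}\mathbf{x} = 1x_1^2 + 4x_2^2 + 6x_1x_2.$$

Then

$$\frac{\partial \mathbf{x}'\mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = \begin{bmatrix} 2x_1 + 6x_2 \\ 6x_1 + 8x_2 \end{bmatrix} = \begin{bmatrix} 2 & 6 \\ 6 & 8 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = 2\mathbf{A}\mathbf{x}, \tag{A-131}$$

which is the general result when $\mathbf{A}$ is a symmetric matrix. If $\mathbf{A}$ is not symmetric, then

$$\frac{\partial(\mathbf{x}'\mathbf{A}\mathbf{x})}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}')\mathbf{x}. \tag{A-132}$$

Referring to the preceding double summation, we find that for each term, the coefficient on $a_{ij}$ is $x_i x_j$. Therefore,

$$\frac{\partial(\mathbf{x}'\mathbf{A}\mathbf{x})}{\partial a_{ij}} = x_i x_j.$$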


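The square matrix whose $ij$th element is $x_i x_j$ is $\mathbf{x}\mathbf{x}'$, so

$$\frac{\partial(\mathbf{x}'\mathbf{A}\mathbf{x})}{\partial \mathbf{A}} = \mathbf{x}\mathbf{x}'. \tag{A-133}$$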
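The derivative results for quadratic forms can be verified with finite differences. The sketch below is an illustration only: it compares a central-difference gradient of $\mathbf{x}'\mathbf{A}\mathbf{x}$ with the analytic form $(\mathbf{A} + \mathbf{A}')\mathbf{x}$, using the symmetric matrix from the example above and a hypothetical nonsymmetric matrix that generates the same quadratic form. The evaluation point and function names are assumptions.

```python
def quad_form(A, x):
    """Evaluate x'Ax as the double sum in (A-130)."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def analytic_gradient(A, x):
    """(A + A')x, which reduces to 2Ax when A is symmetric."""
    n = len(x)
    return [sum((A[i][j] + A[j][i]) * x[j] for j in range(n)) for i in range(n)]

def numeric_gradient(A, x, h=1e-6):
    """Central-difference approximation to the gradient of x'Ax."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        dn = list(x)
        up[i] += h
        dn[i] -= h
        grad.append((quad_form(A, up) - quad_form(A, dn)) / (2 * h))
    return grad

A_sym = [[1.0, 3.0], [3.0, 4.0]]      # matrix from the example in the text
A_nonsym = [[1.0, 5.0], [1.0, 4.0]]   # hypothetical: same quadratic form, not symmetric
x = [2.0, -1.0]                       # arbitrary evaluation point

g_sym = analytic_gradient(A_sym, x)
g_nonsym = analytic_gradient(A_nonsym, x)
g_numeric = numeric_gradient(A_sym, x)
```

Because both matrices imply $x_1^2 + 4x_2^2 + 6x_1x_2$, the two analytic gradients coincide, which illustrates why (A-132) depends on $\mathbf{A}$ only through $\mathbf{A} + \mathbf{A}'$.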

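Derivatives involving determinants appear in maximum likelihood estimation. From the cofactor expansion in (A-51),

$$\frac{\partial |\mathbf{A}|}{\partial a_{ij}} = (-1)^{i+j}|\mathbf{A}_{ij}| = c_{ij},$$

where $c_{ij}$ is the $ij$th cofactor in $\mathbf{A}$. The inverse of $\mathbf{A}$ can be computed using

$$\mathbf{A}^{-1}_{ij} = \frac{c_{ji}}{|\mathbf{A}|}$$

(note the reversal of the subscripts), which implies that

$$\frac{\partial \ln|\mathbf{A}|}{\partial a_{ij}} = \frac{(-1)^{i+j}|\mathbf{A}_{ij}|}{|\mathbf{A}|} = \frac{c_{ij}}{|\mathbf{A}|},$$

or, collecting terms,

$$\frac{\partial \ln|\mathbf{A}|}{\partial \mathbf{A}} = \mathbf{A}^{-1\prime}.$$

Since the matrices for which we shall make use of this calculation will be symmetric in our applications, the transposition will be unnecessary.

A.8.2 OPTIMIZATION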

Consider finding the $x$ where $f(x)$ is maximized or minimized. Since $f'(x)$ is the slope of $f(x)$, either optimum must occur where $f'(x) = 0$. Otherwise, the function will be increasing or decreasing at $x$. This situation implies the first-order or necessary condition for an optimum (maximum or minimum):

$$\frac{dy}{dx} = 0. \tag{A-134}$$

For a maximum, the function must be concave; for a minimum, it must be convex. The sufficient condition for an optimum is:

$$\text{For a maximum, } \frac{d^2y}{dx^2} < 0; \qquad \text{for a minimum, } \frac{d^2y}{dx^2} > 0. \tag{A-135}$$

Some functions, such as the sine and cosine functions, have many local optima, that is, many minima and maxima. A function such as $(\cos x)/(1 + x^2)$, which is a damped cosine wave, does as well but differs in that although it has many local maxima, it has one, at $x = 0$, at which $f(x)$ is greater than it is at any other point. Thus, $x = 0$ is the global maximum, whereas the other maxima are only local maxima. Certain functions, such as a quadratic, have only a single optimum. These functions are globally concave if the optimum is a maximum and globally convex if it is a minimum.

For maximizing or minimizing a function of several variables, the first-order conditions are

$$\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} = \mathbf{0}. \tag{A-136}$$


This result is interpreted in the same manner as the necessary condition in the univariate case. At the optimum, it must be true that no small change in any variable leads to an improvement in the function value. In the single-variable case, $d^2y/dx^2$ must be positive for a minimum and negative for a maximum. The second-order condition for an optimum in the multivariate case is that, at the optimizing value,

$$\mathbf{H} = \frac{\partial^2 f(\mathbf{x})}{\partial \mathbf{x}\,\partial \mathbf{x}'} \tag{A-137}$$

must be positive definite for a minimum and negative definite for a maximum.

In a single-variable problem, the second-order condition can usually be verified by inspection. This situation will not generally be true in the multivariate case. As discussed earlier, checking the definiteness of a matrix is, in general, a difficult problem. For most of the problems encountered in econometrics, however, the second-order condition will be implied by the structure of the problem. That is, the matrix $\mathbf{H}$ will usually be of such a form that it is always definite.

For an example of the preceding, consider the problem

$$\text{maximize}_{\mathbf{x}}\; R = \mathbf{a}'\mathbf{x} - \mathbf{x}'\mathbf{A}\mathbf{x},$$

where

$$\mathbf{a}' = (5 \;\; 4 \;\; 2) \quad \text{and} \quad \mathbf{A} = \begin{bmatrix} 2 & 1 & 3 \\ 1 & 3 & 2 \\ 3 & 2 & 5 \end{bmatrix}.$$

Using some now familiar results, we obtain

$$\frac{\partial R}{\partial \mathbf{x}} = \mathbf{a} - 2\mathbf{A}\mathbf{x} = \begin{bmatrix} 5 \\ 4 \\ 2 \end{bmatrix} - \begin{bmatrix} 4 & 2 & 6 \\ 2 & 6 & 4 \\ 6 & 4 & 10 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \mathbf{0}. \tag{A-138}$$

The solutions are

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 4 & 2 & 6 \\ 2 & 6 & 4 \\ 6 & 4 & 10 \end{bmatrix}^{-1}\begin{bmatrix} 5 \\ 4 \\ 2 \end{bmatrix} = \begin{bmatrix} 11.25 \\ 1.75 \\ -7.25 \end{bmatrix}.$$

The sufficient condition is that

$$\frac{\partial^2 R(\mathbf{x})}{\partial \mathbf{x}\,\partial \mathbf{x}'} = -2\mathbf{A} = \begin{bmatrix} -4 & -2 & -6 \\ -2 & -6 & -4 \\ -6 & -4 & -10 \end{bmatrix} \tag{A-139}$$

must be negative definite. The three characteristic roots of this matrix are $-15.746$, $-4$, and $-0.25403$. Since all three roots are negative, the matrix is negative definite, as required.

In the preceding, it was necessary to compute the characteristic roots of the Hessian to verify the sufficient condition. For a general matrix of order larger than 2, this will normally require a computer. Suppose, however, that $\mathbf{A}$ is of the form

$$\mathbf{A} = \mathbf{B}'\mathbf{B},$$

where $\mathbf{B}$ is some known matrix. Then, as shown earlier, we know that $\mathbf{A}$ will always be positive definite (assuming that $\mathbf{B}$ has full rank). In this case, it is not necessary to calculate the characteristic roots of $\mathbf{A}$ to verify the sufficient conditions.
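The worked example can be reproduced with a few lines of code. The sketch below solves the first-order conditions (A-138) by Gauss-Jordan elimination and then checks negative definiteness of the Hessian (A-139). Note one substitution: instead of computing characteristic roots as the text does, it uses the sign pattern of the leading principal minors, an equivalent test for a symmetric matrix. The helper functions `solve` and `det` are written here for illustration.

```python
def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [[float(v) for v in row] + [float(bi)] for row, bi in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def det(M):
    """Determinant by cofactor expansion (adequate for small matrices)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

a = [5.0, 4.0, 2.0]
A = [[2.0, 1.0, 3.0], [1.0, 3.0, 2.0], [3.0, 2.0, 5.0]]

# First-order conditions (A-138): a - 2Ax = 0, so solve 2Ax = a.
x = solve([[2.0 * v for v in row] for row in A], a)

# Objective value R = a'x - x'Ax at the solution.
Ax = [sum(A[i][j] * x[j] for j in range(3)) for i in range(3)]
R = sum(ai * xi for ai, xi in zip(a, x)) - sum(xi * v for xi, v in zip(x, Ax))

# Hessian (A-139) and its leading principal minors.
H = [[-2.0 * v for v in row] for row in A]
minors = [det([row[:k] for row in H[:k]]) for k in (1, 2, 3)]
# Negative definite if the minors alternate in sign, starting negative.
negative_definite = all((m < 0) if k % 2 == 1 else (m > 0)
                        for k, m in zip((1, 2, 3), minors))
```

The solution vector and the value of $R$ agree with the figures reported in the text for the unconstrained problem.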

A.8.3 CONSTRAINED OPTIMIZATION

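It is often necessary to solve an optimization problem subject to some constraints on the solution. One method is merely to "solve out" the constraints. For example, in the maximization problem considered earlier, suppose that the constraint $x_1 = x_2 - x_3$ is imposed on the solution. For a single constraint such as this one, it is possible merely to substitute the right-hand side of this equation for $x_1$ in the objective function and solve the resulting problem as a function of the remaining two variables. For more general constraints, however, or when there is more than one constraint, the method of Lagrange multipliers provides a more straightforward method of solving the problem. We

$$\text{maximize}_{\mathbf{x}}\; f(\mathbf{x}) \quad \text{subject to} \quad c_1(\mathbf{x}) = 0,\; c_2(\mathbf{x}) = 0,\; \ldots,\; c_J(\mathbf{x}) = 0. \tag{A-140}$$

The Lagrangean approach to this problem is to find the stationary points (that is, the points at which the derivatives are zero) of

$$L^*(\mathbf{x}, \boldsymbol{\lambda}) = f(\mathbf{x}) + \sum_{j=1}^{J} \lambda_j c_j(\mathbf{x}) = f(\mathbf{x}) + \boldsymbol{\lambda}'\mathbf{c}(\mathbf{x}). \tag{A-141}$$

The solutions satisfy the equations

$$\frac{\partial L^*}{\partial \mathbf{x}} = \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} + \frac{\partial \boldsymbol{\lambda}'\mathbf{c}(\mathbf{x})}{\partial \mathbf{x}} = \mathbf{0} \;\; (n \times 1), \qquad \frac{\partial L^*}{\partial \boldsymbol{\lambda}} = \mathbf{c}(\mathbf{x}) = \mathbf{0} \;\; (J \times 1). \tag{A-142}$$

The second term in $\partial L^*/\partial \mathbf{x}$ is

$$\frac{\partial \boldsymbol{\lambda}'\mathbf{c}(\mathbf{x})}{\partial \mathbf{x}} = \frac{\partial \mathbf{c}(\mathbf{x})'\boldsymbol{\lambda}}{\partial \mathbf{x}} = \left[\frac{\partial \mathbf{c}(\mathbf{x})'}{\partial \mathbf{x}}\right]\boldsymbol{\lambda} = \mathbf{C}'\boldsymbol{\lambda}, \tag{A-143}$$

where $\mathbf{C}$ is the matrix of derivatives of the constraints with respect to $\mathbf{x}$. The $j$th row of the $J \times n$ matrix $\mathbf{C}$ is the vector of derivatives of the $j$th constraint, $c_j(\mathbf{x})$, with respect to $\mathbf{x}'$. Upon collecting terms, the first-order conditions are

$$\frac{\partial L^*}{\partial \mathbf{x}} = \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} + \mathbf{C}'\boldsymbol{\lambda} = \mathbf{0}, \qquad \frac{\partial L^*}{\partial \boldsymbol{\lambda}} = \mathbf{c}(\mathbf{x}) = \mathbf{0}. \tag{A-144}$$

There is one very important aspect of the constrained solution to consider. In the unconstrained solution, we have $\partial f(\mathbf{x})/\partial \mathbf{x} = \mathbf{0}$. From (A-144), we obtain, for a constrained solution,

$$\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} = -\mathbf{C}'\boldsymbol{\lambda}, \tag{A-145}$$

which will not equal $\mathbf{0}$ unless $\boldsymbol{\lambda} = \mathbf{0}$. This equation has two important implications:

• The constrained solution cannot be superior to the unconstrained solution. This is implied by the nonzero gradient at the constrained solution. (That is, unless $\mathbf{C} = \mathbf{0}$, which could happen if the constraints were nonlinear. But, even if so, the solution is still no better than the unconstrained optimum.)

• If the Lagrange multipliers are zero, then the constrained solution will equal the unconstrained solution.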


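To continue the example begun earlier, suppose that we add the following conditions:

$$x_1 - x_2 + x_3 = 0, \qquad x_1 + x_2 + x_3 = 0.$$

To put this in the format of the general problem, write the constraints as $\mathbf{c}(\mathbf{x}) = \mathbf{C}\mathbf{x} = \mathbf{0}$, where

$$\mathbf{C} = \begin{bmatrix} 1 & -1 & 1 \\ 1 & 1 & 1 \end{bmatrix}.$$

The Lagrangean function is

$$R^*(\mathbf{x}, \boldsymbol{\lambda}) = \mathbf{a}'\mathbf{x} - \mathbf{x}'\mathbf{A}\mathbf{x} + \boldsymbol{\lambda}'\mathbf{C}\mathbf{x}.$$

Note the dimensions and arrangement of the various parts. In particular, $\mathbf{C}$ is a $2 \times 3$ matrix, with one row for each constraint and one column for each variable in the objective function. The vector of Lagrange multipliers thus has two elements, one for each constraint. The necessary conditions are

$$\mathbf{a} - 2\mathbf{A}\mathbf{x} + \mathbf{C}'\boldsymbol{\lambda} = \mathbf{0} \quad \text{(three equations)} \tag{A-146}$$

and

$$\mathbf{C}\mathbf{x} = \mathbf{0} \quad \text{(two equations)}.$$

These may be combined in the single equation

$$\begin{bmatrix} -2\mathbf{A} & \mathbf{C}' \\ \mathbf{C} & \mathbf{0} \end{bmatrix}\begin{bmatrix} \mathbf{x} \\ \boldsymbol{\lambda} \end{bmatrix} = \begin{bmatrix} -\mathbf{a} \\ \mathbf{0} \end{bmatrix}.$$

Using the partitioned inverse of (A-74) produces the solutions

$$\boldsymbol{\lambda} = -[\mathbf{C}\mathbf{A}^{-1}\mathbf{C}']^{-1}\mathbf{C}\mathbf{A}^{-1}\mathbf{a} \tag{A-147}$$

and

$$\mathbf{x} = \frac{1}{2}\mathbf{A}^{-1}[\mathbf{I} - \mathbf{C}'(\mathbf{C}\mathbf{A}^{-1}\mathbf{C}')^{-1}\mathbf{C}\mathbf{A}^{-1}]\mathbf{a}. \tag{A-148}$$

The two results, (A-147) and (A-148), yield analytic solutions for $\boldsymbol{\lambda}$ and $\mathbf{x}$. For the specific matrices and vectors of the example, these are $\boldsymbol{\lambda} = [-0.5 \;\; -7.5]'$, and the constrained solution vector, $\mathbf{x}^* = [1.5 \;\; 0 \;\; -1.5]'$. Note that in computing the solution to this sort of problem, it is not necessary to use the rather cumbersome form of (A-148). Once $\boldsymbol{\lambda}$ is obtained from (A-147), the solution can be inserted in (A-146) for a much simpler computation. The solution

$$\mathbf{x} = \frac{1}{2}\mathbf{A}^{-1}\mathbf{a} + \frac{1}{2}\mathbf{A}^{-1}\mathbf{C}'\boldsymbol{\lambda}$$

suggests a useful result for the constrained optimum:

$$\text{constrained solution} = \text{unconstrained solution} + [2\mathbf{A}]^{-1}\mathbf{C}'\boldsymbol{\lambda}. \tag{A-149}$$

Finally, by inserting the two solutions in the original function, we find that $R = 24.375$ and $R^* = 2.25$, which illustrates again that the constrained solution (in this maximization problem) is inferior to the unconstrained solution.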
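As a numerical check on the constrained example, the sketch below stacks the necessary conditions into the single bordered system shown above and solves it directly, rather than using the partitioned-inverse formulas (A-147) and (A-148). The `solve` helper is a plain Gauss-Jordan routine written for this illustration.

```python
def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [[float(v) for v in row] + [float(bi)] for row, bi in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

a = [5.0, 4.0, 2.0]
A = [[2.0, 1.0, 3.0], [1.0, 3.0, 2.0], [3.0, 2.0, 5.0]]
C = [[1.0, -1.0, 1.0], [1.0, 1.0, 1.0]]
n, J = 3, 2

# Bordered system: [[-2A, C'], [C, 0]] [x; lambda] = [-a; 0].
K = [[-2.0 * A[i][j] for j in range(n)] + [C[k][i] for k in range(J)]
     for i in range(n)]
K += [C[k][:] + [0.0] * J for k in range(J)]
rhs = [-v for v in a] + [0.0] * J

solution = solve(K, rhs)
x_star, lam = solution[:n], solution[n:]

# Objective value R* = a'x - x'Ax at the constrained solution.
Ax = [sum(A[i][j] * x_star[j] for j in range(n)) for i in range(n)]
R_star = (sum(ai * xi for ai, xi in zip(a, x_star))
          - sum(xi * v for xi, v in zip(x_star, Ax)))
```

Solving the stacked system in one step avoids forming $\mathbf{A}^{-1}$ explicitly; the multipliers and the constrained solution agree with the values reported in the text.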

A.8.4 TRANSFORMATIONS

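If a function is strictly monotonic, then it is a one-to-one function. Each $y$ is associated with exactly one value of $x$, and vice versa. In this case, an inverse function exists, which expresses $x$ as a function of $y$, written

$$y = f(x) \quad \text{and} \quad x = f^{-1}(y).$$

An example is the inverse relationship between the log and the exponential functions. The slope of the inverse function,

$$J = \frac{dx}{dy} = \frac{d f^{-1}(y)}{dy} = f^{-1\prime}(y),$$

is the Jacobian of the transformation from $y$ to $x$. For example, if

$$y = a + bx,$$

then

$$x = -\frac{a}{b} + \left[\frac{1}{b}\right]y$$

is the inverse transformation and

$$J = \frac{dx}{dy} = \frac{1}{b}.$$

Looking ahead to the statistical application of this concept, we observe that if $y = f(x)$ were vertical, then this would no longer be a functional relationship. The same $x$ would be associated with more than one value of $y$. In this case, at this value of $x$, we would find that $J = 0$, indicating a singularity in the function.

If $\mathbf{y}$ is a column vector of functions, $\mathbf{y} = \mathbf{f}(\mathbf{x})$, then

$$\mathbf{J} = \frac{\partial \mathbf{x}}{\partial \mathbf{y}'} = \begin{bmatrix} \partial x_1/\partial y_1 & \partial x_1/\partial y_2 & \cdots & \partial x_1/\partial y_n \\ \partial x_2/\partial y_1 & \partial x_2/\partial y_2 & \cdots & \partial x_2/\partial y_n \\ \vdots & \vdots & & \vdots \\ \partial x_n/\partial y_1 & \partial x_n/\partial y_2 & \cdots & \partial x_n/\partial y_n \end{bmatrix}.$$

Consider the set of linear functions $\mathbf{y} = \mathbf{A}\mathbf{x} = \mathbf{f}(\mathbf{x})$. The inverse transformation is $\mathbf{x} = \mathbf{f}^{-1}(\mathbf{y})$, which will be

$$\mathbf{x} = \mathbf{A}^{-1}\mathbf{y},$$

if $\mathbf{A}$ is nonsingular. If $\mathbf{A}$ is singular, then there is no inverse transformation. Let $\mathbf{J}$ be the matrix of partial derivatives of the inverse functions:

$$\mathbf{J} = \left[\frac{\partial x_i}{\partial y_j}\right].$$

The absolute value of the determinant of $\mathbf{J}$,

$$\text{abs}(|\mathbf{J}|) = \text{abs}\left(\left|\frac{\partial \mathbf{x}}{\partial \mathbf{y}'}\right|\right),$$

is the Jacobian determinant of the transformation from $\mathbf{y}$ to $\mathbf{x}$. In the nonsingular case,

$$\text{abs}(|\mathbf{J}|) = \text{abs}(|\mathbf{A}^{-1}|) = \frac{1}{\text{abs}(|\mathbf{A}|)}.$$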


In the singular case, the matrix of partial derivatives will be singular and the determinant of the Jacobian will be zero. In this instance, the singular Jacobian implies that $\mathbf{A}$ is singular or, equivalently, that the transformations from $\mathbf{x}$ to $\mathbf{y}$ are functionally dependent. The singular case is analogous to the single-variable case.

Clearly, if the vector $\mathbf{x}$ is given, then $\mathbf{y} = \mathbf{A}\mathbf{x}$ can be computed from $\mathbf{x}$. Whether $\mathbf{x}$ can be deduced from $\mathbf{y}$ is another question. Evidently, it depends on the Jacobian. If the Jacobian is not zero, then the inverse transformations exist, and we can obtain $\mathbf{x}$. If not, then we cannot obtain $\mathbf{x}$.
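For a linear transformation the Jacobian determinant can be computed directly. The following minimal sketch uses a hypothetical $2 \times 2$ matrix (not taken from the text) to confirm that $\text{abs}(|\mathbf{J}|) = 1/\text{abs}(|\mathbf{A}|)$ in the nonsingular case and that no inverse transformation is available in the singular case.

```python
def det2(M):
    """Determinant of a 2 x 2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inverse2(M):
    """Inverse of a 2 x 2 matrix; raises if the matrix is singular."""
    d = det2(M)
    if d == 0:
        raise ValueError("singular matrix: no inverse transformation exists")
    return [[M[1][1] / d, -M[0][1] / d],
            [-M[1][0] / d, M[0][0] / d]]

A = [[2.0, 1.0], [1.0, 3.0]]     # hypothetical nonsingular example, |A| = 5
J = inverse2(A)                  # J = dx/dy' = A^{-1} for x = A^{-1} y
abs_det_J = abs(det2(J))         # Jacobian determinant of the transformation

A_singular = [[1.0, 2.0], [2.0, 4.0]]   # second row is twice the first
```

Calling `inverse2(A_singular)` raises an error, which mirrors the statement that $\mathbf{x}$ cannot be recovered from $\mathbf{y}$ when the Jacobian is zero.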

APPENDIX B

PROBABILITY AND DISTRIBUTION THEORY

B.1 INTRODUCTION

This appendix reviews the distribution theory used later in the book. Since a previous course in statistics is assumed, most of the results will be stated without proof. The more advanced results in the later sections will be developed in greater detail.

B.2 RANDOM VARIABLES

We view our observation on some aspect of the economy as the outcome of a random process, which is almost never under our (the analyst's) control. In the current literature, the descriptive (and perspective-laden) term data generating process, or DGP, is often used for this underlying mechanism. The observed (measured) outcomes of the process are assigned unique numeric values. The assignment is one to one; each outcome gets one value, and no two distinct outcomes receive the same value. This outcome variable, $X$, is a random variable because, until the data are actually observed, it is uncertain what value $X$ will take. Probabilities are associated with outcomes to quantify this uncertainty. We usually use capital letters for the "name" of a random variable and lowercase letters for the values it takes. Thus, the probability that $X$ takes a particular value $x$ might be denoted $\text{Prob}(X = x)$.

A random variable is discrete if the set of outcomes is either finite in number or countably infinite. The random variable is continuous if the set of outcomes is infinitely divisible and, hence, not countable. These definitions will correspond to the types of data we observe in practice. Counts of occurrences will provide observations on discrete random variables, whereas measurements such as time or income will give observations on continuous random variables.

B.2.1 PROBABILITY DISTRIBUTIONS

A listing of the values $x$ taken by a random variable $X$ and their associated probabilities is a probability distribution, $f(x)$. For a discrete random variable,

$$f(x) = \text{Prob}(X = x). \tag{B-1}$$


The axioms of probability require that

1. $0 \le \text{Prob}(X = x) \le 1$. (B-2)
2. $\sum_x f(x) = 1$. (B-3)

For the continuous case, the probability associated with any particular point is zero, and we can only assign positive probabilities to intervals in the range of $x$. The probability density function (pdf) is defined so that $f(x) \ge 0$ and

1. $\text{Prob}(a \le x \le b) = \int_a^b f(x)\,dx \ge 0$. (B-4)

This result is the area under $f(x)$ in the range from $a$ to $b$. For a continuous variable,

2. $\int_{-\infty}^{+\infty} f(x)\,dx = 1$. (B-5)

If the range of $x$ is not infinite, then it is understood that $f(x) = 0$ anywhere outside the appropriate range. Since the probability associated with any individual point is 0,

$$\text{Prob}(a \le x \le b) = \text{Prob}(a \le x < b) = \text{Prob}(a < x \le b) = \text{Prob}(a < x < b).$$

B.2.2 CUMULATIVE DISTRIBUTION FUNCTION

For any random variable $X$, the probability that $X$ is less than or equal to $a$ is denoted $F(a)$. $F(x)$ is the cumulative distribution function (cdf). For a discrete random variable,

$$F(x) = \sum_{X \le x} f(X) = \text{Prob}(X \le x). \tag{B-6}$$

In view of the definition of $f(x)$,

$$f(x_i) = F(x_i) - F(x_{i-1}). \tag{B-7}$$

For a continuous random variable,

$$F(x) = \int_{-\infty}^{x} f(t)\,dt \tag{B-8}$$

and

$$f(x) = \frac{dF(x)}{dx}. \tag{B-9}$$

In both the continuous and discrete cases, $F(x)$ must satisfy the following properties:

1. $0 \le F(x) \le 1$.
2. If $x > y$, then $F(x) \ge F(y)$.
3. $F(+\infty) = 1$.
4. $F(-\infty) = 0$.

From the definition of the cdf,

$$\text{Prob}(a < x \le b) = F(b) - F(a). \tag{B-10}$$

Any valid pdf will imply a valid cdf, so there is no need to verify these conditions separately.
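The relationships between a discrete probability distribution and its cdf can be illustrated with exact arithmetic. The sketch below assumes a fair six-sided die as the random variable (an illustrative choice, not an example from the text) and checks (B-3), (B-7), and (B-10).

```python
from fractions import Fraction

# Illustrative assumption: X is the face shown by a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x):
    """F(x) = Prob(X <= x): sum f over all support points not exceeding x."""
    return sum((p for value, p in pmf.items() if value <= x), Fraction(0))

total = sum(pmf.values())       # (B-3): probabilities sum to one
step = cdf(4) - cdf(3)          # (B-7): F(x_i) - F(x_{i-1}) recovers f(x_i)
prob_2_to_5 = cdf(5) - cdf(2)   # (B-10): Prob(2 < X <= 5)
```

Using `Fraction` keeps every probability exact, so the identities hold without rounding error.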


B.3 EXPECTATIONS OF A RANDOM VARIABLE

DEFINITION B.1 Mean of a Random Variable
The mean, or expected value, of a random variable is

$$E[x] = \begin{cases} \sum_x x f(x) & \text{if } x \text{ is discrete,} \\ \int_x x f(x)\,dx & \text{if } x \text{ is continuous.} \end{cases} \tag{B-11}$$

The notation $\sum_x$ or $\int_x$, used henceforth, means the sum or integral over the entire range of values of $x$. The mean is usually denoted $\mu$. It is a weighted average of the values taken by $x$, where the weights are the respective probabilities. It is not necessarily a value actually taken by the random variable. For example, the expected number of heads in one toss of a fair coin is $\tfrac{1}{2}$.

Other measures of central tendency are the median, which is the value $m$ such that $\text{Prob}(X \le m) \ge \tfrac{1}{2}$ and $\text{Prob}(X \ge m) \ge \tfrac{1}{2}$, and the mode, which is the value of $x$ at which $f(x)$ takes its maximum. The first of these measures is more frequently used than the second. Loosely speaking, the median corresponds more closely than the mean to the middle of a distribution. It is unaffected by extreme values. In the discrete case, the modal value of $x$ has the highest probability of occurring.

Let $g(x)$ be a function of $x$. The function that gives the expected value of $g(x)$ is denoted

$$E[g(x)] = \begin{cases} \sum_x g(x)\,\text{Prob}(X = x) & \text{if } X \text{ is discrete,} \\ \int_x g(x) f(x)\,dx & \text{if } X \text{ is continuous.} \end{cases} \tag{B-12}$$

If $g(x) = a + bx$ for constants $a$ and $b$, then

$$E[a + bx] = a + bE[x].$$

An important case is the expected value of a constant $a$, which is just $a$.

DEFINITION B.2 Variance of a Random Variable
The variance of a random variable is

$$\text{Var}[x] = E[(x - \mu)^2] = \begin{cases} \sum_x (x - \mu)^2 f(x) & \text{if } x \text{ is discrete,} \\ \int_x (x - \mu)^2 f(x)\,dx & \text{if } x \text{ is continuous.} \end{cases} \tag{B-13}$$

$\text{Var}[x]$, which must be positive, is usually denoted $\sigma^2$. This function is a measure of the dispersion of a distribution. Computation of the variance is simplified by using the following


important result:

$$\text{Var}[x] = E[x^2] - \mu^2. \tag{B-14}$$

A convenient corollary to (B-14) is

$$E[x^2] = \sigma^2 + \mu^2. \tag{B-15}$$

By inserting $y = a + bx$ in (B-13) and expanding, we find that

$$\text{Var}[a + bx] = b^2\,\text{Var}[x], \tag{B-16}$$

which implies, for any constant $a$, that

$$\text{Var}[a] = 0. \tag{B-17}$$

To describe a distribution, we usually use $\sigma$, the positive square root, which is the standard deviation of $x$. The standard deviation can be interpreted as having the same units of measurement as $x$ and $\mu$. For any random variable $x$ and any positive constant $k$, the Chebychev inequality states that

$$\text{Prob}(\mu - k\sigma \le x \le \mu + k\sigma) \ge 1 - \frac{1}{k^2}. \tag{B-18}$$

Two other measures often used to describe a probability distribution are

$$\text{skewness} = E[(x - \mu)^3]$$

and

$$\text{kurtosis} = E[(x - \mu)^4].$$

Skewness is a measure of the asymmetry of a distribution. For symmetric distributions,

$$f(\mu - x) = f(\mu + x)$$

and

$$\text{skewness} = 0.$$

For asymmetric distributions, the skewness will be positive if the "long tail" is in the positive direction. Kurtosis is a measure of the thickness of the tails of the distribution. A shorthand expression for other central moments is

$$\mu_r = E[(x - \mu)^r].$$

Since $\mu_r$ tends to explode as $r$ grows, the normalized measure, $\mu_r/\sigma^r$, is often used for description. Two common measures are

$$\text{skewness coefficient} = \frac{\mu_3}{\sigma^3}$$

and

$$\text{degree of excess} = \frac{\mu_4}{\sigma^4} - 3.$$

The second is based on the normal distribution, which has excess of zero.

For any two functions $g_1(x)$ and $g_2(x)$,

$$E[g_1(x) + g_2(x)] = E[g_1(x)] + E[g_2(x)]. \tag{B-19}$$

For the general case of a possibly nonlinear $g(x)$,

$$E[g(x)] = \int_x g(x) f(x)\,dx \tag{B-20}$$


and

$$\text{Var}[g(x)] = \int_x \left(g(x) - E[g(x)]\right)^2 f(x)\,dx. \tag{B-21}$$
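The moment results of this section can be verified for a simple discrete distribution. The sketch below again assumes a fair six-sided die (an illustrative choice) and checks the variance shortcut (B-14), the scaling rule (B-16), the zero skewness of a symmetric distribution, and the Chebychev bound (B-18) for one hypothetical value of $k$.

```python
from fractions import Fraction
import math

# Illustrative assumption: X is the face shown by a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expect(g):
    """E[g(x)] for the discrete distribution, as in (B-12)."""
    return sum(g(x) * p for x, p in pmf.items())

mu = expect(lambda x: x)
var = expect(lambda x: (x - mu) ** 2)                 # definition (B-13)
var_shortcut = expect(lambda x: x * x) - mu ** 2      # (B-14)

a, b = 3, -2                                          # arbitrary constants
var_linear = expect(lambda x: (a + b * x - (a + b * mu)) ** 2)  # (B-16)

skewness = expect(lambda x: (x - mu) ** 3)            # zero by symmetry

sigma = math.sqrt(var)
k = 1.5                                               # hypothetical choice of k
prob_within = sum(p for x, p in pmf.items() if abs(x - mu) <= k * sigma)
chebychev_bound = 1 - 1 / k ** 2
```

The Chebychev bound holds for any distribution with finite variance, so it is typically far from tight; here the interval covers the whole support.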

(For convenience, we shall omit the equivalent definitions for discrete variables in the following discussion and use the integral to mean either integration or summation, whichever is appropriate.)

A device used to approximate $E[g(x)]$ and $\text{Var}[g(x)]$ is the linear Taylor series approximation:

$$g(x) \approx [g(x^0) - g'(x^0)x^0] + g'(x^0)x = \beta_1 + \beta_2 x = g^*(x). \tag{B-22}$$

If the approximation is reasonably accurate, then the mean and variance of $g^*(x)$ will be approximately equal to the mean and variance of $g(x)$. A natural choice for the expansion point is $x^0 = \mu = E(x)$. Inserting this value in (B-22) gives

$$g(x) \approx [g(\mu) - g'(\mu)\mu] + g'(\mu)x, \tag{B-23}$$

so that

$$E[g(x)] \approx g(\mu) \tag{B-24}$$

and

$$\text{Var}[g(x)] \approx [g'(\mu)]^2\,\text{Var}[x]. \tag{B-25}$$

A point to note in view of (B-22) to (B-24) is that $E[g(x)]$ will generally not equal $g(E[x])$. For the special case in which $g(x)$ is concave, that is, where $g''(x) < 0$, we know from Jensen's inequality that $E[g(x)] \le g(E[x])$. For example, $E[\log(x)] \le \log(E[x])$.
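A small numerical case shows how the approximations (B-24) and (B-25) compare with exact moments, and how Jensen's inequality appears for a concave function. The sketch below assumes $g(x) = \ln x$ and a hypothetical three-point distribution for $x$; neither is taken from the text.

```python
import math

# Hypothetical discrete distribution for x (values and probabilities assumed).
values = [9.0, 10.0, 11.0]
probs = [0.25, 0.50, 0.25]

mu = sum(x * p for x, p in zip(values, probs))
var_x = sum((x - mu) ** 2 * p for x, p in zip(values, probs))

# Exact mean and variance of g(x) = ln(x).
exact_mean = sum(math.log(x) * p for x, p in zip(values, probs))
exact_var = sum((math.log(x) - exact_mean) ** 2 * p
                for x, p in zip(values, probs))

# Linear Taylor series approximations around the mean, with g'(x) = 1/x.
approx_mean = math.log(mu)              # (B-24): g(mu)
approx_var = (1.0 / mu) ** 2 * var_x    # (B-25): [g'(mu)]^2 Var[x]
```

Because the logarithm is concave, the exact mean falls below $\ln(\mu)$, as Jensen's inequality requires; the approximation is close here only because the distribution is tightly concentrated around its mean.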

B.4 SOME SPECIFIC PROBABILITY DISTRIBUTIONS

Certain experimental situations naturally give rise to specific probability distributions. In the majority of cases in economics, however, the distributions used are merely models of the observed phenomena. Although the normal distribution, which we shall discuss at length, is the mainstay of econometric research, economists have used a wide variety of other distributions. A few are discussed here.¹

B.4.1 THE NORMAL DISTRIBUTION

The general form of a normal distribution with mean $\mu$ and standard deviation $\sigma$ is

$$f(x \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}[(x - \mu)^2/\sigma^2]}. \tag{B-26}$$

This result is usually denoted $x \sim N[\mu, \sigma^2]$. The standard notation $x \sim f(x)$ is used to state that "$x$ has probability distribution $f(x)$." Among the most useful properties of the normal distribution is its preservation under linear transformation:

$$\text{If } x \sim N[\mu, \sigma^2], \text{ then } (a + bx) \sim N[a + b\mu,\; b^2\sigma^2]. \tag{B-27}$$

One particularly convenient transformation is $a = -\mu/\sigma$ and $b = 1/\sigma$. The resulting variable $z = (x - \mu)/\sigma$ has the standard normal distribution, denoted $N[0, 1]$, with density

$$\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}. \tag{B-28}$$

The specific notation $\phi(z)$ is often used for this distribution and $\Phi(z)$ for its cdf. It follows from the definitions above that if $x \sim N[\mu, \sigma^2]$, then

$$f(x) = \frac{1}{\sigma}\,\phi\!\left[\frac{x - \mu}{\sigma}\right].$$

Figure B.1 shows the densities of the standard normal distribution and the normal distribution with mean 0.5, which shifts the distribution to the right, and standard deviation 1.3, which, it can be seen, scales the density so that it is shorter but wider. (The graph is a bit deceiving unless you look closely; both densities are symmetric.)

Tables of the standard normal cdf appear in most statistics and econometrics textbooks. Because the form of the distribution does not change under a linear transformation, it is not necessary to tabulate the distribution for other values of $\mu$ and $\sigma$. For any normally distributed variable,

$$\text{Prob}(a \le x \le b) = \text{Prob}\!\left(\frac{a - \mu}{\sigma} \le \frac{x - \mu}{\sigma} \le \frac{b - \mu}{\sigma}\right), \tag{B-29}$$

which can always be read from a table of the standard normal distribution. In addition, because the distribution is symmetric, $\Phi(-z) = 1 - \Phi(z)$. Hence, it is not necessary to tabulate both the negative and positive halves of the distribution.

¹ A much more complete listing appears in Maddala (1977a, Chaps. 3 and 18) and in most mathematical statistics textbooks. See also Poirier (1995) and Stuart and Ord (1989). Another useful reference is Evans, Hastings, and Peacock (1993). Johnson et al. (1970, 1974, 1993) is an encyclopedic reference on the subject of statistical distributions.
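In place of a printed table, the standard normal cdf can be evaluated with the error function, using the identity $\Phi(z) = \tfrac{1}{2}[1 + \text{erf}(z/\sqrt{2})]$. The sketch below applies the standardization in (B-29); the interval endpoints are hypothetical, and the mean and standard deviation are those of the second density in Figure B.1.

```python
import math

def std_normal_cdf(z):
    """Phi(z), computed from the error function in the standard library."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_interval_prob(a, b, mu, sigma):
    """Prob(a <= x <= b) for x ~ N[mu, sigma^2], by standardizing as in (B-29)."""
    return std_normal_cdf((b - mu) / sigma) - std_normal_cdf((a - mu) / sigma)

# Hypothetical interval [0, 1] for a variable with mean 0.5 and std. dev. 1.3.
p = normal_interval_prob(0.0, 1.0, mu=0.5, sigma=1.3)

# Symmetry: Phi(-z) = 1 - Phi(z).
symmetry_gap = std_normal_cdf(-1.3) - (1.0 - std_normal_cdf(1.3))
```

Only the standard normal cdf is needed, whatever the values of $\mu$ and $\sigma$, which is the point of the invariance under linear transformation.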

FIGURE B.1 The Normal Distribution. [Figure omitted: f1 = Normal[0, 1] and f2 = Normal[0.5, 1.3] densities; vertical axis Density, horizontal axis Z.]

B.4.2 THE CHI-SQUARED, t, AND F DISTRIBUTIONS

The chi-squared, $t$, and $F$ distributions are derived from the normal distribution. They arise in econometrics as sums of $n$ or $n_1$ and $n_2$ other variables. These three distributions have associated with them one or two "degrees of freedom" parameters, which for our purposes will be the number of variables in the relevant sum.

The first of the essential results is

• If $z \sim N[0, 1]$, then $x = z^2 \sim$ chi-squared[1], that is, chi-squared with one degree of freedom, denoted

$$z^2 \sim \chi^2[1]. \tag{B-30}$$

This distribution is skewed, with mean 1 and variance 2. The second is

• If $x_1, \ldots, x_n$ are $n$ independent chi-squared[1] variables, then

$$\sum_{i=1}^{n} x_i \sim \text{chi-squared}[n]. \tag{B-31}$$

The mean and variance of a chi-squared variable with $n$ degrees of freedom are $n$ and $2n$, respectively. A number of useful corollaries can be derived using (B-30) and (B-31).

• If $z_i$, $i = 1, \ldots, n$, are independent $N[0, 1]$ variables, then

$$\sum_{i=1}^{n} z_i^2 \sim \chi^2[n]. \tag{B-32}$$

• If $z_i$, $i = 1, \ldots, n$, are independent $N[0, \sigma^2]$ variables, then

$$\sum_{i=1}^{n} (z_i/\sigma)^2 \sim \chi^2[n]. \tag{B-33}$$

• If $x_1$ and $x_2$ are independent chi-squared variables with $n_1$ and $n_2$ degrees of freedom, respectively, then

$$x_1 + x_2 \sim \chi^2[n_1 + n_2]. \tag{B-34}$$

This result can be generalized to the sum of an arbitrary number of independent chi-squared variables.

Figure B.2 shows the chi-squared density for three degrees of freedom. The amount of skewness declines as the number of degrees of freedom rises. Unlike the normal distribution, a separate table is required for the chi-squared distribution for each value of $n$. Typically, only a few percentage points of the distribution are tabulated for each $n$. Table G.3 in Appendix G of this book gives upper (right) tail areas for a number of values.
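The statement that a sum of $n$ squared independent standard normal variables has mean $n$ and variance $2n$ can be checked by simulation. The following rough sketch uses $n = 3$, as in Figure B.2; the seed and the number of draws are arbitrary choices, and the sample moments only approximate the population values.

```python
import random

random.seed(12345)      # arbitrary seed, fixed for reproducibility
n, draws = 3, 20000     # degrees of freedom and number of simulated values

# Each sample is a sum of n squared independent N[0, 1] draws, as in (B-32).
samples = [sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))
           for _ in range(draws)]

sample_mean = sum(samples) / draws
sample_var = sum((s - sample_mean) ** 2 for s in samples) / draws
```

With this many draws, the sample mean should be close to 3 and the sample variance close to 6, up to simulation noise.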



• If $x_1$ and $x_2$ are two independent chi-squared variables with degrees of freedom parameters $n_1$ and $n_2$, respectively, then the ratio

$$F[n_1, n_2] = \frac{x_1/n_1}{x_2/n_2} \tag{B-35}$$

has the $F$ distribution with $n_1$ and $n_2$ degrees of freedom.

The two degrees of freedom parameters $n_1$ and $n_2$ are the numerator and denominator degrees of freedom, respectively. Tables of the $F$ distribution must be computed for each pair of values of $(n_1, n_2)$. As such, only one or two specific values, such as the 95 percent and 99 percent upper tail values, are tabulated in most cases.


FIGURE B.2 The Chi-squared[3] Distribution. [Figure omitted: chi-squared[3] density; vertical axis Density, horizontal axis X.]



• If $z$ is an $N[0, 1]$ variable and $x$ is $\chi^2[n]$ and is independent of $z$, then the ratio

$$t[n] = \frac{z}{\sqrt{x/n}} \tag{B-36}$$

has the $t$ distribution with $n$ degrees of freedom.

The $t$ distribution has the same shape as the normal distribution but has thicker tails. Figure B.3 illustrates the $t$ distributions with three and 10 degrees of freedom with the standard normal distribution. Two effects that can be seen in the figure are how the distribution changes as the degrees of freedom increases, and, overall, the similarity of the $t$ distribution to the standard normal. This distribution is tabulated in the same manner as the chi-squared distribution, with several specific cutoff points corresponding to specified tail areas for various values of the degrees of freedom parameter.

Comparing (B-35) with $n_1 = 1$ and (B-36), we see the useful relationship between the $t$ and $F$ distributions:

• If $t \sim t[n]$, then $t^2 \sim F[1, n]$.

If the normal variables involved have nonzero means, then the sums of squares are noncentral chi-squared variables and the ratio has a noncentral $F$ distribution. These distributions arise as follows.

1. Noncentral chi-squared distribution. If $z$ has a normal distribution with mean $\mu$ and standard deviation 1, then the distribution of $z^2$ is noncentral chi-squared with parameters 1 and $\mu^2/2$. If $\mu$ equals zero, then the familiar central chi-squared distribution results. The extensions that will enable us to deduce the distribution of $F$ when the restrictions do not hold in the population are:
   a. If $\mathbf{z} \sim N[\boldsymbol{\mu}, \boldsymbol{\Sigma}]$ with $J$ elements, then $\mathbf{z}'\boldsymbol{\Sigma}^{-1}\mathbf{z}$ has a noncentral chi-squared distribution with $J$ degrees of freedom and noncentrality parameter $\boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}/2$, which we denote $\chi^2_*[J, \boldsymbol{\mu}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}/2]$.
   b. If $\mathbf{z} \sim N[\boldsymbol{\mu}, \mathbf{I}]$ and $\mathbf{M}$ is an idempotent matrix with rank $J$, then $\mathbf{z}'\mathbf{M}\mathbf{z} \sim \chi^2_*[J, \boldsymbol{\mu}'\mathbf{M}\boldsymbol{\mu}/2]$.


FIGURE B.3 The Standard Normal, t[3], and t[10] Distributions. [Figure omitted: Normal[0, 1], t[3], and t[10] densities; vertical axis Density, horizontal axis Z.]

2. Noncentral $F$ distribution. If $X_1$ has a noncentral chi-squared distribution with noncentrality parameter $\lambda$ and degrees of freedom $n_1$ and $X_2$ has a central chi-squared distribution with degrees of freedom $n_2$ and is independent of $X_1$, then

$$F_* = \frac{X_1/n_1}{X_2/n_2}$$

has a noncentral $F$ distribution with parameters $n_1$, $n_2$, and $\lambda$.² Note that in each of these cases, the statistic and the distribution are the familiar ones, except that the effect of the nonzero mean, which induces the noncentrality, is to push the distribution to the right.

² The denominator chi-squared could also be noncentral, but we shall not use any statistics with doubly noncentral distributions.

B.4.3 DISTRIBUTIONS WITH LARGE DEGREES OF FREEDOM

The chi-squared, $t$, and $F$ distributions usually arise in connection with sums of sample observations. The degrees of freedom parameter in each case grows with the number of observations. We often deal with larger degrees of freedom than are shown in the tables. Thus, the standard tables are often inadequate. In all cases, however, there are limiting distributions that we can use when the degrees of freedom parameter grows large. The simplest case is the $t$ distribution. The $t$ distribution with infinite degrees of freedom is equivalent to the standard normal distribution. Beyond about 100 degrees of freedom, they are almost indistinguishable.

For degrees of freedom greater than 30, a reasonably good approximation for the distribution of the chi-squared variable $x$ is

$$z = (2x)^{1/2} - (2n - 1)^{1/2}, \tag{B-37}$$

which is approximately standard normally distributed. Thus,

$$\text{Prob}(\chi^2[n] \le a) \approx \Phi[(2a)^{1/2} - (2n - 1)^{1/2}].$$

As used in econometrics, the $F$ distribution with a large denominator degrees of freedom is common. As $n_2$ becomes infinite, the denominator of $F$ converges identically to one, so we can treat the variable

$$x = n_1 F \tag{B-38}$$

as a chi-squared variable with $n_1$ degrees of freedom. Since the numerator degree of freedom will typically be small, this approximation will suffice for the types of applications we are likely to encounter.³ If not, then the approximation given earlier for the chi-squared distribution can be applied to $n_1 F$.

B.4.4 SIZE DISTRIBUTIONS: THE LOGNORMAL DISTRIBUTION

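In modeling size distributions, such as the distribution of firm sizes in an industry or the distribution of income in a country, the lognormal distribution, denoted $LN[\mu, \sigma^2]$, has been particularly useful.⁴

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma x}\, e^{-\frac{1}{2}[(\ln x - \mu)/\sigma]^2}, \quad x > 0.$$

A lognormal variable $x$ has

$$E[x] = e^{\mu + \sigma^2/2}$$

and

$$\text{Var}[x] = e^{2\mu + \sigma^2}\left(e^{\sigma^2} - 1\right).$$

The relation between the normal and lognormal distributions is

$$\text{If } y \sim LN[\mu, \sigma^2], \text{ then } \ln y \sim N[\mu, \sigma^2].$$

A useful result for transformations is given as follows: If $x$ has a lognormal distribution with mean $\theta$ and variance $\lambda^2$, then

$$\ln x \sim N(\mu, \sigma^2), \quad \text{where } \mu = \ln\theta^2 - \tfrac{1}{2}\ln(\theta^2 + \lambda^2) \text{ and } \sigma^2 = \ln(1 + \lambda^2/\theta^2).$$

Since the normal distribution is preserved under linear transformation,

$$\text{if } y \sim LN[\mu, \sigma^2], \text{ then } \ln y^r \sim N[r\mu, r^2\sigma^2].$$

If $y_1$ and $y_2$ are independent lognormal variables with $y_1 \sim LN[\mu_1, \sigma_1^2]$ and $y_2 \sim LN[\mu_2, \sigma_2^2]$, then

$$y_1 y_2 \sim LN[\mu_1 + \mu_2,\; \sigma_1^2 + \sigma_2^2].$$

³ See Johnson and Kotz (1970) for other approximations.

⁴ A study of applications of the lognormal distribution appears in Aitchison and Brown (1969).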
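The two sets of lognormal formulas are inverses of one another, which gives a convenient consistency check. The sketch below maps hypothetical parameters $(\mu, \sigma^2)$ to the mean and variance $(\theta, \lambda^2)$ and back again; the function names and the starting values are assumptions made for illustration.

```python
import math

def lognormal_moments(mu, sigma2):
    """Mean and variance of x when ln x ~ N[mu, sigma2]."""
    mean = math.exp(mu + sigma2 / 2.0)
    var = math.exp(2.0 * mu + sigma2) * (math.exp(sigma2) - 1.0)
    return mean, var

def normal_params_from_moments(theta, lam2):
    """Recover (mu, sigma2) from the lognormal mean theta and variance lam2."""
    mu = math.log(theta ** 2) - 0.5 * math.log(theta ** 2 + lam2)
    sigma2 = math.log(1.0 + lam2 / theta ** 2)
    return mu, sigma2

# Hypothetical starting parameters.
theta, lam2 = lognormal_moments(0.5, 0.25)
mu_back, sigma2_back = normal_params_from_moments(theta, lam2)
```

The round trip returns the original parameters up to floating-point error, confirming that the transformation result is consistent with the stated mean and variance.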

B.4.5 THE GAMMA AND EXPONENTIAL DISTRIBUTIONS

    The gamma distribution has been used in a variety of settings including the study of income distribution5 and production functions 6 The general form of the distribution is f x P x P 1 ex P x 0 0 P 0 B 39

    Many familiar distributions are special cases including the exponential distribution P 1 and chi squared 1 P n The Erlang distribution results if P is a positive integer The mean 2 2 is P and the variance is P 2
    B 4 6 THE BETA DISTRIBUTION

    Distributions for models are often chosen on the basis of the range within which the random variable is constrained to vary The lognormal distribution for example is sometimes used to model a variable that is always nonnegative For a variable constrained between 0 and c 0 the beta distribution has proved useful Its density is f x x c
    1

    1

    x c

    1

    1 c

    B 40

    This functional form is extremely exible in the shapes it will accommodate It is symmetric if asymmetric otherwise and can be hump shaped or U shaped The mean is c and the variance is c2 1 2 The beta distribution has been applied in the study of labor force participation rates 7
    B 4 7 THE LOGISTIC DISTRIBUTION

    The normal distribution is ubiquitous in econometrics But researchers have found that for some microeconomic applications there does not appear to be enough mass in the tails of the normal distribution observations that a model based on normality would classify as unusual seem not to be very unusual at all One approach has been to use thicker tailed symmetric distributions The logistic distribution is one candidate the cdf for a logistic random variable is denoted F x The density is f x and 2 3
    B 4 8

    x

    1 1 e x

    x 1

    x The mean and variance of this random variable are zero

    DISCRETE RANDOM VARIABLES

    Modeling in economics frequently involves random variables that take integer values. In these cases, the distributions listed thus far only provide approximations that are sometimes quite inappropriate. We can build up a class of models for discrete random variables from the Bernoulli distribution for a single binomial outcome (trial):

    $$\text{Prob}(x = 1) = \alpha, \qquad \text{Prob}(x = 0) = 1 - \alpha,$$

    where $0 \le \alpha \le 1$. The modeling aspect of this specification would be the assumptions that the success probability $\alpha$ is constant from one trial to the next and that successive trials are independent. If so, then the distribution for $x$ successes in $n$ trials is the binomial distribution,

    $$\text{Prob}(X = x) = \binom{n}{x} \alpha^x (1 - \alpha)^{n - x}, \quad x = 0, 1, \ldots, n.$$

    [5] Salem and Mount (1974).
    [6] Greene (1980a).
    [7] Heckman and Willis (1976).

    [FIGURE B.4 The Poisson [3] Distribution. Probabilities (vertical axis, 0 to 0.25) plotted against x = 0, 1, ..., 11.]

    The mean and variance of $x$ are $n\alpha$ and $n\alpha(1 - \alpha)$, respectively. If the number of trials becomes large at the same time that the success probability becomes small so that the mean $n\alpha$ is stable, the limiting form of the binomial distribution is the Poisson distribution,

    $$\text{Prob}(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}.$$

    The Poisson distribution has seen wide use in econometrics in, for example, modeling patents, crime, recreation demand, and demand for health services.
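
    The limiting argument can be illustrated numerically. The sketch below holds the binomial mean fixed at 3 (an arbitrary choice that matches the Poisson [3] case) while the number of trials grows, and records the largest gap between the two probability functions. The function names and the grid of x values are assumptions made for the illustration.

```python
import math

def binomial_pmf(x, n, alpha):
    """Prob(X = x) for x successes in n independent trials."""
    return math.comb(n, x) * alpha**x * (1 - alpha) ** (n - x)

def poisson_pmf(x, lam):
    """Prob(X = x) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

lam = 3.0  # the mean n * alpha is held at this value

def max_gap(n):
    """Largest difference between the two pmfs over x = 0, ..., 20."""
    alpha = lam / n
    return max(abs(binomial_pmf(x, n, alpha) - poisson_pmf(x, lam)) for x in range(21))

gaps = {n: max_gap(n) for n in (30, 300, 3000)}
```

    The gaps shrink as n increases, which is the sense in which the Poisson distribution is the limiting form of the binomial.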

    B.5 THE DISTRIBUTION OF A FUNCTION OF A RANDOM VARIABLE

    We considered finding the expected value of a function of a random variable. It is fairly common to analyze the random variable itself, which results when we compute a function of some random variable. There are three types of transformation to consider. One discrete random variable may be transformed into another, a continuous variable may be transformed into a discrete one, and one continuous variable may be transformed into another.

    [FIGURE B.5 Censored Distribution. Upper panel: density of income, with interval boundaries a, b, c, d, e. Lower panel: relative frequency of the reported income categories 1 through 6.]

    The simplest case is the first one. The probabilities associated with the new variable are computed according to the laws of probability. If $y$ is derived from $x$ and the function is one to one, then the probability that $Y = y(x)$ equals the probability that $X = x$. If several values of $x$ yield the same value of $y$, then $\text{Prob}(Y = y)$ is the sum of the corresponding probabilities for $x$.

    The second type of transformation is illustrated by the way individual data on income are typically obtained in a survey. Income in the population can be expected to be distributed according to some skewed, continuous distribution such as the one shown in Figure B.5. Data are often reported categorically, as shown in the lower part of the figure. Thus, the random variable corresponding to observed income is a discrete transformation of the actual underlying continuous random variable. Suppose, for example, that the transformed variable $y$ is the mean income in the respective interval. Then

    $$\text{Prob}(Y = \mu_1) = P(-\infty < X \le a), \quad \text{Prob}(Y = \mu_2) = P(a < X \le b), \quad \text{Prob}(Y = \mu_3) = P(b < X \le c),$$

    and so on, which illustrates the general procedure.

    If $x$ is a continuous random variable with pdf $f_x(x)$ and if $y = g(x)$ is a continuous monotonic function of $x$, then the density of $y$ is obtained by using the change of variable technique to find the cdf of $y$:

    $$\text{Prob}(y \le b) = \int_{-\infty}^{b} f_x\big(g^{-1}(y)\big)\, \big|g^{-1\prime}(y)\big|\, dy.$$

    This equation can now be written as

    $$\text{Prob}(y \le b) = \int_{-\infty}^{b} f_y(y)\, dy.$$

    Hence

    $$f_y(y) = f_x\big(g^{-1}(y)\big)\, \big|g^{-1\prime}(y)\big|. \tag{B-41}$$

    To avoid the possibility of a negative pdf if $g(x)$ is decreasing, we use the absolute value of the derivative in the previous expression. The term $|g^{-1\prime}(y)|$ must be nonzero for the density of $y$ to be nonzero. In words, the probabilities associated with intervals in the range of $y$ must be associated with intervals in the range of $x$. If the derivative is zero, the correspondence $y = g(x)$ is vertical, and hence all values of $y$ in the given range are associated with the same value of $x$. This single point must have probability zero.

    One of the most useful applications of the preceding result is the linear transformation of a normally distributed variable. If $x \sim N[\mu, \sigma^2]$, then the distribution of

    $$y = \frac{x - \mu}{\sigma}$$

    is found using the result above. First, the derivative is obtained from the inverse transformation:

    $$y = \frac{x}{\sigma} - \frac{\mu}{\sigma} \;\Rightarrow\; x = \sigma y + \mu \;\Rightarrow\; g^{-1\prime}(y) = \frac{dx}{dy} = \sigma.$$

    Therefore,

    $$f_y(y) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-[(\sigma y + \mu) - \mu]^2 / (2\sigma^2)}\, |\sigma| = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}.$$

    This is the density of a normally distributed variable with mean zero and standard deviation one. It is this result that makes it unnecessary to have separate tables for the different normal distributions that result from different means and variances.
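
    The standardization result can be checked directly from the change of variable formula. The sketch below applies the formula to a normal variable with arbitrarily chosen mean 10 and standard deviation 2.5 (assumptions for the illustration) and compares the outcome with the standard normal density at a few points.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal variable with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

def density_of_standardized(y, mu, sigma):
    """Change of variable: f_y(y) = f_x(g^{-1}(y)) * |d g^{-1}(y) / dy| for y = (x - mu) / sigma."""
    x = sigma * y + mu      # inverse transformation
    jacobian = abs(sigma)   # derivative dx/dy
    return normal_pdf(x, mu, sigma) * jacobian

mu, sigma = 10.0, 2.5  # illustrative values
points = [-2.0, -0.5, 0.0, 1.0, 3.0]
gaps = [abs(density_of_standardized(y, mu, sigma) - normal_pdf(y, 0.0, 1.0)) for y in points]
```

    Every entry of `gaps` should be zero up to rounding error, whatever values of the mean and standard deviation are used.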

    B.6 REPRESENTATIONS OF A PROBABILITY DISTRIBUTION

    The probability density function (pdf) is a natural and familiar way to formulate the distribution of a random variable. But there are many other functions that are used to identify or characterize a random variable, depending on the setting. In each of these cases, we can identify some other function of the random variable that has a one-to-one relationship with the density. We have already used one of these quite heavily in the preceding discussion. For a random variable that has density function $f(x)$, the distribution function, or cdf, $F(x)$, is an equally informative function that identifies the distribution; the relationship between $f(x)$ and $F(x)$ is defined in (B-6) for a discrete random variable and (B-8) for a continuous one. We now consider several other related functions.

    For a continuous random variable, the survival function is $S(x) = 1 - F(x) = \text{Prob}[X \ge x]$. This function is widely used in epidemiology, where $x$ is time until some transition, such as recovery from a disease. The hazard function for a random variable is

    $$h(x) = \frac{f(x)}{S(x)} = \frac{f(x)}{1 - F(x)}.$$

    The hazard function is a conditional probability rate:

    $$h(x) = \lim_{t \downarrow 0} \frac{\text{Prob}(x \le X \le x + t \mid X \ge x)}{t}.$$

    Hazards have been used in econometrics in studying the duration of spells, or conditions, such as unemployment, strikes, time until business failures, and so on. The connection between the hazard and the other functions is $h(x) = -d \ln S(x)/dx$. As an exercise, you might want to verify the interesting special case of $h(x) = \lambda$, a constant: the only distribution that has this characteristic is the exponential distribution noted in Section B.4.5.

    For the random variable $X$, with probability density function $f(x)$, if the function $M(t) = E[e^{tx}]$ exists, then it is the moment-generating function (MGF). Assuming the function exists, it can be shown that

    $$\left. \frac{d^r M(t)}{dt^r} \right|_{t = 0} = E[x^r].$$

    The moment-generating function, like the survival and the hazard functions, is a unique characterization of a probability distribution. When it exists, the MGF has a one-to-one correspondence with the distribution. Thus, for example, if we begin with some random variable and find that a transformation of it has a particular MGF, then we may infer that the function of the random variable has the distribution associated with that MGF. A convenient application of this result is the MGF for the normal distribution. The MGF for the standard normal distribution is $M_z(t) = e^{t^2/2}$.

    A useful feature of MGFs is the following: if $x$ and $y$ are independent, then the MGF of $x + y$ is $M_x(t) M_y(t)$. This result has been used to establish the contagion property of some distributions, that is, the property that sums of random variables with a given distribution have that same distribution. The normal distribution is a familiar example. The property does not usually hold, but it does for Poisson and chi-squared random variables.

    One qualification of all of the preceding is that in order for these results to hold, the MGF must exist. It will for the distributions that we will encounter in our work, but in at least one important case, we cannot be sure of this. When computing sums of random variables that may have different distributions and whose specific distributions need not be so well behaved, it is likely that the MGF of the sum does not exist. However, the characteristic function, $\phi(t) = E[e^{itx}]$, will always exist, at least for relatively small $t$. The characteristic function is the device used to prove that certain sums of random variables converge to a normally distributed variable; that is, the characteristic function is a fundamental tool in proofs of the central limit theorem.
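
    The constant-hazard exercise can be explored numerically. The sketch below assumes the exponential density f(x) = lambda * exp(-lambda * x), the P = 1 case of the gamma family, with an arbitrary rate of 0.8, and evaluates the hazard both as f(x)/S(x) and as the negative derivative of ln S(x) (by a central difference). The function names and step size are illustrative choices.

```python
import math

lam = 0.8  # illustrative rate parameter

def f(x):
    """Exponential density."""
    return lam * math.exp(-lam * x)

def S(x):
    """Survival function, 1 - F(x)."""
    return math.exp(-lam * x)

def hazard(x):
    """h(x) = f(x) / S(x)."""
    return f(x) / S(x)

def hazard_from_log_survival(x, step=1e-6):
    """h(x) = -d ln S(x) / dx, approximated with a central difference."""
    return -(math.log(S(x + step)) - math.log(S(x - step))) / (2 * step)

xs = [0.1, 1.0, 5.0, 20.0]
direct = [hazard(x) for x in xs]
via_log = [hazard_from_log_survival(x) for x in xs]
```

    Both lists should equal the rate parameter at every x, up to rounding and differencing error, which illustrates the constant hazard of the exponential distribution.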


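    B.7 JOINT DISTRIBUTIONS

    The joint density function for two random variables $X$ and $Y$, denoted $f(x, y)$, is defined so that

    $$\text{Prob}(a \le x \le b,\ c \le y \le d) = \begin{cases} \displaystyle\sum_{a \le x \le b}\ \sum_{c \le y \le d} f(x, y) & \text{if } x \text{ and } y \text{ are discrete,} \\[2ex] \displaystyle\int_a^b \int_c^d f(x, y)\, dy\, dx & \text{if } x \text{ and } y \text{ are continuous.} \end{cases} \tag{B-42}$$

    The counterparts of the requirements for a univariate probability density are

    $$f(x, y) \ge 0, \qquad \begin{cases} \displaystyle\sum_x \sum_y f(x, y) = 1 & \text{if } x \text{ and } y \text{ are discrete,} \\[2ex] \displaystyle\int_x \int_y f(x, y)\, dy\, dx = 1 & \text{if } x \text{ and } y \text{ are continuous.} \end{cases} \tag{B-43}$$

    The cumulative probability is likewise the probability of a joint event:

    $$F(x, y) = \text{Prob}(X \le x,\ Y \le y) = \begin{cases} \displaystyle\sum_{X \le x}\ \sum_{Y \le y} f(x, y) & \text{in the discrete case,} \\[2ex] \displaystyle\int_{-\infty}^{x} \int_{-\infty}^{y} f(t, s)\, ds\, dt & \text{in the continuous case.} \end{cases} \tag{B-44}$$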

    B.7.1 MARGINAL DISTRIBUTIONS

    A marginal probability density or marginal probability distribution is defined with respect to an individual variable. To obtain the marginal distributions from the joint density, it is necessary to sum or integrate out the other variable:

    $$f_x(x) = \begin{cases} \displaystyle\sum_y f(x, y) & \text{in the discrete case,} \\[2ex] \displaystyle\int_y f(x, s)\, ds & \text{in the continuous case,} \end{cases} \tag{B-45}$$

    and similarly for $f_y(y)$.

    Two random variables are statistically independent if and only if their joint density is the product of the marginal densities:

    $$f(x, y) = f_x(x) f_y(y) \iff x \text{ and } y \text{ are independent.} \tag{B-46}$$

    If (and only if) $x$ and $y$ are independent, then the cdf factors as well as the pdf:

    $$F(x, y) = F_x(x) F_y(y), \tag{B-47}$$

    or

    $$\text{Prob}(X \le x,\ Y \le y) = \text{Prob}(X \le x)\, \text{Prob}(Y \le y).$$

    B.7.2 EXPECTATIONS IN A JOINT DISTRIBUTION

    The means, variances, and higher moments of the variables in a joint distribution are defined with respect to the marginal distributions. For the mean of $x$ in a discrete distribution,

    $$E[x] = \sum_x x f_x(x) = \sum_x x \left[ \sum_y f(x, y) \right] = \sum_x \sum_y x f(x, y). \tag{B-48}$$

    The means of the variables in a continuous distribution are defined likewise, using integration instead of summation:

    $$E[x] = \int_x x f_x(x)\, dx = \int_x \int_y x f(x, y)\, dy\, dx. \tag{B-49}$$

    Variances are computed in the same manner:

    $$\text{Var}[x] = \sum_x \big(x - E[x]\big)^2 f_x(x) = \sum_x \sum_y \big(x - E[x]\big)^2 f(x, y). \tag{B-50}$$

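    B.7.3 COVARIANCE AND CORRELATION

    For any function $g(x, y)$,

    $$E[g(x, y)] = \begin{cases} \displaystyle\sum_x \sum_y g(x, y) f(x, y) & \text{in the discrete case,} \\[2ex] \displaystyle\int_x \int_y g(x, y) f(x, y)\, dy\, dx & \text{in the continuous case.} \end{cases} \tag{B-51}$$

    The covariance of $x$ and $y$ is a special case:

    $$\text{Cov}[x, y] = E[(x - \mu_x)(y - \mu_y)] = E[xy] - \mu_x \mu_y = \sigma_{xy}. \tag{B-52}$$

    If $x$ and $y$ are independent, then $f(x, y) = f_x(x) f_y(y)$ and

    $$\sigma_{xy} = \sum_x \sum_y f_x(x) f_y(y)(x - \mu_x)(y - \mu_y) = \left[ \sum_x (x - \mu_x) f_x(x) \right] \left[ \sum_y (y - \mu_y) f_y(y) \right] = E[x - \mu_x]\, E[y - \mu_y] = 0.$$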


    The sign of the covariance will indicate the direction of covariation of $X$ and $Y$. Its magnitude depends on the scales of measurement, however. In view of this fact, a preferable measure is the correlation coefficient:

    $$r[x, y] = \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}, \tag{B-53}$$

    where $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$, respectively. The correlation coefficient has the same sign as the covariance but is always between $-1$ and $1$ and is thus unaffected by any scaling of the variables.

    Variables that are uncorrelated are not necessarily independent. For example, in the discrete distribution $f(-1, 1) = f(0, 0) = f(1, 1) = \frac{1}{3}$, the correlation is zero, but $f(1, 1)$ does not equal $f_x(1) f_y(1) = (\frac{1}{3})(\frac{2}{3})$. An important exception is the joint normal distribution discussed subsequently, in which lack of correlation does imply independence.

    Some general results regarding expectations in a joint distribution, which can be verified by applying the appropriate definitions, are

    $$E[ax + by + c] = aE[x] + bE[y] + c, \tag{B-54}$$

    $$\text{Var}[ax + by + c] = a^2 \text{Var}[x] + b^2 \text{Var}[y] + 2ab\, \text{Cov}[x, y] = \text{Var}[ax + by], \tag{B-55}$$

    and

    $$\text{Cov}[ax + by,\ cx + dy] = ac\, \text{Var}[x] + bd\, \text{Var}[y] + (ad + bc)\, \text{Cov}[x, y]. \tag{B-56}$$

    If $X$ and $Y$ are uncorrelated, then

    $$\text{Var}[x + y] = \text{Var}[x - y] = \text{Var}[x] + \text{Var}[y]. \tag{B-57}$$

    For any two functions $g_1(x)$ and $g_2(y)$, if $x$ and $y$ are independent, then

    $$E[g_1(x) g_2(y)] = E[g_1(x)]\, E[g_2(y)]. \tag{B-58}$$
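
    The three-point example of uncorrelated but dependent variables can be verified with exact arithmetic. The sketch below encodes that joint distribution, sums out each variable to obtain the marginals, and computes the covariance; the variable and function names are illustrative.

```python
from fractions import Fraction

third = Fraction(1, 3)
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}  # f(x, y) from the example

def marginal(index):
    """Sum the joint probabilities over the other variable."""
    out = {}
    for point, p in joint.items():
        out[point[index]] = out.get(point[index], 0) + p
    return out

fx, fy = marginal(0), marginal(1)
mean_x = sum(x * p for x, p in fx.items())
mean_y = sum(y * p for y, p in fy.items())
cov_xy = sum((x - mean_x) * (y - mean_y) * p for (x, y), p in joint.items())

product_at_1_1 = fx[1] * fy[1]  # what f(1, 1) would be under independence
```

    The covariance is exactly zero, yet the joint probability at (1, 1) is 1/3 while the product of the marginals there is 2/9, so the variables are not independent.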

    B.7.4 DISTRIBUTION OF A FUNCTION OF BIVARIATE RANDOM VARIABLES

    The result for a function of a random variable in (B-41) must be modified for a joint distribution. Suppose that $x_1$ and $x_2$ have a joint distribution $f_x(x_1, x_2)$ and that $y_1$ and $y_2$ are two monotonic functions of $x_1$ and $x_2$:

    $$y_1 = y_1(x_1, x_2), \qquad y_2 = y_2(x_1, x_2).$$

    Since the functions are monotonic, the inverse transformations,

    $$x_1 = x_1(y_1, y_2), \qquad x_2 = x_2(y_1, y_2),$$

    exist. The Jacobian determinant of the transformations is the determinant of the matrix of partial derivatives,

    $$J = \begin{vmatrix} \partial x_1/\partial y_1 & \partial x_1/\partial y_2 \\ \partial x_2/\partial y_1 & \partial x_2/\partial y_2 \end{vmatrix} = \left| \frac{\partial \mathbf{x}}{\partial \mathbf{y}'} \right|.$$

    The joint distribution of $y_1$ and $y_2$ is

    $$f_y(y_1, y_2) = f_x[x_1(y_1, y_2),\ x_2(y_1, y_2)]\, \text{abs}(|J|).$$

    The determinant must be nonzero for the transformation to exist. A zero determinant implies that the two transformations are functionally dependent.

    Certainly the most common application of the preceding in econometrics is the linear transformation of a set of random variables. Suppose that $x_1$ and $x_2$ are independently distributed $N[0, 1]$, and the transformations are

    $$y_1 = \alpha_1 + \beta_{11} x_1 + \beta_{12} x_2, \qquad y_2 = \alpha_2 + \beta_{21} x_1 + \beta_{22} x_2.$$

    To obtain the joint distribution of $y_1$ and $y_2$, we first write the transformations as $\mathbf{y} = \mathbf{a} + \mathbf{B}\mathbf{x}$. The inverse transformation is $\mathbf{x} = \mathbf{B}^{-1}(\mathbf{y} - \mathbf{a})$, so the absolute value of the Jacobian determinant is

    $$\text{abs}|J| = \text{abs}|\mathbf{B}^{-1}| = \frac{1}{\text{abs}|\mathbf{B}|}.$$

    The joint distribution of $\mathbf{x}$ is the product of the marginal distributions since they are independent. Thus,

    $$f_x(\mathbf{x}) = (2\pi)^{-1} e^{-(x_1^2 + x_2^2)/2} = (2\pi)^{-1} e^{-\mathbf{x}'\mathbf{x}/2}.$$

    Inserting the results for $\mathbf{x}(\mathbf{y})$ and $J$ into $f_y(y_1, y_2)$ gives

    $$f_y(\mathbf{y}) = (2\pi)^{-1} \frac{1}{\text{abs}|\mathbf{B}|}\, e^{-(\mathbf{y} - \mathbf{a})'(\mathbf{B}\mathbf{B}')^{-1}(\mathbf{y} - \mathbf{a})/2}.$$
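
    The linear transformation result can be checked numerically. The sketch below evaluates the density of y = a + Bx by the change of variable formula and compares it with a bivariate normal density whose mean is a and whose covariance matrix is BB'. The particular a and B, and all function names, are assumptions chosen for the illustration.

```python
import math

a = (1.0, -2.0)                 # illustrative intercepts
B = ((2.0, 0.5), (1.0, 1.5))    # illustrative coefficient matrix
det_B = B[0][0] * B[1][1] - B[0][1] * B[1][0]

def density_by_change_of_variable(y1, y2):
    """f_y(y) = f_x(B^{-1}(y - a)) * abs|B^{-1}| for independent N[0, 1] x1 and x2."""
    d1, d2 = y1 - a[0], y2 - a[1]
    x1 = (B[1][1] * d1 - B[0][1] * d2) / det_B    # first element of B^{-1}(y - a)
    x2 = (-B[1][0] * d1 + B[0][0] * d2) / det_B   # second element
    fx = math.exp(-(x1 * x1 + x2 * x2) / 2) / (2 * math.pi)
    return fx / abs(det_B)

def bivariate_normal_pdf(y1, y2):
    """Bivariate normal density with mean a and covariance matrix BB'."""
    s11 = B[0][0] ** 2 + B[0][1] ** 2
    s22 = B[1][0] ** 2 + B[1][1] ** 2
    s12 = B[0][0] * B[1][0] + B[0][1] * B[1][1]
    sd1, sd2 = math.sqrt(s11), math.sqrt(s22)
    rho = s12 / (sd1 * sd2)
    e1, e2 = (y1 - a[0]) / sd1, (y2 - a[1]) / sd2
    quad = (e1 * e1 + e2 * e2 - 2 * rho * e1 * e2) / (1 - rho * rho)
    return math.exp(-quad / 2) / (2 * math.pi * sd1 * sd2 * math.sqrt(1 - rho * rho))

points = [(1.0, -2.0), (0.0, 0.0), (3.5, -1.0)]
gaps = [abs(density_by_change_of_variable(*p) - bivariate_normal_pdf(*p)) for p in points]
```

    The two densities agree at every point up to rounding error, which is consistent with the transformed variables being jointly normal.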

    This bivariate normal distribution is the subject of Section B.9. Note that by formulating it as we did above, we can generalize directly to the multivariate case, that is, with an arbitrary number of variables.

    Perhaps the more common situation is that in which it is necessary to find the distribution of one function of two (or more) random variables. A strategy that often works in this case is to form the joint distribution of the transformed variable and one of the original variables, then integrate (or sum) the latter out of the joint distribution to obtain the marginal distribution. Thus, to find the distribution of $y_1(x_1, x_2)$, we might formulate

    $$y_1 = y_1(x_1, x_2), \qquad y_2 = x_2.$$

    The Jacobian would then be

    $$J = \text{abs} \begin{vmatrix} \partial x_1/\partial y_1 & \partial x_1/\partial y_2 \\ 0 & 1 \end{vmatrix} = \text{abs}\left( \frac{\partial x_1}{\partial y_1} \right).$$

    The density of $y_1$ would then be

    $$f_{y_1}(y_1) = \int_{y_2} f_x[x_1(y_1, y_2),\ y_2]\, \text{abs}\left( \frac{\partial x_1}{\partial y_1} \right) dy_2.$$

    B.8 CONDITIONING IN A BIVARIATE DISTRIBUTION

    Conditioning and the use of conditional distributions play a pivotal role in econometric modeling. We consider some general results for a bivariate distribution. (All these results can be extended directly to the multivariate case.)

    In a bivariate distribution, there is a conditional distribution over $y$ for each value of $x$. The conditional densities are

    $$f(y \mid x) = \frac{f(x, y)}{f_x(x)} \qquad \text{and} \qquad f(x \mid y) = \frac{f(x, y)}{f_y(y)}. \tag{B-59}$$

    It follows from (B-46) that:

    $$\text{If } x \text{ and } y \text{ are independent, then } f(y \mid x) = f_y(y) \text{ and } f(x \mid y) = f_x(x). \tag{B-60}$$

    The interpretation is that if the variables are independent, the probabilities of events relating to one variable are unrelated to the other. The definition of conditional densities implies the important result

    $$f(x, y) = f(y \mid x) f_x(x) = f(x \mid y) f_y(y). \tag{B-61}$$

    B.8.1 REGRESSION: THE CONDITIONAL MEAN

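    A conditional mean is the mean of the conditional distribution and is defined by

    $$E[y \mid x] = \begin{cases} \displaystyle\int_y y f(y \mid x)\, dy & \text{if } y \text{ is continuous,} \\[2ex] \displaystyle\sum_y y f(y \mid x) & \text{if } y \text{ is discrete.} \end{cases} \tag{B-62}$$

    The conditional mean function $E[y \mid x]$ is called the regression of $y$ on $x$.

    A random variable may always be written as

    $$y = E[y \mid x] + \big(y - E[y \mid x]\big) = E[y \mid x] + \varepsilon.$$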

    B.8.2 CONDITIONAL VARIANCE

    A conditional variance is the variance of the conditional distribution:

    $$\text{Var}[y \mid x] = E\big[(y - E[y \mid x])^2 \mid x\big] = \int_y \big(y - E[y \mid x]\big)^2 f(y \mid x)\, dy, \quad \text{if } y \text{ is continuous,} \tag{B-63}$$

    or

    $$\text{Var}[y \mid x] = \sum_y \big(y - E[y \mid x]\big)^2 f(y \mid x), \quad \text{if } y \text{ is discrete.} \tag{B-64}$$

    The computation can be simplified by using

    $$\text{Var}[y \mid x] = E[y^2 \mid x] - \big(E[y \mid x]\big)^2. \tag{B-65}$$

    The conditional variance is called the scedastic function and, like the regression, is generally a function of $x$. Unlike the conditional mean function, however, it is common for the conditional variance not to vary with $x$. We shall examine a particular case. This case does not imply, however, that $\text{Var}[y \mid x]$ equals $\text{Var}[y]$, which will usually not be true. It implies only that the conditional variance is a constant. The case in which the conditional variance does not vary with $x$ is called homoscedasticity (same variance).

    B.8.3 RELATIONSHIPS AMONG MARGINAL AND CONDITIONAL MOMENTS

    Some useful results for the moments of a conditional distribution are given in the following theorems.

    THEOREM B.1 Law of Iterated Expectations

    $$E[y] = E_x\big[E[y \mid x]\big]. \tag{B-66}$$

    The notation $E_x[\cdot]$ indicates the expectation over the values of $x$. Note that $E[y \mid x]$ is a function of $x$.

    THEOREM B.2 Covariance
    In any bivariate distribution,

    $$\text{Cov}[x, y] = \text{Cov}_x\big[x,\ E[y \mid x]\big] = \int_x \big(x - E[x]\big)\, E[y \mid x]\, f_x(x)\, dx. \tag{B-67}$$

    (Note that this is the covariance of $x$ and a function of $x$.)


    The preceding results provide an additional, extremely useful result for the special case in which the conditional mean function is linear in $x$.

    THEOREM B.3 Moments in a Linear Regression
    If $E[y \mid x] = \alpha + \beta x$, then

    $$\alpha = E[y] - \beta E[x] \qquad \text{and} \qquad \beta = \frac{\text{Cov}[x, y]}{\text{Var}[x]}. \tag{B-68}$$

    The proof follows from (B-66).

    The preceding theorems relate to the conditional mean in a bivariate distribution. The following theorems, which also appear in various forms in regression analysis, describe the conditional variance.

    THEOREM B.4 Decomposition of Variance
    In a joint distribution,

    $$\text{Var}[y] = \text{Var}_x\big[E[y \mid x]\big] + E_x\big[\text{Var}[y \mid x]\big]. \tag{B-69}$$

    The notation $\text{Var}_x[\cdot]$ indicates the variance over the distribution of $x$. This equation states that in a bivariate distribution, the variance of $y$ decomposes into the variance of the conditional mean function plus the expected variance around the conditional mean.
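
    The law of iterated expectations, the covariance result, and the decomposition of variance can all be confirmed exactly on a small discrete distribution. The joint probabilities below are hypothetical, chosen only so that the conditional mean and variance both vary with x; the helper names are likewise illustrative.

```python
from fractions import Fraction as F

# Hypothetical joint pmf f(x, y), for illustration only.
joint = {(0, 1): F(1, 8), (0, 3): F(1, 8), (1, 2): F(1, 4), (1, 6): F(1, 4), (2, 5): F(1, 4)}

fx = {}
for (x, y), p in joint.items():
    fx[x] = fx.get(x, 0) + p          # marginal distribution of x

def cond_mean(x):
    """E[y | x] from the conditional distribution f(x, y) / f_x(x)."""
    return sum(y * p for (xx, y), p in joint.items() if xx == x) / fx[x]

def cond_var(x):
    """Var[y | x] around the conditional mean."""
    m = cond_mean(x)
    return sum((y - m) ** 2 * p for (xx, y), p in joint.items() if xx == x) / fx[x]

Ex = sum(x * p for x, p in fx.items())
Ey = sum(y * p for (_, y), p in joint.items())
var_y = sum((y - Ey) ** 2 * p for (_, y), p in joint.items())
cov_xy = sum((x - Ex) * (y - Ey) * p for (x, y), p in joint.items())

iterated_mean = sum(cond_mean(x) * p for x, p in fx.items())                  # E_x[E[y|x]]
var_of_cond_mean = sum((cond_mean(x) - Ey) ** 2 * p for x, p in fx.items())   # Var_x[E[y|x]]
mean_of_cond_var = sum(cond_var(x) * p for x, p in fx.items())                # E_x[Var[y|x]]
cov_x_cond_mean = sum((x - Ex) * cond_mean(x) * p for x, p in fx.items())     # Cov_x[x, E[y|x]]
```

    Because the arithmetic uses exact fractions, the three identities hold with equality rather than approximately.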

    THEOREM B.5 Residual Variance in a Regression
    In any bivariate distribution,

    $$E_x\big[\text{Var}[y \mid x]\big] = \text{Var}[y] - \text{Var}_x\big[E[y \mid x]\big]. \tag{B-70}$$

    On average, conditioning reduces the variance of the variable subject to the conditioning. For example, if $y$ is homoscedastic, then we have the unambiguous result that the variance of the conditional distribution(s) is less than or equal to the unconditional variance of $y$. Going a step further, we have the result that appears prominently in the bivariate normal distribution (Section B.9).

    THEOREM B.6 Linear Regression and Homoscedasticity
    In a bivariate distribution, if $E[y \mid x] = \alpha + \beta x$ and if $\text{Var}[y \mid x]$ is a constant, then

    $$\text{Var}[y \mid x] = \text{Var}[y]\big(1 - \text{Corr}^2[y, x]\big) = \sigma_y^2 \big(1 - \rho_{xy}^2\big). \tag{B-71}$$

    The proof is straightforward using Theorems B.2 to B.4.

    B.8.4 THE ANALYSIS OF VARIANCE

    The variance decomposition result implies that in a bivariate distribution, variation in $y$ arises from two sources:

    1. Variation because $E[y \mid x]$ varies with $x$:

    $$\text{regression variance} = \text{Var}_x\big[E[y \mid x]\big]. \tag{B-72}$$

    2. Variation because, in each conditional distribution, $y$ varies around the conditional mean:

    $$\text{residual variance} = E_x\big[\text{Var}[y \mid x]\big]. \tag{B-73}$$

    Thus,

    $$\text{Var}[y] = \text{regression variance} + \text{residual variance}. \tag{B-74}$$

    In analyzing a regression, we shall usually be interested in which of the two parts of the total variance, $\text{Var}[y]$, is the larger one. A natural measure is the ratio

    $$\text{coefficient of determination} = \frac{\text{regression variance}}{\text{total variance}}. \tag{B-75}$$

    In the setting of a linear regression, (B-75) arises from another relationship that emphasizes the interpretation of the correlation coefficient:

    $$\text{If } E[y \mid x] = \alpha + \beta x, \text{ then the coefficient of determination} = \text{COD} = \rho^2, \tag{B-76}$$

    where $\rho^2$ is the squared correlation between $x$ and $y$. We conclude that the correlation coefficient (squared) is a measure of the proportion of the variance of $y$ accounted for by variation in the mean of $y$ given $x$. It is in this sense that correlation can be interpreted as a measure of linear association between two variables.
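
    A small exact example may help fix the link between the coefficient of determination and the squared correlation. The design below is hypothetical: x takes the values 0, 1, 2 with the stated probabilities, and y = 1 + 2x + e, where e is -1 or +1 with equal probability and independent of x, so that the conditional mean is linear.

```python
from fractions import Fraction as F

# Hypothetical design with E[y | x] = 1 + 2x and Var[y | x] = 1.
px = {0: F(1, 2), 1: F(1, 4), 2: F(1, 4)}
joint = {(x, 1 + 2 * x + e): p / 2 for x, p in px.items() for e in (-1, 1)}

Ex = sum(x * p for (x, _), p in joint.items())
Ey = sum(y * p for (_, y), p in joint.items())
var_x = sum((x - Ex) ** 2 * p for (x, _), p in joint.items())
var_y = sum((y - Ey) ** 2 * p for (_, y), p in joint.items())
cov_xy = sum((x - Ex) * (y - Ey) * p for (x, y), p in joint.items())

regression_variance = sum((1 + 2 * x - Ey) ** 2 * p for x, p in px.items())        # Var_x[E[y|x]]
residual_variance = sum((y - (1 + 2 * x)) ** 2 * p for (x, y), p in joint.items())  # E_x[Var[y|x]]

cod = regression_variance / var_y
rho_squared = cov_xy**2 / (var_x * var_y)
slope = cov_xy / var_x
```

    In this example the coefficient of determination and the squared correlation both equal 11/15, and the ratio Cov[x, y]/Var[x] recovers the slope of 2.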

    B.9 THE BIVARIATE NORMAL DISTRIBUTION

    A bivariate distribution that embodies many of the features described earlier is the bivariate normal, which is the joint distribution of two normally distributed variables. The density is

    $$f(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}}\, \exp\left\{ -\frac{1}{2}\, \frac{\varepsilon_x^2 + \varepsilon_y^2 - 2\rho\, \varepsilon_x \varepsilon_y}{1 - \rho^2} \right\}, \qquad \varepsilon_x = \frac{x - \mu_x}{\sigma_x}, \quad \varepsilon_y = \frac{y - \mu_y}{\sigma_y}. \tag{B-77}$$

    The parameters $\mu_x$, $\sigma_x$, $\mu_y$, and $\sigma_y$ are the means and standard deviations of the marginal distributions of $x$ and $y$, respectively. The additional parameter $\rho$ is the correlation between $x$ and $y$. The covariance is

    $$\sigma_{xy} = \rho\, \sigma_x \sigma_y. \tag{B-78}$$

    The density is defined only if $\rho$ is not $1$ or $-1$, which in turn requires that the two variables not be linearly related. If $x$ and $y$ have a bivariate normal distribution, denoted

    $$(x, y) \sim N_2\big[\mu_x, \mu_y, \sigma_x^2, \sigma_y^2, \rho\big],$$

    then

    • The marginal distributions are normal:

    $$f_x(x) = N\big[\mu_x, \sigma_x^2\big], \qquad f_y(y) = N\big[\mu_y, \sigma_y^2\big]. \tag{B-79}$$

    • The conditional distributions are normal:

    $$f(y \mid x) = N\big[\alpha + \beta x,\ \sigma_y^2(1 - \rho^2)\big], \qquad \alpha = \mu_y - \beta \mu_x, \quad \beta = \frac{\sigma_{xy}}{\sigma_x^2}, \tag{B-80}$$

    and likewise for $f(x \mid y)$.

    • $x$ and $y$ are independent if and only if $\rho = 0$. The density factors into the product of the two marginal normal distributions if $\rho = 0$.

    Two things to note about the conditional distributions beyond their normality are their linear regression functions and their constant conditional variances. The conditional variance is less than the unconditional variance, which is consistent with the results of the previous section.
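
    These properties can be checked numerically by confirming that the joint density equals the product of the conditional density of y given x and the marginal density of x. The parameter values and function names in the sketch below are illustrative assumptions.

```python
import math

mu_x, mu_y, sd_x, sd_y, rho = 1.0, -0.5, 2.0, 1.5, 0.6  # illustrative parameters

def normal_pdf(v, mean, sd):
    """Univariate normal density."""
    return math.exp(-((v - mean) ** 2) / (2 * sd * sd)) / (math.sqrt(2 * math.pi) * sd)

def joint_pdf(x, y):
    """Bivariate normal density written in terms of standardized deviations."""
    ex, ey = (x - mu_x) / sd_x, (y - mu_y) / sd_y
    quad = (ex * ex + ey * ey - 2 * rho * ex * ey) / (1 - rho * rho)
    return math.exp(-quad / 2) / (2 * math.pi * sd_x * sd_y * math.sqrt(1 - rho * rho))

beta = rho * sd_x * sd_y / sd_x**2        # sigma_xy / sigma_x^2
alpha = mu_y - beta * mu_x
cond_sd = sd_y * math.sqrt(1 - rho * rho)  # standard deviation of y given x

def factored_pdf(x, y):
    """f(y | x) * f_x(x), using the linear conditional mean and constant conditional variance."""
    return normal_pdf(y, alpha + beta * x, cond_sd) * normal_pdf(x, mu_x, sd_x)

points = [(1.0, -0.5), (0.0, 1.0), (4.0, -3.0)]
gaps = [abs(joint_pdf(x, y) - factored_pdf(x, y)) for x, y in points]
```

    The factorization holds to rounding error at each point, and the conditional standard deviation is smaller than the unconditional one whenever the correlation is nonzero.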

    B.10 MULTIVARIATE DISTRIBUTIONS

    The extension of the results for bivariate distributions to more than two variables is direct. It is made much more convenient by using matrices and vectors. The term random vector applies to a vector whose elements are random variables. The joint density is $f(\mathbf{x})$, whereas the cdf is

    $$F(\mathbf{x}) = \int_{-\infty}^{x_n} \int_{-\infty}^{x_{n-1}} \cdots \int_{-\infty}^{x_1} f(\mathbf{t})\, dt_1 \cdots dt_{n-1}\, dt_n. \tag{B-81}$$

    Note that the cdf is an $n$-fold integral. The marginal distribution of any one (or more) of the $n$ variables is obtained by integrating or summing over the other variables.

    B.10.1 MOMENTS

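    The expected value of a vector or matrix is the vector or matrix of expected values. A mean vector is defined as

    $$\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix} = \begin{bmatrix} E[x_1] \\ E[x_2] \\ \vdots \\ E[x_n] \end{bmatrix} = E[\mathbf{x}]. \tag{B-82}$$

    Define the matrix

    $$(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})' = \begin{bmatrix} (x_1 - \mu_1)(x_1 - \mu_1) & (x_1 - \mu_1)(x_2 - \mu_2) & \cdots & (x_1 - \mu_1)(x_n - \mu_n) \\ (x_2 - \mu_2)(x_1 - \mu_1) & (x_2 - \mu_2)(x_2 - \mu_2) & \cdots & (x_2 - \mu_2)(x_n - \mu_n) \\ \vdots & \vdots & & \vdots \\ (x_n - \mu_n)(x_1 - \mu_1) & (x_n - \mu_n)(x_2 - \mu_2) & \cdots & (x_n - \mu_n)(x_n - \mu_n) \end{bmatrix}.$$

    The expected value of each element in the matrix is the covariance of the two variables in the product. (The covariance of a variable with itself is its variance.) Thus,

    $$E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})'] = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{bmatrix} = E[\mathbf{x}\mathbf{x}'] - \boldsymbol{\mu}\boldsymbol{\mu}', \tag{B-83}$$

    which is the covariance matrix of the random vector $\mathbf{x}$. Henceforth, we shall denote the covariance matrix of a random vector in boldface, as in

    $$\text{Var}[\mathbf{x}] = \boldsymbol{\Sigma}.$$

    By dividing $\sigma_{ij}$ by $\sigma_i \sigma_j$, we obtain the correlation matrix:

    $$\mathbf{R} = \begin{bmatrix} 1 & \rho_{12} & \rho_{13} & \cdots & \rho_{1n} \\ \rho_{21} & 1 & \rho_{23} & \cdots & \rho_{2n} \\ \vdots & \vdots & \vdots & & \vdots \\ \rho_{n1} & \rho_{n2} & \rho_{n3} & \cdots & 1 \end{bmatrix}.$$

    B.10.2 SETS OF LINEAR FUNCTIONS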

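    Our earlier results for the mean and variance of a linear function can be extended to the multivariate case. For the mean,

    $$E[a_1 x_1 + a_2 x_2 + \cdots + a_n x_n] = E[\mathbf{a}'\mathbf{x}] = a_1 E[x_1] + a_2 E[x_2] + \cdots + a_n E[x_n] = a_1 \mu_1 + a_2 \mu_2 + \cdots + a_n \mu_n = \mathbf{a}'\boldsymbol{\mu}. \tag{B-84}$$

    For the variance,

    $$\text{Var}[\mathbf{a}'\mathbf{x}] = E\big[(\mathbf{a}'\mathbf{x} - E[\mathbf{a}'\mathbf{x}])^2\big] = E\big[\{\mathbf{a}'(\mathbf{x} - E[\mathbf{x}])\}^2\big] = E[\mathbf{a}'(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})'\mathbf{a}],$$

    as $E[\mathbf{x}] = \boldsymbol{\mu}$ and $\mathbf{a}'(\mathbf{x} - \boldsymbol{\mu}) = (\mathbf{x} - \boldsymbol{\mu})'\mathbf{a}$. Since $\mathbf{a}$ is a vector of constants,

    $$\text{Var}[\mathbf{a}'\mathbf{x}] = \mathbf{a}' E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})']\, \mathbf{a} = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a} = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \sigma_{ij}. \tag{B-85}$$

    Since it is the expected value of a square, we know that a variance cannot be negative. As such, the preceding quadratic form is nonnegative, and the symmetric matrix $\boldsymbol{\Sigma}$ must be nonnegative definite.

    In the set of linear functions $\mathbf{y} = \mathbf{A}\mathbf{x}$, the $i$th element of $\mathbf{y}$ is $y_i = \mathbf{a}_i \mathbf{x}$, where $\mathbf{a}_i$ is the $i$th row of $\mathbf{A}$ [see result (A-14)]. Therefore,

    $$E[y_i] = \mathbf{a}_i \boldsymbol{\mu}.$$

    Collecting the results in a vector, we have

    $$E[\mathbf{A}\mathbf{x}] = \mathbf{A}\boldsymbol{\mu}. \tag{B-86}$$

    For two row vectors $\mathbf{a}_i$ and $\mathbf{a}_j$,

    $$\text{Cov}[\mathbf{a}_i \mathbf{x},\ \mathbf{a}_j \mathbf{x}] = \mathbf{a}_i \boldsymbol{\Sigma} \mathbf{a}_j'.$$

    Since $\mathbf{a}_i \boldsymbol{\Sigma} \mathbf{a}_j'$ is the $ij$th element of $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$,

    $$\text{Var}[\mathbf{A}\mathbf{x}] = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'. \tag{B-87}$$

    This matrix will be either nonnegative definite or positive definite, depending on the column rank of $\mathbf{A}$.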
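
    A two-variable case shows how the matrix results collect the scalar formulas for linear functions. The mean vector, covariance matrix, and coefficient matrix below are illustrative assumptions, and the small matrix helpers are written out by hand so that the sketch needs no external library.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[r][k] * Y[k][c] for k in range(len(Y))) for c in range(len(Y[0]))]
            for r in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

mu = [[1.0], [2.0]]                  # illustrative mean vector
Sigma = [[4.0, 1.0], [1.0, 9.0]]     # illustrative covariance matrix
A = [[1.0, 2.0], [3.0, -1.0]]        # rows define y1 = x1 + 2 x2 and y2 = 3 x1 - x2

mean_y = matmul(A, mu)                           # E[Ax] = A mu
var_y = matmul(matmul(A, Sigma), transpose(A))   # Var[Ax] = A Sigma A'

# The same quantities from the scalar formulas for two variables.
var_first = 1.0**2 * 4.0 + 2.0**2 * 9.0 + 2 * 1.0 * 2.0 * 1.0                        # Var[x1 + 2 x2]
cov_both = 1.0 * 3.0 * 4.0 + 2.0 * (-1.0) * 9.0 + (1.0 * (-1.0) + 2.0 * 3.0) * 1.0   # Cov[y1, y2]
```

    The (1,1) and (1,2) elements of the matrix product match the variance and covariance obtained from the scalar formulas.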
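
    B.10.3 NONLINEAR FUNCTIONS

    Consider a set of possibly nonlinear functions of $\mathbf{x}$, $\mathbf{y} = \mathbf{g}(\mathbf{x})$. Each element of $\mathbf{y}$ can be approximated with a linear Taylor series. Let $\mathbf{j}^i$ be the row vector of partial derivatives of the $i$th function with respect to the $n$ elements of $\mathbf{x}$:

    $$\mathbf{j}^i(\mathbf{x}) = \frac{\partial g_i(\mathbf{x})}{\partial \mathbf{x}'} = \frac{\partial y_i}{\partial \mathbf{x}'}. \tag{B-88}$$

    Then, proceeding in the now familiar way, we use $\boldsymbol{\mu}$, the mean vector of $\mathbf{x}$, as the expansion point, so that $\mathbf{j}^i(\boldsymbol{\mu})$ is the row vector of partial derivatives evaluated at $\boldsymbol{\mu}$. Then

    $$g_i(\mathbf{x}) \approx g_i(\boldsymbol{\mu}) + \mathbf{j}^i(\boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu}). \tag{B-89}$$

    From this we obtain

    $$E[g_i(\mathbf{x})] \approx g_i(\boldsymbol{\mu}), \tag{B-90}$$

    $$\text{Var}[g_i(\mathbf{x})] \approx \mathbf{j}^i(\boldsymbol{\mu})\, \boldsymbol{\Sigma}\, \mathbf{j}^i(\boldsymbol{\mu})', \tag{B-91}$$

    and

    $$\text{Cov}[g_i(\mathbf{x}),\ g_j(\mathbf{x})] \approx \mathbf{j}^i(\boldsymbol{\mu})\, \boldsymbol{\Sigma}\, \mathbf{j}^j(\boldsymbol{\mu})'. \tag{B-92}$$

    These results can be collected in a convenient form by arranging the row vectors $\mathbf{j}^i(\boldsymbol{\mu})$ in a matrix $\mathbf{J}(\boldsymbol{\mu})$. Then, corresponding to the preceding equations, we have

    $$E[\mathbf{g}(\mathbf{x})] \approx \mathbf{g}(\boldsymbol{\mu}), \tag{B-93}$$

    $$\text{Var}[\mathbf{g}(\mathbf{x})] \approx \mathbf{J}(\boldsymbol{\mu})\, \boldsymbol{\Sigma}\, \mathbf{J}(\boldsymbol{\mu})'. \tag{B-94}$$

    The matrix $\mathbf{J}(\boldsymbol{\mu})$ in the last preceding line is $\partial \mathbf{y}/\partial \mathbf{x}'$ evaluated at $\mathbf{x} = \boldsymbol{\mu}$.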
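
    The linear Taylor series approximation can be sketched for a pair of simple nonlinear functions. The choice of functions (a product and a ratio), the mean vector, and the covariance matrix below are all assumptions made for the illustration; the analytic matrix of partial derivatives is also compared with a finite-difference version as a sanity check.

```python
def g(x1, x2):
    """Two illustrative nonlinear functions of (x1, x2)."""
    return [x1 * x2, x1 / x2]

def jacobian(x1, x2):
    """Rows hold the partial derivatives of each function with respect to (x1, x2)."""
    return [[x2, x1], [1.0 / x2, -x1 / x2**2]]

def matmul(X, Y):
    return [[sum(X[r][k] * Y[k][c] for k in range(len(Y))) for c in range(len(Y[0]))]
            for r in range(len(X))]

mu = (2.0, 4.0)                          # illustrative mean vector
Sigma = [[0.04, 0.01], [0.01, 0.09]]     # illustrative covariance matrix

J = jacobian(*mu)
J_transpose = [list(col) for col in zip(*J)]
approx_mean = g(*mu)                                  # E[g(x)] is approximated by g(mu)
approx_var = matmul(matmul(J, Sigma), J_transpose)    # Var[g(x)] is approximated by J Sigma J'

h = 1e-6
numeric_J = [
    [(g(mu[0] + h, mu[1])[i] - g(mu[0] - h, mu[1])[i]) / (2 * h),
     (g(mu[0], mu[1] + h)[i] - g(mu[0], mu[1] - h)[i]) / (2 * h)]
    for i in range(2)
]
jacobian_gap = max(abs(J[i][k] - numeric_J[i][k]) for i in range(2) for k in range(2))
```

    These are first-order approximations, so their quality depends on how nonlinear the functions are over the region where x has most of its probability.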

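    B.11 THE MULTIVARIATE NORMAL DISTRIBUTION

    The foundation of most multivariate analysis in econometrics is the multivariate normal distribution. Let the vector $(x_1, x_2, \ldots, x_n)' = \mathbf{x}$ be the set of $n$ random variables, $\boldsymbol{\mu}$ their mean vector, and $\boldsymbol{\Sigma}$ their covariance matrix. The general form of the joint density is

    $$f(\mathbf{x}) = (2\pi)^{-n/2} |\boldsymbol{\Sigma}|^{-1/2}\, e^{(-1/2)(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})}. \tag{B-95}$$

    If $\mathbf{R}$ is the correlation matrix of the variables and $R_{ij} = \sigma_{ij}/(\sigma_i \sigma_j)$, then

    $$f(\mathbf{x}) = (2\pi)^{-n/2} (\sigma_1 \sigma_2 \cdots \sigma_n)^{-1} |\mathbf{R}|^{-1/2}\, e^{(-1/2)\boldsymbol{\varepsilon}'\mathbf{R}^{-1}\boldsymbol{\varepsilon}}, \tag{B-96}$$

    where $\varepsilon_i = (x_i - \mu_i)/\sigma_i$ [8].

    Two special cases are of interest. If all the variables are uncorrelated, then $\rho_{ij} = 0$ for $i \ne j$. Thus, $\mathbf{R} = \mathbf{I}$, and the density becomes

    $$f(\mathbf{x}) = (2\pi)^{-n/2} (\sigma_1 \sigma_2 \cdots \sigma_n)^{-1}\, e^{-\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon}/2} = f(x_1) f(x_2) \cdots f(x_n) = \prod_{i=1}^{n} f(x_i). \tag{B-97}$$

    As in the bivariate case, if normally distributed variables are uncorrelated, then they are independent. If $\sigma_i = \sigma$ and $\boldsymbol{\mu} = \mathbf{0}$, then $x_i \sim N[0, \sigma^2]$ and $\varepsilon_i = x_i/\sigma$, and the density becomes

    $$f(\mathbf{x}) = (2\pi)^{-n/2} (\sigma^2)^{-n/2}\, e^{-\mathbf{x}'\mathbf{x}/(2\sigma^2)}. \tag{B-98}$$

    Finally, if $\sigma = 1$,

    $$f(\mathbf{x}) = (2\pi)^{-n/2}\, e^{-\mathbf{x}'\mathbf{x}/2}. \tag{B-99}$$

    This distribution is the multivariate standard normal, or spherical normal distribution.

    B.11.1 MARGINAL AND CONDITIONAL NORMAL DISTRIBUTIONS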

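    Let $\mathbf{x}_1$ be any subset of the variables, including a single variable, and let $\mathbf{x}_2$ be the remaining variables. Partition $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ likewise so that

    $$\boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix} \qquad \text{and} \qquad \boldsymbol{\Sigma} = \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix}.$$

    Then the marginal distributions are also normal. In particular, we have the following theorem.

    THEOREM B.7 Marginal and Conditional Normal Distributions
    If $[\mathbf{x}_1, \mathbf{x}_2]$ have a joint multivariate normal distribution, then the marginal distributions are

    $$\mathbf{x}_1 \sim N(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11}) \tag{B-100}$$

    and

    $$\mathbf{x}_2 \sim N(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22}). \tag{B-101}$$

    The conditional distribution of $\mathbf{x}_1$ given $\mathbf{x}_2$ is normal as well:

    $$\mathbf{x}_1 \mid \mathbf{x}_2 \sim N(\boldsymbol{\mu}_{1.2}, \boldsymbol{\Sigma}_{11.2}), \tag{B-102}$$

    where

    $$\boldsymbol{\mu}_{1.2} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2), \tag{B-102a}$$

    $$\boldsymbol{\Sigma}_{11.2} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}. \tag{B-102b}$$

    Proof: We partition $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ as shown above and insert the parts in (B-95). To construct the density, we use (A-72) to partition the determinant,

    $$|\boldsymbol{\Sigma}| = |\boldsymbol{\Sigma}_{22}|\, \big| \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21} \big|,$$

    and (A-74) to partition the inverse,

    $$\begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix}^{-1} = \begin{bmatrix} \boldsymbol{\Sigma}_{11.2}^{-1} & -\boldsymbol{\Sigma}_{11.2}^{-1} \mathbf{B} \\ -\mathbf{B}' \boldsymbol{\Sigma}_{11.2}^{-1} & \boldsymbol{\Sigma}_{22}^{-1} + \mathbf{B}' \boldsymbol{\Sigma}_{11.2}^{-1} \mathbf{B} \end{bmatrix}.$$

    For simplicity, we let

    $$\mathbf{B} = \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1}.$$

    Inserting these in (B-95) and collecting terms produces the joint density as a product of two terms:

    $$f(\mathbf{x}_1, \mathbf{x}_2) = f_{1.2}(\mathbf{x}_1 \mid \mathbf{x}_2)\, f_2(\mathbf{x}_2).$$

    The first of these is a normal distribution with mean $\boldsymbol{\mu}_{1.2}$ and variance $\boldsymbol{\Sigma}_{11.2}$, whereas the second is the marginal distribution of $\mathbf{x}_2$.

    [8] This result is obtained by constructing $\boldsymbol{\Delta}$, the diagonal matrix with $\sigma_i$ as its $i$th diagonal element. Then, $\mathbf{R} = \boldsymbol{\Delta}^{-1} \boldsymbol{\Sigma} \boldsymbol{\Delta}^{-1}$, which implies that $\boldsymbol{\Sigma}^{-1} = \boldsymbol{\Delta}^{-1} \mathbf{R}^{-1} \boldsymbol{\Delta}^{-1}$. Inserting this in (B-95) yields (B-96). Note that the $i$th element of $\boldsymbol{\Delta}^{-1}(\mathbf{x} - \boldsymbol{\mu})$ is $(x_i - \mu_i)/\sigma_i$.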

    The conditional mean vector in the multivariate normal distribution is a linear function of the unconditional mean and the conditioning variables, and the conditional covariance matrix is constant and is smaller (in the sense discussed in Section A.7.3) than the unconditional covariance matrix. Notice that the conditional covariance matrix is the inverse of the upper left block of $\boldsymbol{\Sigma}^{-1}$; that is, this matrix is of the form shown in (A-74) for the partitioned inverse of a matrix.
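
    For the simplest partition, with one variable in each block, the conditional moments and the link to the inverse of the covariance matrix can be worked out exactly. The means and the 2 x 2 covariance matrix below are hypothetical values chosen for the illustration.

```python
from fractions import Fraction as F

mu1, mu2 = F(1), F(3)               # hypothetical means
s11, s12, s22 = F(4), F(3), F(9)    # hypothetical covariance matrix [[4, 3], [3, 9]]

def conditional_moments(x2):
    """Mean and variance of x1 given x2 for scalar blocks of the partition."""
    mean = mu1 + s12 / s22 * (x2 - mu2)
    var = s11 - s12 / s22 * s12
    return mean, var

cond_mean, cond_var = conditional_moments(F(6))

det = s11 * s22 - s12 * s12
upper_left_of_inverse = s22 / det   # (1,1) element of the inverse of the covariance matrix
```

    Here the conditional variance is 3, which is both smaller than the unconditional variance of 4 and equal to the reciprocal of the upper left element of the inverse.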

    B.11.2 THE CLASSICAL NORMAL LINEAR REGRESSION MODEL

    An important special case of the preceding is that in which $\mathbf{x}_1$ is a single variable $y$ and $\mathbf{x}_2$ is $K$ variables, $\mathbf{x}$. Then the conditional distribution is a multivariate version of that in (B-80) with $\boldsymbol{\beta} = \boldsymbol{\Sigma}_{xx}^{-1} \boldsymbol{\sigma}_{xy}$, where $\boldsymbol{\sigma}_{xy}$ is the vector of covariances of $y$ with $\mathbf{x}_2$. Recall that any random variable, $y$, can be written as its mean plus the deviation from the mean. If we apply this tautology to the multivariate normal, we obtain

    $$y = E[y \mid \mathbf{x}] + \big(y - E[y \mid \mathbf{x}]\big) = \alpha + \boldsymbol{\beta}'\mathbf{x} + \varepsilon,$$

    where $\boldsymbol{\beta}$ is given above, $\alpha = \mu_y - \boldsymbol{\beta}'\boldsymbol{\mu}_x$, and $\varepsilon$ has a normal distribution. We thus have, in this multivariate normal distribution, the classical normal linear regression model.

    B.11.3 LINEAR FUNCTIONS OF A NORMAL VECTOR

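    Any linear function of a vector of joint normally distributed variables is also normally distributed. The mean vector and covariance matrix of $\mathbf{A}\mathbf{x}$, where $\mathbf{x}$ is normally distributed, follow the general pattern given earlier. Thus,

    $$\text{If } \mathbf{x} \sim N[\boldsymbol{\mu}, \boldsymbol{\Sigma}], \text{ then } \mathbf{A}\mathbf{x} + \mathbf{b} \sim N[\mathbf{A}\boldsymbol{\mu} + \mathbf{b},\ \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}']. \tag{B-103}$$

    If $\mathbf{A}$ does not have full rank, then $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$ is singular and the density does not exist in the full dimensional space of $\mathbf{x}$, though it does exist in the subspace of dimension equal to the rank of $\boldsymbol{\Sigma}$. Nonetheless, the individual elements of $\mathbf{A}\mathbf{x} + \mathbf{b}$ will still be normally distributed, and the joint distribution of the full vector is still a multivariate normal.

    B.11.4 QUADRATIC FORMS IN A STANDARD NORMAL VECTOR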

    The earlier discussion of the chi-squared distribution gives the distribution of $\mathbf{x}'\mathbf{x}$ if $\mathbf{x}$ has a standard normal distribution. It follows from (A-36) that

    $$\mathbf{x}'\mathbf{x} = \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n\bar{x}^2. \tag{B-104}$$

    We know from (B-32) that $\mathbf{x}'\mathbf{x}$ has a chi-squared distribution. It seems natural, therefore, to invoke (B-34) for the two parts on the right-hand side of (B-104). It is not yet obvious, however, that either of the two terms has a chi-squared distribution or that the two terms are independent, as required. To show these conditions, it is necessary to derive the distributions of idempotent quadratic forms and to show when they are independent.

    To begin, the second term is the square of $\sqrt{n}\, \bar{x}$, which can easily be shown to have a standard normal distribution. Thus, the second term is the square of a standard normal variable and has a chi-squared distribution with one degree of freedom. But the first term is the sum of $n$ nonindependent variables, and it remains to be shown that the two terms are independent.

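    DEFINITION B.3 Orthonormal Quadratic Form
    A particular case of (B-103) is the following: If $\mathbf{x} \sim N[\mathbf{0}, \mathbf{I}]$ and $\mathbf{C}$ is a square matrix such that $\mathbf{C}'\mathbf{C} = \mathbf{I}$, then $\mathbf{C}'\mathbf{x} \sim N[\mathbf{0}, \mathbf{I}]$.

    Consider, then, a quadratic form in a standard normal vector $\mathbf{x}$ with symmetric matrix $\mathbf{A}$:

    $$q = \mathbf{x}'\mathbf{A}\mathbf{x}. \tag{B-105}$$

    Let the characteristic roots and vectors of $\mathbf{A}$ be arranged in a diagonal matrix $\boldsymbol{\Lambda}$ and an orthogonal matrix $\mathbf{C}$, as in Section A.6.3. Then

    $$q = \mathbf{x}'\mathbf{C}\boldsymbol{\Lambda}\mathbf{C}'\mathbf{x}. \tag{B-106}$$

    By definition, $\mathbf{C}$ satisfies the requirement that $\mathbf{C}'\mathbf{C} = \mathbf{I}$. Thus, the vector $\mathbf{y} = \mathbf{C}'\mathbf{x}$ has a standard normal distribution. Consequently,

    $$q = \mathbf{y}'\boldsymbol{\Lambda}\mathbf{y} = \sum_{i=1}^{n} \lambda_i y_i^2. \tag{B-107}$$

    If $\lambda_i$ is always one or zero, then

    $$q = \sum_{j=1}^{J} y_j^2, \tag{B-108}$$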

    which has a chi squared distribution The sum is taken over the j 1 J elements associated with the roots that are equal to one A matrix whose characteristic roots are all zero or one is idempotent Therefore we have proved the next theorem

    THEOREM B.8 Distribution of an Idempotent Quadratic Form in a Standard Normal Vector
    If $\mathbf{x} \sim N[\mathbf{0}, \mathbf{I}]$ and $\mathbf{A}$ is idempotent, then $\mathbf{x}'\mathbf{A}\mathbf{x}$ has a chi-squared distribution with degrees of freedom equal to the number of unit roots of $\mathbf{A}$, which is equal to the rank of $\mathbf{A}$.

    The rank of a matrix is equal to the number of nonzero characteristic roots it has. Therefore, the degrees of freedom in the preceding chi-squared distribution equals $J$, the rank of $\mathbf{A}$. We can apply this result to the earlier sum of squares. The first term is

    $$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \mathbf{x}'\mathbf{M}^0\mathbf{x},$$

    where $\mathbf{M}^0$ was defined in (A-34) as the matrix that transforms data to mean deviation form:

    $$\mathbf{M}^0 = \mathbf{I} - \frac{1}{n}\mathbf{i}\mathbf{i}'.$$

    Since $\mathbf{M}^0$ is idempotent, the sum of squared deviations from the mean has a chi-squared distribution. The degrees of freedom equals the rank of $\mathbf{M}^0$, which is not obvious except for the useful result in (A-108), that

    The rank of an idempotent matrix is equal to its trace. (B-109)

    Each diagonal element of $\mathbf{M}^0$ is $1 - (1/n)$; hence, the trace is $n[1 - (1/n)] = n - 1$. Therefore, we have an application of Theorem B.8:

    $$\text{If } \mathbf{x} \sim N(\mathbf{0}, \mathbf{I}), \quad \sum_{i=1}^{n} (x_i - \bar{x})^2 \sim \chi^2[n-1]. \tag{B-110}$$
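The properties of $\mathbf{M}^0$ used above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the sample size `n = 6` and the values in `x` are arbitrary illustrative choices, not data from the text.

```python
import numpy as np

n = 6                                    # illustrative sample size (assumption)
i = np.ones((n, 1))                      # column vector of ones
M0 = np.eye(n) - (i @ i.T) / n           # M0 = I - (1/n) i i'

# Arbitrary illustrative data vector (not from the text).
x = np.array([2.0, -1.0, 0.5, 3.0, -2.5, 1.0])

idempotent = np.allclose(M0 @ M0, M0)    # M0 M0 = M0
trace_M0 = np.trace(M0)                  # should equal n - 1
rank_M0 = np.linalg.matrix_rank(M0)      # rank equals trace for an idempotent matrix

# Characteristic roots of an idempotent matrix are all zero or one.
roots = np.linalg.eigvalsh(M0)
unit_roots = int(np.sum(np.isclose(roots, 1.0)))

# The quadratic form reproduces the sum of squared deviations from the mean.
quad_form = x @ M0 @ x
sum_sq_dev = ((x - x.mean()) ** 2).sum()

print(idempotent, trace_M0, rank_M0, unit_roots)
print(quad_form, sum_sq_dev)
```

The trace, the rank, and the number of unit roots should all equal $n - 1$, which is the degrees of freedom in (B-110).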

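    We have already shown that the second term in (B-104) has a chi-squared distribution with one degree of freedom. It is instructive to set this up as a quadratic form as well:

    $$n\bar{x}^2 = \mathbf{x}'\left[\frac{1}{n}\mathbf{i}\mathbf{i}'\right]\mathbf{x} = \mathbf{x}'[\mathbf{j}\mathbf{j}']\mathbf{x}, \quad \text{where } \mathbf{j} = \frac{1}{\sqrt{n}}\mathbf{i}. \tag{B-111}$$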

    The matrix in brackets is the outer product of a nonzero vector, which always has rank one. You can verify that it is idempotent by multiplication. Thus, $\mathbf{x}'\mathbf{x}$ is the sum of two chi-squared variables, one with $n - 1$ degrees of freedom and the other with one. It is now necessary to show that the two terms are independent. To do so, we will use the next theorem.

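    THEOREM B.9 Independence of Idempotent Quadratic Forms
    If $\mathbf{x} \sim N[\mathbf{0}, \mathbf{I}]$ and $\mathbf{x}'\mathbf{A}\mathbf{x}$ and $\mathbf{x}'\mathbf{B}\mathbf{x}$ are two idempotent quadratic forms in $\mathbf{x}$, then $\mathbf{x}'\mathbf{A}\mathbf{x}$ and $\mathbf{x}'\mathbf{B}\mathbf{x}$ are independent if $\mathbf{A}\mathbf{B} = \mathbf{0}$. (B-112)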

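    As before, we show the result for the general case and then specialize it for the example. Since both $\mathbf{A}$ and $\mathbf{B}$ are symmetric and idempotent, $\mathbf{A} = \mathbf{A}'\mathbf{A}$ and $\mathbf{B} = \mathbf{B}'\mathbf{B}$. The quadratic forms are therefore

    $$\mathbf{x}'\mathbf{A}\mathbf{x} = \mathbf{x}'\mathbf{A}'\mathbf{A}\mathbf{x} = \mathbf{x}_1'\mathbf{x}_1, \text{ where } \mathbf{x}_1 = \mathbf{A}\mathbf{x}, \quad \text{and} \quad \mathbf{x}'\mathbf{B}\mathbf{x} = \mathbf{x}_2'\mathbf{x}_2, \text{ where } \mathbf{x}_2 = \mathbf{B}\mathbf{x}. \tag{B-113}$$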

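    Both vectors have zero mean vectors, so the covariance matrix of $\mathbf{x}_1$ and $\mathbf{x}_2$ is

    $$E(\mathbf{x}_1\mathbf{x}_2') = \mathbf{A}\mathbf{I}\mathbf{B}' = \mathbf{A}\mathbf{B} = \mathbf{0}.$$

    Since $\mathbf{A}\mathbf{x}$ and $\mathbf{B}\mathbf{x}$ are linear functions of a normally distributed random vector, they are, in turn, normally distributed. Their zero covariance matrix implies that they are statistically independent,[9] which establishes the independence of the two quadratic forms. For the case of $\mathbf{x}'\mathbf{x}$, the two matrices are $\mathbf{M}^0$ and $[\mathbf{I} - \mathbf{M}^0]$. You can show that $\mathbf{M}^0[\mathbf{I} - \mathbf{M}^0] = \mathbf{0}$ just by multiplying.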
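The multiplication suggested above, and the independence it implies, can be illustrated numerically. This is only a sketch under stated assumptions: NumPy is assumed, and the sample size, the seed, and the number of replications are arbitrary choices made for the illustration. A near-zero sample correlation is consistent with independence but does not prove it.

```python
import numpy as np

n = 5                                   # illustrative sample size (assumption)
i = np.ones(n)
j = i / np.sqrt(n)                      # j = (1/sqrt(n)) i
P = np.outer(j, j)                      # jj' = (1/n) ii' = I - M0
M0 = np.eye(n) - P

# Deterministic checks: both matrices are idempotent and their product is zero.
p_idempotent = np.allclose(P @ P, P)
product_is_zero = np.allclose(M0 @ P, 0.0)

# Monte Carlo illustration with standard normal vectors.
rng = np.random.default_rng(0)          # arbitrary seed
X = rng.standard_normal(size=(100_000, n))
xbar = X.mean(axis=1)
q1 = ((X - xbar[:, None]) ** 2).sum(axis=1)   # x'M0x
q2 = n * xbar ** 2                            # x'[jj']x

decomposition_holds = np.allclose(q1 + q2, (X ** 2).sum(axis=1))
corr = np.corrcoef(q1, q2)[0, 1]

print(p_idempotent, product_is_zero, decomposition_holds)
print(q1.mean(), q2.mean(), corr)
```

The simulated means of the two quadratic forms should be close to $n - 1$ and 1, the means of the corresponding chi-squared variables, and their sample correlation should be close to zero.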
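    B.11.5 THE F DISTRIBUTION

    The normal family of distributions (chi-squared, $F$, and $t$) can all be derived as functions of idempotent quadratic forms in a standard normal vector. The $F$ distribution is the ratio of two independent chi-squared variables, each divided by its respective degrees of freedom. Let $\mathbf{A}$ and $\mathbf{B}$ be two idempotent matrices with ranks $r_a$ and $r_b$, and let $\mathbf{A}\mathbf{B} = \mathbf{0}$. Then

    $$\frac{\mathbf{x}'\mathbf{A}\mathbf{x}/r_a}{\mathbf{x}'\mathbf{B}\mathbf{x}/r_b} \sim F[r_a, r_b]. \tag{B-114}$$

    If $\mathrm{Var}[\mathbf{x}] = \sigma^2\mathbf{I}$ instead, then this is modified to

    $$\frac{(\mathbf{x}'\mathbf{A}\mathbf{x}/\sigma^2)/r_a}{(\mathbf{x}'\mathbf{B}\mathbf{x}/\sigma^2)/r_b} \sim F[r_a, r_b]. \tag{B-115}$$

    B.11.6 A FULL RANK QUADRATIC FORM

    Finally, consider the general case, $\mathbf{x} \sim N[\boldsymbol{\mu}, \boldsymbol{\Sigma}]$. We are interested in the distribution of

    $$q = (\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}). \tag{B-116}$$

    [9] Note that both $\mathbf{x}_1 = \mathbf{A}\mathbf{x}$ and $\mathbf{x}_2 = \mathbf{B}\mathbf{x}$ have singular covariance matrices. Nonetheless, every element of $\mathbf{x}_1$ is independent of every element of $\mathbf{x}_2$, so the vectors are independent.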


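    First, the vector can be written as $\mathbf{z} = \mathbf{x} - \boldsymbol{\mu}$, and $\boldsymbol{\Sigma}$ is the covariance matrix of $\mathbf{z}$ as well as of $\mathbf{x}$. Therefore, we seek the distribution of

    $$q = \mathbf{z}'\boldsymbol{\Sigma}^{-1}\mathbf{z} = \mathbf{z}'[\mathrm{Var}(\mathbf{z})]^{-1}\mathbf{z}, \tag{B-117}$$

    where $\mathbf{z}$ is normally distributed with mean $\mathbf{0}$. This equation is a quadratic form, but not necessarily in an idempotent matrix.[10] Since $\boldsymbol{\Sigma}$ is positive definite, it has a square root. Define the symmetric matrix $\boldsymbol{\Sigma}^{1/2}$ so that $\boldsymbol{\Sigma}^{1/2}\boldsymbol{\Sigma}^{1/2} = \boldsymbol{\Sigma}$. Then

    $$\boldsymbol{\Sigma}^{-1} = \boldsymbol{\Sigma}^{-1/2}\boldsymbol{\Sigma}^{-1/2}$$

    and

    $$\mathbf{z}'\boldsymbol{\Sigma}^{-1}\mathbf{z} = \mathbf{z}'\boldsymbol{\Sigma}^{-1/2\prime}\boldsymbol{\Sigma}^{-1/2}\mathbf{z} = (\boldsymbol{\Sigma}^{-1/2}\mathbf{z})'(\boldsymbol{\Sigma}^{-1/2}\mathbf{z}) = \mathbf{w}'\mathbf{w}.$$

    Now $\mathbf{w} = \mathbf{A}\mathbf{z}$ with $\mathbf{A} = \boldsymbol{\Sigma}^{-1/2}$, so

    $$E(\mathbf{w}) = \mathbf{A}E[\mathbf{z}] = \mathbf{0}$$

    and

    $$\mathrm{Var}[\mathbf{w}] = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}' = \boldsymbol{\Sigma}^{-1/2}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{-1/2} = \boldsymbol{\Sigma}^{0} = \mathbf{I}.$$

    This provides the following important result: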

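    THEOREM B.10 Distribution of a Standardized Normal Vector
    If $\mathbf{x} \sim N[\boldsymbol{\mu}, \boldsymbol{\Sigma}]$, then $\boldsymbol{\Sigma}^{-1/2}(\mathbf{x} - \boldsymbol{\mu}) \sim N[\mathbf{0}, \mathbf{I}]$.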
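The standardization in Theorem B.10 can be sketched in code. The symmetric inverse square root is built here from the characteristic roots and vectors of $\boldsymbol{\Sigma}$; the helper name `inv_sqrt_sym`, the values of `mu` and `sigma`, the seed, and the number of draws are all assumptions made for the illustration, and NumPy is assumed to be available.

```python
import numpy as np

def inv_sqrt_sym(sigma):
    """Symmetric inverse square root of a positive definite matrix.

    Uses sigma = C Lambda C', so sigma^(-1/2) = C Lambda^(-1/2) C'.
    """
    roots, vecs = np.linalg.eigh(sigma)
    return vecs @ np.diag(roots ** -0.5) @ vecs.T

# Illustrative mean vector and positive definite covariance matrix (assumptions).
mu = np.array([1.0, -2.0, 0.5])
sigma = np.array([[2.0, 0.6, 0.2],
                  [0.6, 1.0, 0.3],
                  [0.2, 0.3, 1.5]])

A = inv_sqrt_sym(sigma)
identity_check = A @ sigma @ A.T          # should be the identity matrix

rng = np.random.default_rng(12345)        # arbitrary seed
x = rng.multivariate_normal(mu, sigma, size=200_000)
w = (x - mu) @ A.T                        # each row is Sigma^(-1/2) (x - mu)

sample_mean = w.mean(axis=0)              # close to the zero vector
sample_cov = np.cov(w, rowvar=False)      # close to the identity matrix
q = (w ** 2).sum(axis=1)                  # w'w = (x - mu)' Sigma^(-1) (x - mu)

print(np.round(identity_check, 10))
print(np.round(sample_cov, 3), q.mean())
```

The average of `q` should be close to 3, the number of elements in $\mathbf{x}$, as expected for a sum of three squared standard normal variables.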

    The simplest special case is that in which $\mathbf{x}$ has only one variable, so that the transformation is just $(x - \mu)/\sigma$. Combining this case with (B-32) concerning the sum of squares of standard normals, we have the following theorem.

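    THEOREM B.11 Distribution of $\mathbf{x}'\boldsymbol{\Sigma}^{-1}\mathbf{x}$ When $\mathbf{x}$ Is Normal
    If $\mathbf{x} \sim N[\boldsymbol{\mu}, \boldsymbol{\Sigma}]$, then $(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) \sim \chi^2[n]$.

    B.11.7 INDEPENDENCE OF A LINEAR AND A QUADRATIC FORM

    The $t$ distribution is used in many forms of hypothesis tests. In some situations, it arises as the ratio of a linear to a quadratic form in a normal vector. To establish the distribution of these statistics, we use the following result.

    [10] It will be idempotent only in the special case of $\boldsymbol{\Sigma} = \mathbf{I}$.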


    THEOREM B.12 Independence of a Linear and a Quadratic Form
    A linear function $\mathbf{L}\mathbf{x}$ and a symmetric idempotent quadratic form $\mathbf{x}'\mathbf{A}\mathbf{x}$ in a standard normal vector are statistically independent if $\mathbf{L}\mathbf{A} = \mathbf{0}$.

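    The proof follows the same logic as that for two quadratic forms. Write $\mathbf{x}'\mathbf{A}\mathbf{x}$ as $\mathbf{x}'\mathbf{A}'\mathbf{A}\mathbf{x} = (\mathbf{A}\mathbf{x})'(\mathbf{A}\mathbf{x})$. The covariance matrix of the variables $\mathbf{L}\mathbf{x}$ and $\mathbf{A}\mathbf{x}$ is $\mathbf{L}\mathbf{A} = \mathbf{0}$, which establishes the independence of these two random vectors. The independence of the linear function and the quadratic form follows since functions of independent random vectors are also independent.

    The $t$ distribution is defined as the ratio of a standard normal variable to the square root of a chi-squared variable divided by its degrees of freedom:

    $$t[J] = \frac{N[0, 1]}{\left\{\chi^2[J]/J\right\}^{1/2}}.$$

    A particular case is

    $$t[n-1] = \frac{\sqrt{n}\,\bar{x}}{\left\{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\right\}^{1/2}} = \frac{\sqrt{n}\,\bar{x}}{s},$$

    where $s$ is the standard deviation of the values of $\mathbf{x}$. The distribution of the two variables in $t[n-1]$ was shown earlier; we need only show that they are independent. But

    $$\sqrt{n}\,\bar{x} = \frac{1}{\sqrt{n}}\mathbf{i}'\mathbf{x} = \mathbf{j}'\mathbf{x} \quad \text{and} \quad s^2 = \frac{\mathbf{x}'\mathbf{M}^0\mathbf{x}}{n-1}.$$

    It suffices to show that $\mathbf{M}^0\mathbf{j} = \mathbf{0}$, which follows from

    $$\mathbf{M}^0\mathbf{i} = [\mathbf{I} - \mathbf{i}(\mathbf{i}'\mathbf{i})^{-1}\mathbf{i}']\mathbf{i} = \mathbf{i} - \mathbf{i}(\mathbf{i}'\mathbf{i})^{-1}(\mathbf{i}'\mathbf{i}) = \mathbf{0}.$$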

    APPENDIX C
    ESTIMATION AND INFERENCE

    C.1 INTRODUCTION

    The probability distributions discussed in Appendix B serve as models for the underlying data generating processes that produce our observed data. The goal of statistical inference in econometrics is to use the principles of mathematical statistics to combine these theoretical distributions and the observed data into an empirical model of the economy. This analysis takes place in one of two frameworks, classical or Bayesian. The overwhelming majority of empirical study in econometrics has been done in the classical framework. Our focus, therefore, will be on classical methods of inference. Bayesian methods will be discussed in Chapter 16.[1]

    C.2 SAMPLES AND RANDOM SAMPLING

    The classical theory of statistical inference centers on rules for using the sampled data effectively. These rules, in turn, are based on the properties of samples and sampling distributions.

    A sample of $n$ observations on one or more variables, denoted $x_1, x_2, \ldots, x_n$, is a random sample if the $n$ observations are drawn independently from the same population, or probability distribution, $f(x_i, \boldsymbol{\theta})$. The sample may be univariate if $x_i$ is a single random variable or multivariate if each observation contains several variables. A random sample of observations, denoted $[x_1, x_2, \ldots, x_n]$ or $\{x_i\}_{i=1,\ldots,n}$, is said to be independent, identically distributed, which we denote i.i.d. The vector $\boldsymbol{\theta}$ contains one or more unknown parameters.

    Data are generally drawn in one of two settings. A cross section is a sample of a number of observational units all drawn at the same point in time. A time series is a set of observations drawn on the same observational unit at a number of (usually evenly spaced) points in time. Many recent studies have been based on time-series cross sections, which generally consist of the same cross-sectional units observed at several points in time. Since the typical data set of this sort consists of a large number of cross-sectional units observed at a few points in time, the common term panel data set is usually more fitting for this sort of study.

    C.3 DESCRIPTIVE STATISTICS

    Before attempting to estimate parameters of a population or fit models to data, we normally examine the data themselves. In raw form, the sample data are a disorganized mass of information, so we will need some organizing principles to distill the information into something meaningful. Consider, first, examining the data on a single variable. In most cases, and particularly if the number of observations in the sample is large, we shall use some summary statistics to describe the sample data. Of most interest are measures of location (that is, the center of the data) and scale (the dispersion of the data). A few measures of central tendency are as follows:

    $$\text{mean: } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \text{median: } M = \text{middle ranked observation}, \qquad \text{sample midrange: midrange} = \frac{\text{maximum} + \text{minimum}}{2}. \tag{C-1}$$

    The dispersion of the sample observations is usually measured by the standard deviation:

    $$s_x = \left[\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}\right]^{1/2}. \tag{C-2}$$

    Other measures, such as the average absolute deviation from the sample mean, are also used, although less frequently than the standard deviation. The shape of the distribution of values is often of interest as well. Samples of income or expenditure data, for example, tend to be highly skewed, while financial data such as asset returns and exchange rate movements are relatively more symmetrically distributed but are also more widely dispersed than other variables that might be observed. Two measures used to quantify these effects are the

    $$\text{skewness} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^3}{s_x^3(n-1)} \quad \text{and} \quad \text{kurtosis} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^4}{s_x^4(n-1)}.$$

    [1] An excellent reference is Leamer (1978). A summary of the results as they apply to econometrics is contained in Zellner (1971) and in Judge et al. (1985). See, as well, Poirier (1991). A recent textbook with a heavy Bayesian emphasis is Poirier (1995).

    Benchmark values for these two measures are zero for a symmetric distribution, and three for one which is "normally" dispersed. The skewness coefficient has a bit less of the intuitive appeal of the mean and standard deviation, and the kurtosis measure has very little at all. The box and whisker plot is a graphical device which is often used to capture a large amount of information about the sample in a simple visual display. This plot shows in a figure the median, the range of values contained in the 25th and 75th percentiles, some limits that show the normal range of values expected, such as the median plus and minus two standard deviations, and in isolation values that could be viewed as outliers. A box and whisker plot is shown for the income variable in Example C.1.

    If the sample contains data on more than one variable, we will also be interested in measures of association among the variables. A scatter diagram is useful in a bivariate sample if the sample contains a reasonable number of observations. Figure C.1 shows an example for a small data set. If the sample is a multivariate one, then the degree of linear association among the variables can be measured by the pairwise measures

    $$\text{covariance: } s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}, \tag{C-3}$$

    $$\text{correlation: } r_{xy} = \frac{s_{xy}}{s_x s_y}. \tag{C-4}$$

    If the sample contains data on several variables, then it is sometimes convenient to arrange the covariances or correlations in a covariance matrix $\mathbf{S} = [s_{ij}]$ or correlation matrix $\mathbf{R} = [r_{ij}]$.

    Some useful algebraic results for any two variables $(x_i, y_i)$, $i = 1, \ldots, n$, and constants $a$ and $b$ are

    $$s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}, \tag{C-5}$$

    $$s_{xy} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{n-1}, \tag{C-6}$$

    $$-1 \le r_{xy} \le 1, \qquad r_{ax,by} = \frac{ab}{|ab|}\,r_{xy}, \quad a, b \ne 0, \tag{C-7}$$

    $$s_{ax} = |a|\,s_x, \qquad s_{ax,by} = (ab)\,s_{xy}. \tag{C-8}$$
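The measures in (C-1) through (C-6) are simple to compute directly. The sketch below applies them to the 20 income and education observations listed in Example C.1; the function names `describe`, `covariance`, and `correlation` are hypothetical helpers introduced only for this illustration, and NumPy is assumed.

```python
import numpy as np

# Income (thousands) and education (years) from Example C.1.
income = np.array([20.5, 31.5, 47.7, 26.2, 44.0, 8.28, 30.8, 17.2, 19.9, 9.96,
                   55.8, 25.2, 29.0, 85.5, 15.1, 28.5, 21.4, 17.7, 6.42, 84.9])
education = np.array([12, 16, 18, 16, 12, 12, 16, 12, 10, 12,
                      16, 20, 12, 16, 10, 18, 16, 20, 12, 16], dtype=float)

def describe(x):
    """Location, scale, and shape measures as defined in (C-1) and (C-2)."""
    n = len(x)
    mean = x.sum() / n
    dev = x - mean
    s = np.sqrt((dev ** 2).sum() / (n - 1))                 # (C-2)
    return {
        "mean": mean,
        "median": np.median(x),
        "midrange": (x.max() + x.min()) / 2,
        "std": s,
        "skewness": (dev ** 3).sum() / (s ** 3 * (n - 1)),
        "kurtosis": (dev ** 4).sum() / (s ** 4 * (n - 1)),
    }

def covariance(x, y):
    """Sample covariance, (C-3)."""
    return ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

def correlation(x, y):
    """Sample correlation, (C-4)."""
    return covariance(x, y) / (describe(x)["std"] * describe(y)["std"])

stats_i = describe(income)
stats_e = describe(education)
s_ie = covariance(income, education)
r_ie = correlation(income, education)

# Shortcut formulas (C-5) and (C-6) give the same values algebraically.
n = len(income)
s2_shortcut = ((income ** 2).sum() - n * income.mean() ** 2) / (n - 1)
s_ie_shortcut = ((income * education).sum()
                 - n * income.mean() * education.mean()) / (n - 1)

print(stats_i)
print(stats_e)
print(s_ie, r_ie)
```

The means, standard deviations, covariance, and correlation should agree with the values reported in Example C.1 (31.278, 14.600, 22.376, 3.119, 23.597, and 0.3382) up to rounding.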


    TABLE C.1  Income Distribution

    Range               Relative Frequency    Cumulative Frequency
    <10,000                    0.15                   0.15
    10,000-25,000              0.30                   0.45
    25,000-50,000              0.40                   0.85
    >50,000                    0.15                   1.00

    Note that these algebraic results parallel the theoretical results for bivariate probability distributions. We note in passing, while the formulas in (C-2) and (C-5) are algebraically the same, (C-2) will generally be more accurate in practice, especially when the values in the sample are very widely dispersed.

    The statistics described above will provide the analyst with a more concise description of the data than a raw tabulation. However, we have not, as yet, suggested that these measures correspond to some underlying characteristic of the process that generated the data. We do assume that there is an underlying mechanism, the data generating process, that produces the data in hand. Thus, these serve to do more than describe the data; they characterize that process, or population.

    Example C.1 Descriptive Statistics for a Random Sample

    Table C.1 is a (hypothetical) sample of observations on income and education. A scatter diagram appears in Figure C.1. It suggests a weak positive association between income and education in these data. The box and whisker plot for income at the left of the scatter plot shows the distribution of the income data as well.

    Means:

    $$\bar{I} = \frac{1}{20}\left[\begin{array}{l} 20.5 + 31.5 + 47.7 + 26.2 + 44.0 + 8.28 + 30.8 + 17.2 + 19.9 + 9.96 \\ {} + 55.8 + 25.2 + 29.0 + 85.5 + 15.1 + 28.5 + 21.4 + 17.7 + 6.42 + 84.9 \end{array}\right] = 31.278,$$

    $$\bar{E} = \frac{1}{20}\left[\begin{array}{l} 12 + 16 + 18 + 16 + 12 + 12 + 16 + 12 + 10 + 12 \\ {} + 16 + 20 + 12 + 16 + 10 + 18 + 16 + 20 + 12 + 16 \end{array}\right] = 14.600.$$

    Standard deviations:

    $$s_I = \sqrt{\tfrac{1}{19}\left[(20.5 - 31.278)^2 + \cdots + (84.9 - 31.278)^2\right]} = 22.376,$$

    $$s_E = \sqrt{\tfrac{1}{19}\left[(12 - 14.6)^2 + \cdots + (16 - 14.6)^2\right]} = 3.119.$$

    Covariance:

    $$s_{IE} = \tfrac{1}{19}\left[20.5(12) + \cdots + 84.9(16) - 20(31.28)(14.6)\right] = 23.597.$$

    Correlation:

    $$r_{IE} = \frac{23.597}{(22.376)(3.119)} = 0.3382.$$

    The positive correlation is consistent with our observation in the scatter diagram.

    FIGURE C.1 Box and Whisker Plot for Income and Scatter Diagram for Income and Education. [Vertical axis: income in thousands, 0 to 90; horizontal axis: education, 10 to 20.]

    Since we have assumed that there is an underlying probability distribution, it might be useful to produce a statistic that gives a broader view of the DGP. The histogram is a simple graphical device that produces this result; see Examples C.3 and C.4 for applications. For small samples or widely dispersed data, however, histograms tend to be rough and difficult to make informative. A burgeoning literature [see, e.g., Pagan and Ullah (1999)] has demonstrated the usefulness of the kernel density estimator as a substitute for the histogram as a descriptive tool for the underlying distribution that produced a sample of data. The underlying theory of the kernel density estimator is fairly complicated, but the computations are surprisingly simple. The estimator is computed using

    $$\hat{f}(x^*) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h}K\left[\frac{x_i - x^*}{h}\right],$$

    where $x_1, \ldots, x_n$ are the $n$ observations in the sample, $\hat{f}(x^*)$ denotes the estimated density function, $x^*$ is the value at which we wish to evaluate the density, and $h$ and $K[\cdot]$ are the "bandwidth" and "kernel function," which we now consider. The density estimator is rather like a histogram, in which the bandwidth is the width of the intervals. The kernel function is a weight function which is generally chosen so that it takes large values when $x^*$ is close to $x_i$ and tapers off to zero as they diverge in either direction. The weighting function used in the example below is the logistic density discussed in Section B.4.7. The bandwidth is chosen to be a function of $1/n$ so that the intervals can become narrower as the sample becomes larger (and richer). The one used below is $h = 0.9\,\mathrm{Min}(s, \text{range}/3)/n^{0.2}$. (We will revisit this method of estimation in Chapter 16.) Example C.2 below illustrates the computation for the income data used in Example C.1.
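The computation described above can be sketched as follows, using the income values from Example C.1, the logistic density as the kernel, and the bandwidth rule just given. The function names and the evaluation grid are assumptions made for this illustration, and NumPy is assumed; this is not the software used to produce the figure in the text.

```python
import numpy as np

def logistic_kernel(z):
    """Logistic density K(z) = e^(-z) / (1 + e^(-z))^2.

    The density is symmetric in z, so it is evaluated at |z| to avoid overflow.
    """
    e = np.exp(-np.abs(z))
    return e / (1.0 + e) ** 2

def bandwidth(x):
    """h = 0.9 * Min(s, range/3) / n^0.2."""
    x = np.asarray(x, dtype=float)
    s = x.std(ddof=1)
    data_range = x.max() - x.min()
    return 0.9 * min(s, data_range / 3.0) / len(x) ** 0.2

def kernel_density(x, grid, h=None):
    """Evaluate (1/n) * sum_i (1/h) K[(x_i - x*)/h] at each x* in grid."""
    x = np.asarray(x, dtype=float)
    grid = np.asarray(grid, dtype=float)
    if h is None:
        h = bandwidth(x)
    z = (x[None, :] - grid[:, None]) / h       # rows: grid points, columns: observations
    return logistic_kernel(z).mean(axis=1) / h

# Income data (thousands) from Example C.1.
income = np.array([20.5, 31.5, 47.7, 26.2, 44.0, 8.28, 30.8, 17.2, 19.9, 9.96,
                   55.8, 25.2, 29.0, 85.5, 15.1, 28.5, 21.4, 17.7, 6.42, 84.9])

h = bandwidth(income)
grid = np.linspace(-150.0, 250.0, 4001)        # wide grid so the tails are covered
density = kernel_density(income, grid, h)

# Trapezoid rule: the estimated density should integrate to about one.
area = np.sum((density[1:] + density[:-1]) / 2.0 * np.diff(grid))
print(h, area)
```

Because the kernel is itself a density, the estimate is nonnegative and integrates to one; a plot of `density` against `grid` over the range 0 to 100 gives the kind of picture described in Example C.2.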
    Example C.2 Kernel Density Estimator for the Income Data

    Figure C.2 suggests the large skew in the income data that is also suggested by the box and whisker plot (and the scatter plot) in Example C.1.

    FIGURE C.2 Kernel Density Estimator. [Kernel density estimate for income; horizontal axis: K_INCOME, 0 to 100; vertical axis: density, .000 to .020.]

    C.4 STATISTICS AS ESTIMATORS: SAMPLING DISTRIBUTIONS

    The measures described in the preceding section summarize the data in a random sample. Each measure has a counterpart in the population, that is, the distribution from which the data were drawn. Sample quantities such as the means and the correlation coefficient correspond to population expectations, whereas the kernel density estimator and the values in Table C.1 parallel the population pdf and cdf. In the setting of a random sample, we expect these quantities to mimic the population, although not perfectly. The precise manner in which these quantities reflect the population values defines the sampling distribution of a sample statistic.

    DEFINITION C.1 Statistic
    A statistic is any function computed from the data in a sample.

    If another sample were drawn under identical conditions, different values would be obtained for the observations, as each one is a random variable. Any statistic is a function of these random values, so it is also a random variable with a probability distribution called a sampling distribution. For example, the following shows an exact result for the sampling behavior of a widely used statistic.


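    THEOREM C.1 Sampling Distribution of the Sample Mean
    If $x_1, \ldots, x_n$ are a random sample from a population with mean $\mu$ and variance $\sigma^2$, then $\bar{x}$ is a random variable with mean $\mu$ and variance $\sigma^2/n$.
    Proof: $\bar{x} = (1/n)\sum_i x_i$. $E[\bar{x}] = (1/n)\sum_i \mu = \mu$. The observations are independent, so $\mathrm{Var}[\bar{x}] = (1/n)^2\,\mathrm{Var}[\sum_i x_i] = (1/n^2)\sum_i \sigma^2 = \sigma^2/n$.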

    Example C.3 illustrates the behavior of the sample mean in samples of four observations drawn from a chi-squared population with one degree of freedom. The crucial concepts illustrated in this example are, first, the mean and variance results in Theorem C.1 and, second, the phenomenon of sampling variability. Notice that the fundamental result in Theorem C.1 does not assume a distribution for $x_i$. Indeed, looking back at Section C.3, nothing we have done so far has required any assumption about a particular distribution.
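The mean and variance results in Theorem C.1 can be illustrated with a small simulation of the setting just described: means of samples of four observations from a chi-squared population with one degree of freedom, which has mean 1 and variance 2. This is a sketch, not the experiment behind the figure in the text; it assumes NumPy, an arbitrary seed, and 100,000 replications (rather than 1,000) so that the simulated moments are close to their theoretical values.

```python
import numpy as np

rng = np.random.default_rng(2002)          # arbitrary seed
reps, n = 100_000, 4                       # replications (assumption) and sample size

# Each row is one random sample of n observations from chi-squared[1].
samples = rng.chisquare(df=1, size=(reps, n))
means = samples.mean(axis=1)               # one sample mean per replication

mean_of_means = means.mean()               # theory: mu = 1
var_of_means = means.var(ddof=1)           # theory: sigma^2 / n = 2 / 4 = 0.5

print(mean_of_means, var_of_means)
```

The spread of `means` around 1 is the sampling variability the example refers to; a histogram of `means` gives a frequency plot of the sampling distribution.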
    Example C.3 Sampling Distribution of a Sample Mean

    Figure C.3 shows a frequency plot of the means of 1,000 random samples of four observations drawn from a chi-squared distribution with one degree of freedom, which has mean 1 and variance 2.

    We are often interested in how a statistic behaves as the sample size increases. Example C.4 illustrates one such case. Figure C.4 shows two sampling distributions, one based on samples of three and a second, of the same statistic, but based on samples of six. The effect of increasing sample size in this figure is unmistakable. It is easy to visualize the behavior of this statistic if we extrapolate the experiment in Example C.4 to samples of, say, 100.
    Example C.4 Sampling Distribution of the Sample Minimum

    If $x_1, \ldots, x_n$ are a random sample from an exponential distribution with $f(x) = \theta e^{-\theta x}$, then the sampling distribution of the sample minimum in a sample of $n$ observations, denoted $x_{(1)}$, is

    $$f(x_{(1)}) = (n\theta)e^{-(n\theta)x_{(1)}}.$$

    Since $E[x] = 1/\theta$ and $\mathrm{Var}[x] = 1/\theta^2$, by analogy $E[x_{(1)}] = 1/(n\theta)$ and $\mathrm{Var}[x_{(1)}] = 1/(n\theta)^2$. Thus, in increasingly larger samples, the minimum will be arbitrarily close to 0. [The Chebychev inequality in Theorem D.2 can be used to prove this intuitively appealing result.]

    Figure C.4 shows the results of a simple sampling experiment you can do to demonstrate this effect. It requires software that will allow you to produce pseudorandom numbers uniformly distributed in the range zero to one and that will let you plot a histogram and control the axes. (We used EA/LimDep. This can be done with Stata, Excel, or several other packages.) The experiment consists of drawing 1,000 sets of nine random values, $U_{ij}$, $i = 1, \ldots, 1{,}000$, $j = 1, \ldots, 9$. To transform these uniform draws to exponential with parameter $\theta$ (we used $\theta = 1.5$), use the inverse probability transform (see Section 11.3). For an exponentially distributed variable, the transformation is $z_{ij} = -(1/\theta)\log(1 - U_{ij})$. We then created $z_{(1)}|3$ from the first three draws and $z_{(1)}|6$ from the other six. The two histograms show clearly the effect on the sampling distribution of increasing sample size from just 3 to 6.

    FIGURE C.3 Sampling Distribution of Means of 1,000 Samples of Size 4 from Chi-Squared [1]. [Frequency histogram; horizontal axis: sample mean, 0 to 3.1; vertical axis: frequency, 0 to 80. Mean = 0.9038, Variance = 0.5637.]

    FIGURE C.4 Histograms of the Sample Minimum of 3 and 6 Observations. [Two frequency histograms; horizontal axes: minimum of 3 observations and minimum of 6 observations, .000 to 1.300.]

    Sampling distributions are used to make inferences about the population. To consider a perhaps obvious example, because the sampling distribution of the mean of a set of normally distributed observations has mean $\mu$, the sample mean is a natural candidate for an estimate of $\mu$. The observation that the sample mimics the population is a statement about the sampling distributions of the sample statistics. Consider, for example, the sample data collected in Figure C.3. The sample mean of four observations clearly has a sampling distribution, which appears to have a mean roughly equal to the population mean. Our theory of parameter estimation departs from this point.
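The sampling experiment of Example C.4 can be sketched in a few lines. The code below follows the steps described there (1,000 sets of nine uniform draws, the inverse probability transform with $\theta = 1.5$, and the minimum of the first three and of the other six draws); the use of NumPy and the seed are assumptions, and this is not the software used in the text.

```python
import numpy as np

theta = 1.5                                # exponential parameter used in the example
rng = np.random.default_rng(1)             # arbitrary seed

# 1,000 sets of nine uniform draws U_ij on [0, 1).
u = rng.uniform(size=(1000, 9))

# Inverse probability transform: z = -(1/theta) * log(1 - U) is exponential(theta).
z = -np.log(1.0 - u) / theta

min3 = z[:, :3].min(axis=1)                # sample minimum of the first three draws
min6 = z[:, 3:].min(axis=1)                # sample minimum of the other six draws

# Theory: E[x_(1)] = 1/(n*theta), so 1/4.5 for n = 3 and 1/9 for n = 6.
print(min3.mean(), 1.0 / (3 * theta))
print(min6.mean(), 1.0 / (6 * theta))
```

Histograms of `min3` and `min6` drawn on a common horizontal axis show the sampling distribution concentrating near zero as the sample size increases.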

    C.5 POINT ESTIMATION OF PARAMETERS

    Our objective is to use the sample data to infer the value of a parameter or set of parameters, which we denote $\theta$. A point estimate is a statistic computed from a sample that gives a single value for $\theta$. The standard error of the estimate is the standard deviation of the sampling distribution of the statistic; the square of this quantity is the sampling variance. An interval estimate is a range of values that will contain the true parameter with a preassigned probability. There will be a connection between the two types of estimates; generally, if $\hat{\theta}$ is the point estimate, then the interval estimate will be $\hat{\theta} \pm$ a measure of sampling error.

    An estimator is a rule or strategy for using the data to estimate the parameter. It is defined before the data are drawn. Obviously, some estimators are better than others. To take a simple example, your intuition should convince you that the sample mean would be a better estimator of the population mean than the sample minimum; the minimum is almost certain to underestimate the mean. Nonetheless, the minimum is not entirely without virtue; it is easy to compute, which is occasionally a relevant criterion. The search for good estimators constitutes much of econometrics. Estimators are compared on the basis of a variety of attributes. Finite sample properties of estimators are those attributes that can be compared regardless of the sample size. Some estimation problems involve characteristics that are not known in finite samples. In these instances, estimators are compared on the basis of their large sample, or asymptotic, properties. We consider these in turn.

    C.5.1 ESTIMATION IN A FINITE SAMPLE

    The following are some finite sample estimation criteria for estimating a single parameter. The extensions to the multiparameter case are direct. We shall consider them in passing where necessary.


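    DEFINITION C.2 Unbiased Estimator
    An estimator of a parameter $\theta$ is unbiased if the mean of its sampling distribution is $\theta$. Formally,
    $$E[\hat{\theta}] = \theta \quad \text{or} \quad E[\hat{\theta} - \theta] = \mathrm{Bias}[\hat{\theta}\,|\,\theta] = 0$$
    implies that $\hat{\theta}$ is unbiased. Note that this implies that the expected sampling error is zero. If $\boldsymbol{\theta}$ is a vector of parameters, then the estimator is unbiased if the expected value of every element of $\hat{\boldsymbol{\theta}}$ equals the corresponding element of $\boldsymbol{\theta}$.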

    If samples of size $n$ are drawn repeatedly and $\hat{\theta}$ is computed for each one, then the average value of these estimates will tend to equal $\theta$. For example, the average of the 1,000 sample means underlying Figure C.3 is 0.9038, which is reasonably close to the population mean of one. The sample minimum is clearly a biased estimator of the mean; it will almost always underestimate the mean, so it will do so on average as well.

    Unbiasedness is a desirable attribute, but it is rarely used by itself as an estimation criterion. One reason is that there are many unbiased estimators that are poor uses of the data. For example, in a sample of size $n$, the first observation drawn is an unbiased estimator of the mean that clearly wastes a great deal of information. A second criterion used to choose among unbiased estimators is efficiency.

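    DEFINITION C.3 Efficient Unbiased Estimator
    An unbiased estimator $\hat{\theta}_1$ is more efficient than another unbiased estimator $\hat{\theta}_2$ if the sampling variance of $\hat{\theta}_1$ is less than that of $\hat{\theta}_2$. That is,
    $$\mathrm{Var}[\hat{\theta}_1] < \mathrm{Var}[\hat{\theta}_2].$$
    In the multiparameter case, the comparison is based on the covariance matrices of the two estimators; $\hat{\boldsymbol{\theta}}_1$ is more efficient than $\hat{\boldsymbol{\theta}}_2$ if $\mathrm{Var}[\hat{\boldsymbol{\theta}}_2] - \mathrm{Var}[\hat{\boldsymbol{\theta}}_1]$ is a positive definite matrix.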

    By this criterion, the sample mean is obviously to be preferred to the first observation as an estimator of the population mean. If $\sigma^2$ is the population variance, then

    $$\mathrm{Var}[x_1] = \sigma^2 > \mathrm{Var}[\bar{x}] = \frac{\sigma^2}{n}.$$
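A short simulation illustrates the comparison between the first observation and the sample mean as estimators of the population mean. It is only a sketch: the normal population, its parameters, the sample size, the seed, and the number of replications are all assumptions chosen for the illustration, and NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(42)                    # arbitrary seed
mu, sigma, n, reps = 5.0, 2.0, 10, 100_000         # illustrative values (assumptions)

samples = rng.normal(loc=mu, scale=sigma, size=(reps, n))

first_obs = samples[:, 0]                          # estimator 1: the first observation
sample_mean = samples.mean(axis=1)                 # estimator 2: the sample mean

# Both estimators are unbiased for mu ...
print(first_obs.mean(), sample_mean.mean())

# ... but their sampling variances differ: sigma^2 versus sigma^2 / n.
var_first = first_obs.var(ddof=1)
var_mean = sample_mean.var(ddof=1)
print(var_first, var_mean, sigma ** 2 / n)
```

Both averages should be close to `mu`, while the variance of the sample mean should be roughly one tenth of the variance of the first observation when `n = 10`.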

    In discussing efficiency, we have restricted the discussion to unbiased estimators. Clearly, there are biased estimators that have smaller variances than the unbiased ones we have considered. Any constant has a variance of zero. Of course, using a constant as an estimator is not likely to be an effective use of the sample data. Focusing on unbiasedness may still preclude a tolerably biased estimator with a much smaller variance, however. A criterion that recognizes this possible tradeoff is the mean-squared error.


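    DEFINITION C.4 Mean-Squared Error
    The mean-squared error of an estimator is
    $$\mathrm{MSE}[\hat{\theta}\,|\,\theta] = E[(\hat{\theta} - \theta)^2] = \mathrm{Var}[\hat{\theta}] + \left(\mathrm{Bias}[\hat{\theta}\,|\,\theta]\right)^2 \quad \text{if } \theta \text{ is a scalar,} \tag{C-9}$$
    $$\mathrm{MSE}[\hat{\boldsymbol{\theta}}\,|\,\boldsymbol{\theta}] = \mathrm{Var}[\hat{\boldsymbol{\theta}}] + \mathrm{Bias}[\hat{\boldsymbol{\theta}}\,|\,\boldsymbol{\theta}]\,\mathrm{Bias}[\hat{\boldsymbol{\theta}}\,|\,\boldsymbol{\theta}]' \quad \text{if } \boldsymbol{\theta} \text{ is a vector.}$$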

    Figure C.5 illustrates the effect. On average, the biased estimator will be closer to the true parameter than will the unbiased estimator.

    Which of these criteria should be used in a given situation depends on the particulars of that setting and our objectives in the study. Unfortunately, the MSE criterion is rarely operational; minimum mean-squared error estimators, when they exist at all, usually depend on unknown parameters. Thus, we are usually less demanding. A commonly used criterion is minimum variance unbiasedness.
    Example C.5 Mean-Squared Error of the Sample Variance

    In sampling from a normal distribution, the most frequently used estimator for $\sigma^2$ is

    $$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}.$$

    It is straightforward to show that $s^2$ is unbiased, so

    $$\mathrm{Var}[s^2] = \frac{2\sigma^4}{n-1} = \mathrm{MSE}[s^2\,|\,\sigma^2].$$

    FIGURE C.5 Sampling Distributions. [Densities of an unbiased and a biased estimator; horizontal axis: estimator; vertical axis: density.]

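    [A proof is based on the distribution of the idempotent quadratic form $(\mathbf{x} - \mathbf{i}\mu)'\mathbf{M}^0(\mathbf{x} - \mathbf{i}\mu)$, which we discussed in Section B.11.4.] A less frequently used estimator is

    $$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \left[\frac{n-1}{n}\right]s^2.$$

    This estimator is slightly biased downward:

    $$E[\hat{\sigma}^2] = \frac{(n-1)E(s^2)}{n} = \frac{(n-1)\sigma^2}{n},$$

    so its bias is

    $$E[\hat{\sigma}^2 - \sigma^2] = \mathrm{Bias}[\hat{\sigma}^2\,|\,\sigma^2] = -\frac{1}{n}\sigma^2.$$

    But it has a smaller variance than $s^2$:

    $$\mathrm{Var}[\hat{\sigma}^2] = \left[\frac{n-1}{n}\right]^2\left[\frac{2\sigma^4}{n-1}\right] < \mathrm{Var}[s^2].$$

    To compare the two estimators, we can use the difference in their mean-squared errors:

    $$\mathrm{MSE}[\hat{\sigma}^2\,|\,\sigma^2] - \mathrm{MSE}[s^2\,|\,\sigma^2] = \sigma^4\left[\frac{2n-1}{n^2} - \frac{2}{n-1}\right] < 0.$$

    The biased estimator is a bit more precise. The difference will be negligible in a large sample, but, for example, it is about 1.2 percent in a sample of 16.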
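The arithmetic behind the comparison in Example C.5 can be verified with exact fractions. In this sketch the mean-squared errors are expressed in units of $\sigma^4$; the function names are hypothetical helpers introduced only for this illustration.

```python
from fractions import Fraction

def mse_s2(n):
    """MSE of s^2 in units of sigma^4: 2 / (n - 1), since s^2 is unbiased."""
    return Fraction(2, n - 1)

def mse_sigma2_hat(n):
    """MSE of sigma-hat^2 in units of sigma^4: variance plus squared bias."""
    variance = Fraction(n - 1, n) ** 2 * Fraction(2, n - 1)
    bias = Fraction(-1, n)
    return variance + bias ** 2          # simplifies to (2n - 1) / n^2

n = 16
difference = mse_sigma2_hat(n) - mse_s2(n)
print(mse_sigma2_hat(n), mse_s2(n), difference, float(difference))
# prints 31/256 2/15 -47/3840 followed by approximately -0.01224
```

For $n = 16$ the difference is $-47/3840$, or about $-0.012\sigma^4$, which is the 1.2 percent figure quoted in the example.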
    C.5.2 EFFICIENT UNBIASED ESTIMATION

    In a random sample of $n$ observations, the density of each observation is $f(x_i, \theta)$. Since the $n$ observations are independent, their joint density is

    $$f(x_1, x_2, \ldots, x_n, \theta) = f(x_1, \theta)f(x_2, \theta)\cdots f(x_n, \theta) = \prod_{i=1}^{n} f(x_i, \theta) = L(\theta\,|\,x_1, x_2, \ldots, x_n). \tag{C-10}$$

    This function, denoted $L(\theta\,|\,\mathbf{X})$, is the likelihood function for $\theta$ given the data $\mathbf{X}$. It is frequently abbreviated to $L(\theta)$. Where no ambiguity can arise, we shall abbreviate it further to $L$.
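    Example C.6 Likelihood Functions for Exponential and Normal Distributions

    If $x_1, \ldots, x_n$ are a sample of $n$ observations from an exponential distribution with parameter $\theta$, then

    $$L(\theta) = \prod_{i=1}^{n} \theta e^{-\theta x_i} = \theta^n e^{-\theta\sum_{i=1}^{n} x_i}.$$

    If $x_1, \ldots, x_n$ are a sample of $n$ observations from a normal distribution with mean $\mu$ and standard deviation $\sigma$, then

    $$L(\mu, \sigma) = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2} e^{-[1/(2\sigma^2)](x_i - \mu)^2} = (2\pi\sigma^2)^{-n/2} e^{-[1/(2\sigma^2)]\sum_i (x_i - \mu)^2}. \tag{C-11}$$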


    The likelihood function is the cornerstone for most of our theory of parameter estimation. An important result for efficient estimation is the following.

    THEOREM C.2 Cramér-Rao Lower Bound
    Assuming that the density of $x$ satisfies certain regularity conditions, the variance of an unbiased estimator of a parameter $\theta$ will always be at least as large as

    $$[I(\theta)]^{-1} = \left(-E\left[\frac{\partial^2 \ln L(\theta)}{\partial\theta^2}\right]\right)^{-1} = \left(E\left[\left(\frac{\partial \ln L(\theta)}{\partial\theta}\right)^2\right]\right)^{-1}. \tag{C-12}$$

    The quantity $I(\theta)$ is the information number for the sample. We will prove the result that the negative of the expected second derivative equals the expected square of the first derivative in the next section. Proof of the main result of the theorem is quite involved. See, for example, Stuart and Ord (1989).

    The regularity conditions are technical in nature. (See Theil (1971, Chap. 8).) Loosely, they are conditions imposed on the density of the random variable that appears in the likelihood function; these conditions will ensure that the Lindeberg-Levy central limit theorem will apply to the sample of observations on the random vector $\mathbf{y} = \partial \ln f(x, \boldsymbol{\theta})/\partial\boldsymbol{\theta}$. Among the conditions are finite moments of $x$ up to order 3. An additional condition normally included in the set is that the range of the random variable be independent of the parameters.

    In some cases, the second derivative of the log likelihood is a constant, so the Cramér-Rao bound is simple to obtain. For instance, in sampling from an exponential distribution, from Example C.6,

    $$\ln L = n\ln\theta - \theta\sum_{i=1}^{n} x_i, \qquad \frac{\partial \ln L}{\partial\theta} = \frac{n}{\theta} - \sum_{i=1}^{n} x_i,$$

    so $\partial^2 \ln L/\partial\theta^2 = -n/\theta^2$ and the variance bound is $[I(\theta)]^{-1} = \theta^2/n$. In most situations, the second derivative is a random variable with a distribution of its own. The following examples show two such cases.
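The exponential case can be checked numerically. The sketch below compares the analytic first and second derivatives of the log likelihood with finite-difference approximations, and then uses simulation to illustrate that the expected squared first derivative matches the negative of the second derivative, as stated in (C-12). The parameter value, sample size, seed, step size, and number of replications are assumptions made for the illustration; NumPy is assumed.

```python
import numpy as np

def log_likelihood(theta, x):
    """ln L = n ln(theta) - theta * sum(x) for an exponential sample."""
    return len(x) * np.log(theta) - theta * x.sum()

def score(theta, x):
    """First derivative of ln L: n/theta - sum(x)."""
    return len(x) / theta - x.sum()

rng = np.random.default_rng(7)                  # arbitrary seed
theta, n = 2.0, 50                              # illustrative values (assumptions)
x = rng.exponential(scale=1.0 / theta, size=n)  # NumPy uses scale = 1/theta

# Finite-difference checks of the first and second derivatives.
eps = 1e-4
num_score = (log_likelihood(theta + eps, x)
             - log_likelihood(theta - eps, x)) / (2 * eps)
num_second = (log_likelihood(theta + eps, x) - 2 * log_likelihood(theta, x)
              + log_likelihood(theta - eps, x)) / eps ** 2

information = n / theta ** 2                    # -E[second derivative] = n / theta^2
bound = 1.0 / information                       # Cramer-Rao bound: theta^2 / n

# Simulation: the score has mean zero and mean square equal to the information.
reps = 100_000
X = rng.exponential(scale=1.0 / theta, size=(reps, n))
scores = n / theta - X.sum(axis=1)

print(num_score, score(theta, x))
print(num_second, -information)
print(scores.mean(), (scores ** 2).mean(), information, bound)
```

Here the second derivative does not depend on the data, so the information number is exact; the simulated mean square of the score should be close to $n/\theta^2 = 12.5$.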
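    Example C.7 Variance Bound for the Poisson Distribution

    For the Poisson distribution,

    $$f(x) = \frac{e^{-\theta}\theta^x}{x!},$$

    $$\ln L = -n\theta + \left(\sum_{i=1}^{n} x_i\right)\ln\theta - \sum_{i=1}^{n}\ln(x_i!),$$

    $$\frac{\partial \ln L}{\partial\theta} = -n + \frac{\sum_{i=1}^{n} x_i}{\theta},$$

    $$\frac{\partial^2 \ln L}{\partial\theta^2} = \frac{-\sum_{i=1}^{n} x_i}{\theta^2}.$$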


    The sum of $n$ identical Poisson variables has a Poisson distribution with parameter equal to $n$ times the parameter of the individual variables. Therefore, the actual distribution of the first derivative will be that of a linear function of a Poisson distributed variable. Since $E[\sum_{i=1}^{n} x_i] = nE[x_i] = n\theta$, the variance bound for the Poisson distribution is $[I(\theta)]^{-1} = \theta/n$. Note also that the same result implies that $E[\partial \ln L/\partial\theta] = 0$, which is a result we will use in Chapter 17. The same result holds for the exponential distribution.

    Consider, finally, a multivariate case. If $\boldsymbol{\theta}$ is a vector of parameters, then $\mathbf{I}(\boldsymbol{\theta})$ is the information matrix. The Cramér-Rao theorem states that the difference between the covariance matrix of any unbiased estimator and the inverse of the information matrix,

    $$[\mathbf{I}(\boldsymbol{\theta})]^{-1} = \left(-E\left[\frac{\partial^2 \ln L(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'}\right]\right)^{-1} = \left(E\left[\left(\frac{\partial \ln L(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right)\left(\frac{\partial \ln L(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right)\right]\right)^{-1}, \tag{C-13}$$

    will be a nonnegative definite matrix.

    In most settings, numerous estimators are available for the parameters of a distribution. The usefulness of the Cramér-Rao bound is that if one of these is known to attain the variance bound, then there is no need to consider any other to seek a more efficient estimator. Regarding the use of the variance bound, we emphasize that if an unbiased estimator attains it, then that estimator is efficient. If a given estimator does not attain the variance bound, however, then we do not know, except in a few special cases, whether this estimator is efficient or not. It may be that no unbiased estimator can attain the Cramér-Rao bound, which can leave the question of whether a given unbiased estimator is efficient or not unanswered.

    We note, finally, that in some cases we further restrict the set of estimators to linear functions of the data.

    DEFINITION C.5 Minimum Variance Linear Unbiased Estimator (MVLUE)
    An estimator is the minimum variance linear unbiased estimator or best linear unbiased estimator (BLUE) if it is a linear function of the data and has minimum variance among linear unbiased estimators.

    In a few instances, such as the normal mean, there will be an efficient linear unbiased estimator; $\bar{x}$ is efficient among all unbiased estimators, both linear and nonlinear. In other cases, such as the normal variance, there is no linear unbiased estimator. This criterion is useful because we can sometimes find an MVLUE without having to specify the distribution at all. Thus, by limiting ourselves to a somewhat restricted class of estimators, we free ourselves from having to assume a particular distribution.

    C.6 INTERVAL ESTIMATION

    Regardless of the properties of an estimator, the estimate obtained will vary from sample to sample, and there is some probability that it will be quite erroneous. A point estimate will not provide any information on the likely range of error. The logic behind an interval estimate is that we use the sample data to construct an interval, [lower($X$), upper($X$)], such that we can expect this interval to contain the true parameter in some specified proportion of samples, or equivalently, with some desired level of confidence. Clearly, the wider the interval, the more confident we can be that it will, in any given sample, contain the parameter being estimated.

    The theory of interval estimation is based on a pivotal quantity, which is a function of both the parameter and a point estimate that has a known distribution. Consider the following examples.
    Example C.8 Confidence Intervals for the Normal Mean

    In sampling from a normal distribution with mean $\mu$ and standard deviation $\sigma$,

    $$z = \frac{\sqrt{n}(\bar{x} - \mu)}{s} \sim t[n-1]$$

    and

    $$c = \frac{(n-1)s^2}{\sigma^2} \sim \chi^2[n-1].$$

    Given the pivotal quantity, we can make probability statements about events involving the parameter and the estimate. Let $p(g, \theta)$ be the constructed random variable, for example, $z$ or $c$. Given a prespecified confidence level, $1 - \alpha$, we can state that

    $$\text{Prob}(\text{lower} \le p(g, \theta) \le \text{upper}) = 1 - \alpha, \qquad \text{(C-14)}$$

    where lower and upper are obtained from the appropriate table. This statement is then manipulated to make equivalent statements about the endpoints of the intervals. For example, the following statements are equivalent:

    $$\text{Prob}\left(-z \le \frac{\sqrt{n}(\bar{x} - \mu)}{s} \le z\right) = 1 - \alpha,$$

    $$\text{Prob}\left(\bar{x} - \frac{zs}{\sqrt{n}} \le \mu \le \bar{x} + \frac{zs}{\sqrt{n}}\right) = 1 - \alpha.$$

    The second of these is a statement about the interval, not the parameter; that is, it is the interval that is random, not the parameter. We attach a probability, or $100(1 - \alpha)$ percent confidence level, to the interval itself; in repeated sampling, an interval constructed in this fashion will contain the true parameter $100(1 - \alpha)$ percent of the time.

    In general, the interval constructed by this method will be of the form

    $$\text{lower}(X) = \hat\theta - e_1, \qquad \text{upper}(X) = \hat\theta + e_2,$$

    where $X$ is the sample data, $e_1$ and $e_2$ are sampling errors, and $\hat\theta$ is a point estimate of $\theta$. It is clear from the preceding example that if the sampling distribution of the pivotal quantity is either $t$ or standard normal, which will be true in the vast majority of cases we encounter in practice, then the confidence interval will be

    $$\hat\theta \pm C_{1-\alpha/2}[\text{se}(\hat\theta)], \qquad \text{(C-15)}$$

    where se($\cdot$) is the known or estimated standard error of the parameter estimate and $C_{1-\alpha/2}$ is the value from the $t$ or standard normal distribution that is exceeded with probability $\alpha/2$ (that is, the $1 - \alpha/2$ quantile). The usual values for $\alpha$ are 0.10, 0.05, or 0.01. The theory does not prescribe exactly how to choose the endpoints for the confidence interval. An obvious criterion is to minimize the width of the interval. If the sampling distribution is symmetric, then the symmetric interval is the best one. If the sampling distribution is not symmetric, however, then this procedure will not be optimal.

    Example C.9 Estimated Confidence Intervals for a Normal Mean and Variance

    In a sample of 25, $\bar{x} = 1.63$ and $s = 0.51$. Construct a 95 percent confidence interval for $\mu$. Assuming that the sample of 25 is from a normal distribution,

    $$\text{Prob}\left(-2.064 \le \frac{5(\bar{x} - \mu)}{s} \le 2.064\right) = 0.95,$$

    where 2.064 is the critical value from a $t$ distribution with 24 degrees of freedom. Thus, the confidence interval is $1.63 \pm [2.064(0.51)/5]$ or [1.4195, 1.8405].

    Remark: Had the parent distribution not been specified, it would have been natural to use the standard normal distribution instead, perhaps relying on the central limit theorem. But a sample size of 25 is small enough that the more conservative $t$ distribution might still be preferable.

    The chi-squared distribution is used to construct a confidence interval for the variance of a normal distribution. Using the data from Example 4.29, we find that the usual procedure would use

    $$\text{Prob}\left(12.4 \le \frac{24s^2}{\sigma^2} \le 39.4\right) = 0.95,$$

    where 12.4 and 39.4 are the 0.025 and 0.975 cutoff points from the chi-squared (24) distribution. This procedure leads to the 95 percent confidence interval [0.1581, 0.5032]. By making use of the asymmetry of the distribution, a narrower interval can be constructed. Allocating 4 percent to the left-hand tail and 1 percent to the right instead of 2.5 percent to each, the two cutoff points are 13.4 and 42.9, and the resulting 95 percent confidence interval is [0.1455, 0.4659].

    Finally, the confidence interval can be manipulated to obtain a confidence interval for a function of a parameter. For example, based on the preceding, a 95 percent confidence interval for $\sigma$ would be $[\sqrt{0.1581}, \sqrt{0.5032}] = [0.3976, 0.7094]$.
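The arithmetic in Example C.9 can be retraced with a short Python sketch. It takes the critical values (2.064, 12.4, 39.4, 13.4, 42.9) directly from the text rather than computing them, and the function names are illustrative choices, not part of the original. Because the cutoffs are rounded, the endpoints agree with the printed ones only to about three decimal places.

```python
import math

# Sample summary from Example C.9.
n = 25
xbar = 1.63
s = 0.51

# Critical values copied from the text (assumed, not computed here).
T_CRIT_24 = 2.064                   # t[24], 2.5 percent in each tail
CHI2_EQUAL_TAILS = (12.4, 39.4)     # chi-squared[24], 0.025 and 0.975 cutoffs
CHI2_UNEQUAL_TAILS = (13.4, 42.9)   # 4 percent left tail, 1 percent right tail


def mean_interval(xbar, s, n, t_crit):
    """Interval xbar +/- t_crit * s / sqrt(n)."""
    half_width = t_crit * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


def variance_interval(s, n, lower_cut, upper_cut):
    """Invert lower_cut <= (n - 1) s^2 / sigma^2 <= upper_cut for sigma^2."""
    scaled = (n - 1) * s ** 2
    return scaled / upper_cut, scaled / lower_cut


mean_ci = mean_interval(xbar, s, n, T_CRIT_24)
var_ci = variance_interval(s, n, *CHI2_EQUAL_TAILS)
var_ci_narrow = variance_interval(s, n, *CHI2_UNEQUAL_TAILS)
sigma_ci = tuple(math.sqrt(v) for v in var_ci)

print("mean:", mean_ci)
print("variance, equal tails:", var_ci)
print("variance, unequal tails:", var_ci_narrow)
print("standard deviation:", sigma_ci)
```

Comparing the widths of `var_ci` and `var_ci_narrow` shows the gain from the unequal allocation of the tail probabilities described in the example.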

    C.7 HYPOTHESIS TESTING

    The second major group of statistical inference procedures is hypothesis tests. The classical testing procedures are based on constructing a statistic from a random sample that will enable the analyst to decide, with reasonable confidence, whether or not the data in the sample would have been generated by a hypothesized population. The formal procedure involves a statement of the hypothesis, usually in terms of a "null" or maintained hypothesis and an "alternative," conventionally denoted $H_0$ and $H_1$, respectively. The procedure itself is a rule, stated in terms of the data, that dictates whether the null hypothesis should be rejected or not. For example, the hypothesis might state that a parameter is equal to a specified value. The decision rule might state that the hypothesis should be rejected if a sample estimate of that parameter is too far away from that value (where "far" remains to be defined). The classical, or Neyman-Pearson, methodology involves partitioning the sample space into two regions. If the observed data (i.e., the test statistic) fall in the rejection region (sometimes called the critical region), then the null hypothesis is rejected; if they fall in the acceptance region, then it is not.

    C.7.1 CLASSICAL TESTING PROCEDURES

    Since the sample is random, the test statistic, however defined, is also random. The same test procedure can lead to different conclusions in different samples. As such, there are two ways such a procedure can be in error:

    1. Type I error. The procedure may lead to rejection of the null hypothesis when it is true.
    2. Type II error. The procedure may fail to reject the null hypothesis when it is false.


    To continue the previous example, there is some probability that the estimate of the parameter will be quite far from the hypothesized value, even if the hypothesis is true. This situation might cause a type I error.

    DEFINITION C.6 Size of a Test
    The probability of a type I error is the size of the test. This is conventionally denoted $\alpha$ and is also called the significance level.

    The size of the test is under the control of the analyst. It can be changed just by changing the decision rule. Indeed, the type I error could be eliminated altogether just by making the rejection region very small, but this would come at a cost. By eliminating the probability of a type I error, that is, by making it unlikely that the hypothesis is rejected, we must increase the probability of a type II error. Ideally, we would like both probabilities to be as small as possible. It is clear, however, that there is a tradeoff between the two. The best we can hope for is that for a given probability of type I error, the procedure we choose will have as small a probability of type II error as possible.

    DEFINITION C.7 Power of a Test
    The power of a test is the probability that it will correctly lead to rejection of a false null hypothesis:

    $$\text{power} = 1 - \beta = 1 - \text{Prob(type II error)}. \qquad \text{(C-16)}$$

    For a given significance level $\alpha$, we would like $\beta$ to be as small as possible. Since $\beta$ is defined in terms of the alternative hypothesis, it depends on the value of the parameter.
    Example C.10 Testing a Hypothesis About a Mean

    For testing $H_0: \mu = \mu_0$ in a normal distribution with known variance $\sigma^2$, the decision rule is to reject the hypothesis if the absolute value of the $z$ statistic, $\sqrt{n}(\bar{x} - \mu_0)/\sigma$, exceeds the predetermined critical value. For a test at the 5 percent significance level, we set the critical value at 1.96. The power of the test, therefore, is the probability that the absolute value of the test statistic will exceed 1.96 given that the true value of $\mu$ is, in fact, not $\mu_0$. This value depends on the alternative value of $\mu$, as shown in Figure C.6. Notice that for this test the power is equal to the size at the point where $\mu$ equals $\mu_0$. As might be expected, the test becomes more powerful the farther the true mean is from the hypothesized value.

    Testing procedures, like estimators, can be compared using a number of criteria.

    DEFINITION C.8 Most Powerful Test
    A test is most powerful if it has greater power than any other test of the same size.


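    FIGURE C.6 Power Function for a Test. [Plot of the power, $1 - \beta$, on a vertical axis running from 0 to 1.0, against $\mu$, with $\mu_0$ marked on the horizontal axis.]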
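The power function of the two-sided $z$ test in Example C.10 can be evaluated numerically. The sketch below is one way to do so, using only the standard library; the values $\mu_0 = 0$, $\sigma = 1$, and $n = 25$ are illustrative assumptions and do not come from the text. Under a true mean $\mu$, the $z$ statistic is normal with mean $\sqrt{n}(\mu - \mu_0)/\sigma$ and unit variance, which is all the calculation uses.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def power_two_sided_z(mu, mu0, sigma, n, z_crit=1.96):
    """Prob(|sqrt(n)(xbar - mu0)/sigma| > z_crit) when the true mean is mu."""
    shift = math.sqrt(n) * (mu - mu0) / sigma
    lower_tail = normal_cdf(-z_crit - shift)        # statistic below -z_crit
    upper_tail = 1.0 - normal_cdf(z_crit - shift)   # statistic above +z_crit
    return lower_tail + upper_tail


# Illustrative (assumed) settings.
mu0, sigma, n = 0.0, 1.0, 25

for mu in (-0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6):
    print(f"mu = {mu:+.1f}  power = {power_two_sided_z(mu, mu0, sigma, n):.4f}")
```

At $\mu = \mu_0$ the computed power equals the size of the test, and it rises symmetrically as $\mu$ moves away from $\mu_0$ in either direction.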

    The requirement that a test be most powerful is very strong. Since the power depends on the alternative hypothesis, we might require that the test be uniformly most powerful (UMP), that is, have greater power than any other test of the same size for all admissible values of the parameter. There are few situations in which a UMP test is available. We usually must be less stringent in our requirements. Nonetheless, the criteria for comparing hypothesis testing procedures are generally based on their respective power functions. A common and very modest requirement is that the test be unbiased.

    DEFINITION C.9 Unbiased Test
    A test is unbiased if its power $(1 - \beta)$ is greater than or equal to its size $\alpha$ for all values of the parameter.

    If a test is biased, then, for some values of the parameter, we are more likely to accept the null hypothesis when it is false than when it is true. The use of the term unbiased here is unrelated to the concept of an unbiased estimator. Fortunately, there is little chance of confusion. Tests and estimators are clearly connected, however. The following criterion derives, in general, from the corresponding attribute of a parameter estimate.

    DEFINITION C.10 Consistent Test
    A test is consistent if its power goes to one as the sample size grows to infinity.

    Example C.11 Consistent Test About a Mean

    A confidence interval for the mean of a normal distribution is $\bar{x} \pm t_{1-\alpha/2}(s/\sqrt{n})$, where $\bar{x}$ and $s$ are the usual consistent estimators for $\mu$ and $\sigma$, $n$ is the sample size, and $t_{1-\alpha/2}$ is the correct critical value from the $t$ distribution with $n - 1$ degrees of freedom. For testing $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$, let the procedure be to reject $H_0$ if the confidence interval does not contain $\mu_0$. Since $\bar{x}$ is consistent for $\mu$, one can discern if $H_0$ is false as $n \to \infty$, with probability 1, because $\bar{x}$ will be arbitrarily close to the true $\mu$. Therefore, this test is consistent.

    As a general rule, a test will be consistent if it is based on a consistent estimator of the parameter.

    C.7.2 TESTS BASED ON CONFIDENCE INTERVALS

    There is an obvious link between interval estimation and the sorts of hypothesis tests we have been discussing here. The confidence interval gives a range of plausible values for the parameter. Therefore, it stands to reason that if a hypothesized value of the parameter does not fall in this range of plausible values, then the data are not consistent with the hypothesis, and it should be rejected. Consider, then, testing

    $$H_0: \theta = \theta_0, \qquad H_1: \theta \ne \theta_0.$$

    We form a confidence interval based on $\hat\theta$ as described earlier:

    $$\hat\theta - C_{1-\alpha/2}[\text{se}(\hat\theta)] < \theta < \hat\theta + C_{1-\alpha/2}[\text{se}(\hat\theta)].$$

    $H_0$ is rejected if $\theta_0$ exceeds the upper limit or is less than the lower limit. Equivalently, $H_0$ is rejected if

    $$\left|\frac{\hat\theta - \theta_0}{\text{se}(\hat\theta)}\right| > C_{1-\alpha/2}.$$

    In words, the hypothesis is rejected if the estimate is too far from $\theta_0$, where the distance is measured in standard error units. The critical value is taken from the $t$ or standard normal distribution, whichever is appropriate.
    Example C.12 Testing a Hypothesis About a Mean with a Confidence Interval

    For the results in Example C.9, test $H_0: \mu = 1.98$ versus $H_1: \mu \ne 1.98$, assuming sampling from a normal distribution:

    $$t = \frac{\bar{x} - 1.98}{s/\sqrt{n}} = \frac{1.63 - 1.98}{0.102} = -3.43.$$

    The 95 percent critical value for $t(24)$ is 2.064. Therefore, reject $H_0$. If the critical value from the standard normal table of 1.96 is used instead, then the same result is obtained.

    If the test is one-sided, as in

    $$H_0: \theta \ge \theta_0, \qquad H_1: \theta < \theta_0,$$

    then the critical region must be adjusted. Thus, for this test, $H_0$ will be rejected if a point estimate of $\theta$ falls sufficiently below $\theta_0$. Tests can usually be set up by departing from the decision criterion, "What sample results are inconsistent with the hypothesis?"

    Example C.13 One-Sided Test About a Mean

    A sample of 25 from a normal distribution yields $\bar{x} = 1.63$ and $s = 0.51$. Test

    $$H_0: \mu \le 1.5, \qquad H_1: \mu > 1.5.$$

    Clearly, no observed $\bar{x}$ less than or equal to 1.5 will lead to rejection of $H_0$. Using the borderline value of 1.5 for $\mu$, we obtain

    $$\text{Prob}\left(\frac{\sqrt{n}(\bar{x} - 1.5)}{s} > \frac{5(1.63 - 1.5)}{0.51}\right) = \text{Prob}(t_{24} > 1.27).$$

    This is approximately 0.11. This value is not unlikely by the usual standards. Hence, at the conventional significance levels of 0.10 or smaller, we would not reject the hypothesis.
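The two $t$ tests in Examples C.12 and C.13 can be reproduced with the standard library alone. The sketch below is not a substitute for a statistics package: it obtains the upper-tail probability of the $t$ distribution by integrating the density with Simpson's rule, and the helper names `t_pdf` and `t_upper_tail` are hypothetical. The critical value 2.064 is taken from the text.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(t * t / df))


def t_upper_tail(t, df, steps=2000):
    """Prob(T > t) for t >= 0: one half minus Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 - total * h / 3


n, xbar, s = 25, 1.63, 0.51
se = s / math.sqrt(n)

# Example C.12: two-sided test of mu = 1.98 against the t[24] critical value.
t_two_sided = (xbar - 1.98) / se
reject_two_sided = abs(t_two_sided) > 2.064

# Example C.13: one-sided test of mu <= 1.5, evaluated at the borderline value.
t_one_sided = (xbar - 1.5) / se
p_one_sided = t_upper_tail(t_one_sided, n - 1)

print(f"two-sided t = {t_two_sided:.2f}, reject H0: {reject_two_sided}")
print(f"one-sided t = {t_one_sided:.2f}, Prob(t24 > t) = {p_one_sided:.3f}")
```

As a check on the numerical integration, `t_upper_tail(2.064, 24)` should come out very close to 0.025, the tail area that defines the 95 percent critical value.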
    C.7.3 SPECIFICATION TESTS

    The hypothesis testing procedures just described are known as "classical" testing procedures. In each case, the null hypothesis tested came in the form of a restriction on the alternative. You can verify that in each application we examined, the parameter space assumed under the null hypothesis is a subspace of that described by the alternative. For that reason, the models implied are said to be "nested." The null hypothesis is contained within the alternative. This approach suffices for most of the testing situations encountered in practice, but there are common situations in which two competing models cannot be viewed in these terms. For example, consider a case in which there are two completely different, competing theories to explain the same observed data. Many models for censoring and truncation discussed in Chapter 21 rest upon a fragile assumption of normality, for example. Testing of this nature requires a different approach from the classical procedures discussed here. These are discussed at various points throughout the book, for example, in Chapter 13, where we study the difference between fixed and random effects models.

    APPENDIX D

    LARGE SAMPLE DISTRIBUTION THEORY

    D.1 INTRODUCTION

    Most of this book is about parameter estimation. In studying that subject, we will usually be interested in determining how best to use the observed data when choosing among competing estimators. That, in turn, requires us to examine the sampling behavior of estimators. In a few cases, such as those presented in Appendix C and the least squares estimator considered in Chapter 3, we can make broad statements about sampling distributions that will apply regardless of the size of the sample. But, in most situations, it will only be possible to make approximate statements about estimators, such as whether they improve as the sample size increases and what can be said about their sampling distributions in large samples as an approximation to the finite samples we actually observe. This appendix will collect most of the formal, fundamental theorems and results needed for this analysis. A few additional results will be developed in the discussion of time series analysis later in the book.

    D.2 LARGE SAMPLE DISTRIBUTION THEORY^1

    In most cases, whether an estimator is exactly unbiased or what its exact sampling variance is in samples of a given size will be unknown. But we may be able to obtain approximate results about the behavior of the distribution of an estimator as the sample becomes large. For example, it is well known that the distribution of the mean of a sample tends to approximate normality as the sample size grows, regardless of the distribution of the individual observations. Knowledge about the limiting behavior of the distribution of an estimator can be used to infer an approximate distribution for the estimator in a finite sample. To describe how this is done, it is necessary, first, to present some results on convergence of random variables.
    D.2.1 CONVERGENCE IN PROBABILITY

    Limiting arguments in this discussion will be with respect to the sample size $n$. Let $x_n$ be a sequence of random variables indexed by the sample size.

    DEFINITION D.1 Convergence in Probability
    The random variable $x_n$ converges in probability to a constant $c$ if $\lim_{n\to\infty} \text{Prob}(|x_n - c| > \varepsilon) = 0$ for any positive $\varepsilon$.

    Convergence in probability implies that the values that the variable may take that are not close to $c$ become increasingly unlikely as $n$ increases. To consider one example, suppose that the random variable $x_n$ takes two values, zero and $n$, with probabilities $1 - (1/n)$ and $(1/n)$, respectively. As $n$ increases, the second point will become ever more remote from any constant but, at the same time, will become increasingly less probable. In this example, $x_n$ converges in probability to zero. The crux of this form of convergence is that all the mass of the probability distribution becomes concentrated at points close to $c$. If $x_n$ converges in probability to $c$, then we write

    $$\text{plim}\, x_n = c. \qquad \text{(D-1)}$$
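The two-point example above is simple enough to evaluate exactly. The following sketch uses Python's `fractions` module to compute $\text{Prob}(|x_n - 0| > \varepsilon)$ for a few values of $n$; the function names and the choice $\varepsilon = 1/2$ are illustrative assumptions. It also reports the exact mean of $x_n$, which stays at 1 even as the probability mass concentrates at zero.

```python
from fractions import Fraction


def distribution(n):
    """x_n equals 0 with probability 1 - 1/n and n with probability 1/n."""
    return {0: 1 - Fraction(1, n), n: Fraction(1, n)}


def prob_far_from(dist, c, eps):
    """Exact Prob(|x_n - c| > eps) for a discrete distribution."""
    return sum(p for value, p in dist.items() if abs(value - c) > eps)


def expected_value(dist):
    """Exact mean of a discrete distribution."""
    return sum(value * p for value, p in dist.items())


eps = Fraction(1, 2)
for n in (10, 100, 1000, 10000):
    d = distribution(n)
    print(n, prob_far_from(d, 0, eps), expected_value(d))
```

The printed probabilities are $1/n$, which shrink to zero, so the sequence converges in probability to zero although its mean does not.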

    We will make frequent use of a special case of convergence in probability, convergence in mean square or convergence in quadratic mean.

    THEOREM D.1 Convergence in Quadratic Mean
    If $x_n$ has mean $\mu_n$ and variance $\sigma_n^2$ such that the ordinary limits of $\mu_n$ and $\sigma_n^2$ are $c$ and 0, respectively, then $x_n$ converges in mean square to $c$, and

    $$\text{plim}\, x_n = c.$$

    1. A comprehensive summary of many results in large-sample theory appears in White (2001). The results discussed here will apply to samples of independent observations. Time series cases in which observations are correlated are analyzed in Chapters 19 and 20.


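    A proof of Theorem D.1 can be based on another useful theorem.

    THEOREM D.2 Chebychev's Inequality
    If $x_n$ is a random variable and $c$ and $\varepsilon$ are constants, then $\text{Prob}(|x_n - c| > \varepsilon) \le E[(x_n - c)^2]/\varepsilon^2$.

    To establish the Chebychev inequality, we use another result (see Goldberger (1991, p. 31)).

    THEOREM D.3 Markov's Inequality
    If $y_n$ is a nonnegative random variable and $\delta$ is a positive constant, then $\text{Prob}[y_n \ge \delta] \le E[y_n]/\delta$.
    Proof: $E[y_n] = \text{Prob}[y_n < \delta]\,E[y_n \mid y_n < \delta] + \text{Prob}[y_n \ge \delta]\,E[y_n \mid y_n \ge \delta]$. Since $y_n$ is nonnegative, both terms must be nonnegative, so $E[y_n] \ge \text{Prob}[y_n \ge \delta]\,E[y_n \mid y_n \ge \delta]$. Since $E[y_n \mid y_n \ge \delta]$ must be greater than or equal to $\delta$, $E[y_n] \ge \text{Prob}[y_n \ge \delta]\,\delta$, which is the result.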

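    Now, to prove Theorem D.1, let $y_n$ be $(x_n - c)^2$ and $\delta$ be $\varepsilon^2$ in Theorem D.3. Then, $(x_n - c)^2 > \delta$ implies that $|x_n - c| > \varepsilon$. Finally, we will use a special case of the Chebychev inequality, where $c = \mu_n$, so that we have

    $$\text{Prob}(|x_n - \mu_n| > \varepsilon) \le \sigma_n^2/\varepsilon^2. \qquad \text{(D-2)}$$

    Taking the limits of $\mu_n$ and $\sigma_n^2$ in (D-2), we see that if

    $$\lim_{n\to\infty} E[x_n] = c \quad \text{and} \quad \lim_{n\to\infty} \text{Var}[x_n] = 0, \qquad \text{(D-3)}$$

    then $\text{plim}\, x_n = c$. We have shown that convergence in mean square implies convergence in probability. Mean-square convergence implies that the distribution of $x_n$ collapses to a spike at $\text{plim}\, x_n$, as shown in Figure D.1.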
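The Markov and Chebychev inequalities can be checked exactly on a small discrete distribution. The sketch below uses a single roll of a fair six-sided die, which is a hypothetical example chosen for convenience and does not appear in the text, and works in exact rational arithmetic so that both sides of each inequality can be compared without rounding.

```python
from fractions import Fraction

# Hypothetical example: one roll of a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}


def expectation(dist, g=lambda x: x):
    """Exact E[g(x)] for a discrete distribution."""
    return sum(g(x) * p for x, p in dist.items())


def prob(dist, event):
    """Exact probability of the set of outcomes for which event(x) is true."""
    return sum(p for x, p in dist.items() if event(x))


mean = expectation(die)
variance = expectation(die, lambda x: (x - mean) ** 2)


def chebychev_check(eps):
    """Return (Prob(|x - mean| > eps), variance / eps^2)."""
    return prob(die, lambda x: abs(x - mean) > eps), variance / eps ** 2


def markov_check(delta):
    """Return (Prob(y >= delta), E[y] / delta) for y = (x - mean)^2 >= 0."""
    y_prob = prob(die, lambda x: (x - mean) ** 2 >= delta)
    return y_prob, variance / delta


for eps in (1, 2, 3):
    lhs, rhs = chebychev_check(eps)
    print(f"eps = {eps}: Prob = {lhs} <= bound = {rhs}")
```

The bounds are loose in this example, which is typical: the inequalities hold for any distribution with the stated moments, so they cannot be sharp for a particular one.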
    Example D.1 Mean Square Convergence of the Sample Minimum in Exponential Sampling

    As noted in Example 4.3, in sampling of $n$ observations from an exponential distribution, for the sample minimum $x_{(1)}$,

    $$\lim_{n\to\infty} E[x_{(1)}] = \lim_{n\to\infty} \frac{1}{n\theta} = 0$$

    and

    $$\lim_{n\to\infty} \text{Var}[x_{(1)}] = \lim_{n\to\infty} \frac{1}{(n\theta)^2} = 0.$$

    Therefore, $\text{plim}\, x_{(1)} = 0$. Note, in particular, that the variance is divided by $n^2$. Thus, this estimator converges very rapidly to 0.
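A small simulation illustrates how quickly the sample minimum collapses toward zero. This is a sketch under stated assumptions: the exponential density is taken to be $\theta e^{-\theta x}$, the rate $\theta = 2$, the seed, and the number of replications are arbitrary choices, and `simulate_minimum` is a hypothetical helper. The simulated mean and variance of the minimum should be close to $1/(n\theta)$ and $1/(n\theta)^2$.

```python
import random
import statistics


def simulate_minimum(n, theta, replications, rng):
    """Draw `replications` samples of size n; return each sample's minimum."""
    return [min(rng.expovariate(theta) for _ in range(n))
            for _ in range(replications)]


rng = random.Random(12345)   # fixed seed so the run is reproducible
theta = 2.0                  # illustrative rate parameter

results = {}
for n in (5, 50, 500):
    minima = simulate_minimum(n, theta, 2000, rng)
    results[n] = (statistics.fmean(minima), statistics.pvariance(minima))
    print(f"n = {n:3d}  mean = {results[n][0]:.5f} (theory {1 / (n * theta):.5f})  "
          f"variance = {results[n][1]:.2e} (theory {1 / (n * theta) ** 2:.2e})")
```

Each tenfold increase in $n$ cuts the variance by roughly a factor of 100, in line with the $n^2$ in the denominator.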


    FIGURE D.1 Quadratic Convergence to a Constant. [Plot of the density of the estimator for $n = 10$, $n = 100$, and $n = 1000$; vertical axis: density; horizontal axis: estimator.]

    Convergence in probability does not imply convergence in mean square. Consider the simple example given earlier in which $x_n$ equals either zero or $n$ with probabilities $1 - (1/n)$ and $(1/n)$. The exact expected value of $x_n$ is 1 for all $n$, which is not the probability limit. Indeed, if we let $\text{Prob}(x_n = n^2) = (1/n)$ instead, the mean of the distribution explodes, but the probability limit is still zero. Again, the point $x_n = n^2$ becomes ever more extreme but, at the same time, becomes ever less likely.

    The conditions for convergence in mean square are usually easier to verify than those for the more general form. Fortunately, we shall rarely encounter circumstances in which it will be necessary to show convergence in probability in which we cannot rely upon convergence in mean square. Our most frequent use of this concept will be in formulating consistent estimators.

    DEFINITION D.2 Consistent Estimator
    An estimator $\hat\theta_n$ of a parameter $\theta$ is a consistent estimator of $\theta$ if and only if

    $$\text{plim}\, \hat\theta_n = \theta. \qquad \text{(D-4)}$$

    THEOREM D.4 Consistency of the Sample Mean
    The mean of a random sample from any population with finite mean $\mu$ and finite variance $\sigma^2$ is a consistent estimator of $\mu$.
    Proof: $E[\bar{x}_n] = \mu$ and $\text{Var}[\bar{x}_n] = \sigma^2/n$. Therefore, $\bar{x}_n$ converges in mean square to $\mu$, or $\text{plim}\, \bar{x}_n = \mu$.


    Theorem D.4 is broader than it might appear at first.

    COROLLARY TO THEOREM D.4 Consistency of a Mean of Functions
    In random sampling, for any function $g(x)$, if $E[g(x)]$ and $\text{Var}[g(x)]$ are finite constants, then

    $$\text{plim}\, \frac{1}{n}\sum_{i=1}^{n} g(x_i) = E[g(x)]. \qquad \text{(D-5)}$$

    Proof: Define $y_i = g(x_i)$ and use Theorem D.4.

    Example D.2 Estimating a Function of the Mean

    In sampling from a normal distribution with mean $\mu$ and variance 1, $E[e^{x}] = e^{\mu + 1/2}$ and $\text{Var}[e^{x}] = e^{2\mu + 2} - e^{2\mu + 1}$. (See Section B.4.4 on the lognormal distribution.) Hence,

    $$\text{plim}\, \frac{1}{n}\sum_{i=1}^{n} e^{x_i} = e^{\mu + 1/2}.$$
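The probability limit in Example D.2 can be illustrated by simulation. In this sketch, $\mu = 0$, the seed, and the sample sizes are assumptions made for illustration, and `mean_of_exp` is a hypothetical helper; the sample mean of $e^{x_i}$ should settle near $e^{\mu + 1/2}$ as $n$ grows.

```python
import math
import random
import statistics


def mean_of_exp(n, mu, rng):
    """Sample mean of exp(x_i) for n draws of x_i from N(mu, 1)."""
    return statistics.fmean(math.exp(rng.gauss(mu, 1.0)) for _ in range(n))


rng = random.Random(2002)     # fixed seed so the run is reproducible
mu = 0.0                      # illustrative value of the mean
target = math.exp(mu + 0.5)   # probability limit from Example D.2

estimates = {n: mean_of_exp(n, mu, rng) for n in (100, 10_000, 200_000)}
for n, value in estimates.items():
    print(f"n = {n:7d}  mean of exp(x) = {value:.4f}  (plim = {target:.4f})")
```

Because $e^{x}$ has a long right tail, the estimates for small $n$ can be noticeably off; the corollary only promises that the discrepancy vanishes in probability as $n$ increases.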

    D.2.2 OTHER FORMS OF CONVERGENCE AND LAWS OF LARGE NUMBERS

    Theorem D.4 and the corollary given above are particularly narrow forms of a set of results known as laws of large numbers that are fundamental to the theory of parameter estimation. Laws of large numbers come in two forms depending on the type of convergence considered. The simpler of these are "weak laws of large numbers," which rely on convergence in probability as we defined it above. "Strong laws" rely on a stronger type of convergence called almost sure convergence. Overall, the law of large numbers is a statement about the behavior of an average of a large number of random variables.

    THEOREM D.5 Khinchine's Weak Law of Large Numbers
    If $x_i$, $i = 1, \ldots, n$ is a random (i.i.d.) sample from a distribution with finite mean $E[x_i] = \mu$, then $\text{plim}\, \bar{x}_n = \mu$.
    Proofs of this and the theorem below are fairly intricate. Rao (1973) provides one.

    Notice that this is already broader than Theorem D.4, as it does not require that the variance of the distribution be finite. On the other hand, it is not broad enough, since most of the situations we encounter where we will need a result such as this will not involve i.i.d. random sampling. A broader result is


    THEOREM D.6 Chebychev's Weak Law of Large Numbers
    If $x_i$, $i = 1, \ldots, n$ is a sample of observations such that $E[x_i] = \mu_i < \infty$ and $\text{Var}[x_i] = \sigma_i^2 < \infty$ such that $\bar\sigma_n^2/n = (1/n^2)\sum_i \sigma_i^2 \to 0$ as $n \to \infty$, then $\text{plim}(\bar{x}_n - \bar\mu_n) = 0$.

    There is a subtle distinction between these two theorems that you should notice. The Chebychev theorem does not state that $\bar{x}_n$ converges to $\bar\mu_n$, or even that it converges to a constant at all. That would require a precise statement about the behavior of $\bar\mu_n$. The theorem states that as $n$ increases without bound, these two quantities will be arbitrarily close to each other; that is, the difference between them converges to a constant, zero. This is an important notion that enters the derivation when we consider statistics that converge to random variables, instead of to constants. What we do have with these two theorems is extremely broad conditions under which a sample mean will converge in probability to its population counterpart. The more important difference between the Khinchine and Chebychev theorems is that the second allows for heterogeneity in the distributions of the random variables that enter the mean.

    In analyzing time series data, the sequence of outcomes is itself viewed as a random event. Consider, then, the sample mean, $\bar{x}_n$. The preceding results concern the behavior of this statistic as $n \to \infty$ for a particular realization of the sequence $x_1, \ldots, x_n$. But, if the sequence, itself, is viewed as a random event, then the limit to which $\bar{x}_n$ converges may be also. The stronger notion of almost sure convergence relates to this possibility.

    DEFINITION D.3 Almost Sure Convergence
    The random variable $x_n$ converges almost surely to the constant $c$ if and only if

    $$\lim_{n\to\infty} \text{Prob}(|x_i - c| > \varepsilon \text{ for some } i \ge n) = 0 \text{ for all } \varepsilon > 0.$$

    Almost sure convergence differs from convergence in probability in an important respect. Note that the index in the probability statement is $i$, not $n$. The definition states that if a sequence converges almost surely, then for any positive $\varepsilon$, the probability that the sequence strays farther than $\varepsilon$ from $c$ at any point from $n$ onward goes to zero as $n$ increases. This is denoted $x_n \xrightarrow{a.s.} c$. Again, it states that the probability of observing a sequence that does not converge to $c$ ultimately vanishes. Intuitively, it states that once the sequence $x_n$ becomes close to $c$, it stays close to $c$.

    From the two definitions, it is clear that almost sure convergence is a stronger form of convergence. Almost sure convergence implies convergence in probability. The proof is obvious given the statements of the definitions. The event described in the definition of almost sure convergence, a deviation at some $i \ge n$, includes the deviation at $i = n$, which is the event in the definition of convergence in probability. Almost sure convergence is used in a stronger form of the law of large numbers.

    THEOREM D.7 Kolmogorov's Strong Law of Large Numbers
    If $x_i$, $i = 1, \ldots, n$ is a sequence of independently distributed random variables such that $E[x_i] = \mu_i < \infty$ and $\text{Var}[x_i] = \sigma_i^2 < \infty$ such that $\sum_{i=1}^{\infty} \sigma_i^2/i^2 < \infty$ as $n \to \infty$, then $\bar{x}_n - \bar\mu_n \xrightarrow{a.s.} 0$.


    THEOREM D.8 Markov's Strong Law of Large Numbers
    If $\{z_i\}$ is a sequence of independent random variables with $E[z_i] = \mu_i < \infty$ and if for some $\delta > 0$, $\sum_{i=1}^{\infty} E[|z_i - \mu_i|^{1+\delta}]/i^{1+\delta} < \infty$, then $\bar{z}_n - \bar\mu_n$ converges almost surely to 0, which we denote $\bar{z}_n - \bar\mu_n \xrightarrow{a.s.} 0$.^2

    The variance condition in Theorem D.7 is satisfied if the variances in the sequence are bounded, but this is not strictly required; it only requires that the variances in the sequence increase at a slow enough rate that the sum of $\sigma_i^2/i^2$ converges. The theorem allows for heterogeneity in the means and variances. If we return to the conditions of the Khinchine theorem, i.i.d. sampling, we have a corollary:

    COROLLARY TO THEOREM D.8 (Kolmogorov)
    If $x_i$, $i = 1, \ldots, n$ is a sequence of independent and identically distributed random variables such that $E[x_i] = \mu < \infty$ and $E[|x_i|] < \infty$, then $\bar{x}_n - \mu \xrightarrow{a.s.} 0$.

    Note that the corollary requires identically distributed observations while the theorem only requires independence. Finally, another form of convergence encountered in the analysis of time series data is convergence in $r$th mean:

    DEFINITION D.4 Convergence in rth Mean
    If $x_n$ is a sequence of random variables such that $E[|x_n|^r] < \infty$ and $\lim_{n\to\infty} E[|x_n - c|^r] = 0$, then $x_n$ converges in $r$th mean to $c$. This is denoted $x_n \xrightarrow{r.m.} c$.

    Surely the most common application is the one we met earlier, convergence in mean square, which is convergence in the second mean. Some useful results follow from this definition:

    THEOREM D.9 Convergence in Lower Powers
    If $x_n$ converges in $r$th mean to $c$, then $x_n$ converges in $s$th mean to $c$ for any $s < r$. The proof uses Jensen's Inequality, Theorem D.13. Write $E[|x_n - c|^s] = E[(|x_n - c|^r)^{s/r}] \le \{E[|x_n - c|^r]\}^{s/r}$, and the inner term converges to zero, so the full function must also.

    2. The use of the expected absolute deviation differs a bit from the expected squared deviation that we have used heretofore to characterize the spread of a distribution. Consider two examples. If $z \sim N[0, \sigma^2]$, then $E[|z|] = \text{Prob}[z < 0]\,E[-z \mid z < 0] + \text{Prob}[z \ge 0]\,E[z \mid z \ge 0] = \sigma\sqrt{2/\pi} = 0.7979\sigma$. (See Theorem 22.2.) So, finite expected absolute value is the same as finite second moment for the normal distribution. But if $z$ takes values $[0, n]$ with probabilities $[1 - 1/n, 1/n]$, then the variance of $z$ is $(n - 1)$, but $E[|z - \mu_z|]$ is $2 - 2/n$. For this case, finite expected absolute value occurs without finite expected second moment. These are different characterizations of the spread of the distribution.


    THEOREM D.10 Generalized Chebychev's Inequality
    If $x_n$ is a random variable and $c$ is a constant such that $E[|x_n - c|^r] < \infty$, and $\varepsilon$ is a positive constant, then $\text{Prob}(|x_n - c| > \varepsilon) \le E[|x_n - c|^r]/\varepsilon^r$.

    We have considered two cases of this result already: when $r = 1$, which is the Markov inequality, Theorem D.3, and $r = 2$, which is the Chebychev inequality we looked at first in Theorem D.2.

    THEOREM D.11 Convergence in rth Mean and Convergence in Probability
    If $x_n \xrightarrow{r.m.} c$ for any $r > 0$, then $x_n \xrightarrow{p} c$. The proof relies on Theorem D.10. By assumption, $\lim_{n\to\infty} E[|x_n - c|^r] = 0$, so for some $n$ sufficiently large, $E[|x_n - c|^r] < \infty$. By Theorem D.10, then, $\text{Prob}(|x_n - c| > \varepsilon) \le E[|x_n - c|^r]/\varepsilon^r$ for any $\varepsilon > 0$. The denominator of the fraction is a fixed constant and the numerator converges to zero by our initial assumption, so $\lim_{n\to\infty} \text{Prob}(|x_n - c| > \varepsilon) = 0$, which completes the proof.

    One implication of Theorem D.11 is that although convergence in mean square is a convenient way to prove convergence in probability, it is actually stronger than necessary, as we get the same result for any positive $r$.

    Finally, we note that we have now shown that both almost sure convergence and convergence in $r$th mean are stronger than convergence in probability; each implies the latter. But they, themselves, are different notions of convergence, and neither implies the other.

    DEFINITION D.5 Convergence of a Random Vector or Matrix
    Let $x_n$ denote a random vector and $X_n$ a random matrix, and $c$ and $C$ denote a vector and matrix of constants with the same dimensions as $x_n$ and $X_n$, respectively. All of the preceding notions of convergence can be extended to $(x_n, c)$ and $(X_n, C)$ by applying the results to the respective corresponding elements.

    D.2.3 CONVERGENCE OF FUNCTIONS

    A particularly convenient result is the following.

    THEOREM D.12 Slutsky Theorem
    For a continuous function $g(x_n)$ that is not a function of $n$,

    $$\text{plim}\, g(x_n) = g(\text{plim}\, x_n). \qquad \text{(D-6)}$$

    The generalization of Theorem D.12 to a function of several random variables is direct, as illustrated in the next example.

    Example D.3 Probability Limit of a Function of $\bar{x}$ and $s^2$

    In random sampling from a population with mean $\mu$ and variance $\sigma^2$, the exact expected value of $\bar{x}_n^2/s_n^2$ will be difficult, if not impossible, to derive. But, by the Slutsky theorem,
        $\text{plim}\, \bar{x}_n^2/s_n^2 = \mu^2/\sigma^2$.
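    The result in Example D.3 is easy to see in a small simulation. The following is a minimal sketch under stated assumptions: the population is taken to be normal with $\mu = 2$ and $\sigma = 1$ (so the probability limit is 4), and the function name `ratio_of_moments` and the seed are hypothetical choices made for this illustration.

```python
import random

def ratio_of_moments(n, mu=2.0, sigma=1.0, seed=0):
    """Compute xbar^2 / s^2 for one normal sample of size n."""
    rng = random.Random(seed)
    x = [rng.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    s2 = sum((v - xbar) ** 2 for v in x) / (n - 1)   # sample variance
    return xbar ** 2 / s2

# The ratio should settle near mu^2 / sigma^2 = 4 as n grows.
for n in (10, 1_000, 100_000):
    print(n, round(ratio_of_moments(n), 4))
```

    Nothing here requires the expected value of the ratio, which is the point of the example: only the probability limits of $\bar{x}_n$ and $s_n^2$ are used.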

    An application that highlights the difference between expectation and probability is suggested by the following useful relationships.

    THEOREM D.13 Inequalities for Expectations
    Jensen's Inequality. If $g(x_n)$ is a concave function of $x_n$, then $g(E[x_n]) \ge E[g(x_n)]$.
    Cauchy-Schwarz Inequality. For two random variables, $E[|xy|] \le \{E[x^2]\}^{1/2}\{E[y^2]\}^{1/2}$.

    Although the expected value of a function of $x_n$ may not equal the function of the expected value (by Jensen's inequality, it is no larger if the function is concave), the probability limit of the function is equal to the function of the probability limit. The Slutsky theorem highlights a comparison between the expectation of a random variable and its probability limit. Theorem D.12 extends directly in two important directions. First, though stated in terms of convergence in probability, the same set of results applies to convergence in rth mean and almost sure convergence. Second, so long as the functions are continuous, the Slutsky theorem can be extended to vector or matrix valued functions of random scalars, vectors, or matrices. Some specific implications of the Slutsky theorem are now summarized.

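    THEOREM D.14 Rules for Probability Limits
    If $x_n$ and $y_n$ are random variables with $\text{plim}\, x_n = c$ and $\text{plim}\, y_n = d$, then
        $\text{plim}(x_n + y_n) = c + d$,    (sum rule)    (D-7)
        $\text{plim}\, x_n y_n = cd$,    (product rule)    (D-8)
        $\text{plim}\, x_n/y_n = c/d$ if $d \ne 0$.    (ratio rule)    (D-9)
    If $\mathbf{W}_n$ is a matrix whose elements are random variables and if $\text{plim}\, \mathbf{W}_n = \boldsymbol{\Omega}$, then
        $\text{plim}\, \mathbf{W}_n^{-1} = \boldsymbol{\Omega}^{-1}$.    (matrix inverse rule)    (D-10)
    If $\mathbf{X}_n$ and $\mathbf{Y}_n$ are random matrices with $\text{plim}\, \mathbf{X}_n = \mathbf{A}$ and $\text{plim}\, \mathbf{Y}_n = \mathbf{B}$, then
        $\text{plim}\, \mathbf{X}_n \mathbf{Y}_n = \mathbf{A}\mathbf{B}$.    (matrix product rule)    (D-11)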

    D.2.4 CONVERGENCE TO A RANDOM VARIABLE

    The preceding has dealt with conditions under which a random variable converges to a constant, for example, the way that a sample mean converges to the population mean. In order to develop a theory for the behavior of estimators, as a prelude to the discussion of limiting distributions, we now consider cases in which a random variable converges not to a constant, but to another random variable. These results will actually subsume those in the preceding section, as a constant may always be viewed as a degenerate random variable, that is, one with zero variance.

    DEFINITION D.6 Convergence in Probability to a Random Variable
    The random variable $x_n$ converges in probability to the random variable $x$ if $\lim_{n\to\infty} \text{Prob}(|x_n - x| > \varepsilon) = 0$ for any positive $\varepsilon$.

    As before, we write $\text{plim}\, x_n = x$ to denote this case. The interpretation (at least the intuition) of this type of convergence is different when $x$ is a random variable. The notion of closeness defined here relates not to the concentration of the mass of the probability mechanism generating $x_n$ at a point $c$, but to the closeness of that probability mechanism to that of $x$. One can think of this as a convergence of the CDF of $x_n$ to that of $x$.

    DEFINITION D.7 Almost Sure Convergence to a Random Variable
    The random variable $x_n$ converges almost surely to the random variable $x$ if and only if $\lim_{n\to\infty} \text{Prob}(|x_i - x| > \varepsilon \text{ for all } i \ge n) = 0$ for all $\varepsilon > 0$.

    DEFINITION D.8 Convergence in rth Mean to a Random Variable
    The random variable $x_n$ converges in rth mean to the random variable $x$ if and only if $\lim_{n\to\infty} E[|x_n - x|^r] = 0$. This is labeled $x_n \xrightarrow{r.m.} x$. As before, the case $r = 2$ is labeled convergence in mean square.

    Once again, we have to revise our understanding of convergence when convergence is to a random variable.

    THEOREM D.15 Convergence of Moments
    Suppose $x_n \xrightarrow{r.m.} x$ and $E[|x|^r]$ is finite. Then, $\lim_{n\to\infty} E[|x_n|^r] = E[|x|^r]$.

    Theorem D.15 raises an interesting question. Suppose we let $r$ grow, and suppose that $x_n \xrightarrow{r.m.} x$ and, in addition, all moments are finite. If this holds for any $r$, do we conclude that these random variables have the same distribution? The answer to this longstanding problem in probability theory, the problem of the sequence of moments, is no. The sequence of moments does not uniquely determine the distribution. Although convergence in rth mean and almost surely still both imply convergence in probability, it remains true, even with convergence to a random variable instead of a constant, that these are different forms of convergence.

    D.2.5 CONVERGENCE IN DISTRIBUTION: LIMITING DISTRIBUTIONS

    A second form of convergence is convergence in distribution. Let $x_n$ be a sequence of random variables indexed by the sample size, and assume that $x_n$ has cdf $F_n(x)$.

    DEFINITION D.9 Convergence in Distribution
    $x_n$ converges in distribution to a random variable $x$ with cdf $F(x)$ if $\lim_{n\to\infty} |F_n(x_n) - F(x)| = 0$ at all continuity points of $F(x)$.

    This statement is about the probability distribution associated with $x_n$; it does not imply that $x_n$ converges at all. To take a trivial example, suppose that the exact distribution of the random variable $x_n$ is
        $\text{Prob}(x_n = 1) = \frac{1}{2} + \frac{1}{n+1}$,    $\text{Prob}(x_n = 2) = \frac{1}{2} - \frac{1}{n+1}$.
    As $n$ increases without bound, the two probabilities converge to $\frac{1}{2}$, but $x_n$ does not converge to a constant.

    DEFINITION D.10 Limiting Distribution
    If $x_n$ converges in distribution to $x$, where $F_n(x_n)$ is the cdf of $x_n$, then $F(x)$ is the limiting distribution of $x_n$. This is written $x_n \xrightarrow{d} x$.

    The limiting distribution is often given in terms of the pdf, or simply the parametric family. For example, "the limiting distribution of $x_n$ is standard normal." Convergence in distribution can be extended to random vectors and matrices, though not in the element by element manner that we extended the earlier convergence forms. The reason is that convergence in distribution is a property of the CDF of the random variable, not the variable itself. Thus, we can obtain a convergence result analogous to that in Definition D.9 for vectors or matrices by applying the definition to the joint CDF for the elements of the vector or matrices. Thus, $\mathbf{x}_n \xrightarrow{d} \mathbf{x}$ if $\lim_{n\to\infty} |F_n(\mathbf{x}_n) - F(\mathbf{x})| = 0$, and likewise for a random matrix.
    Example D.4 Limiting Distribution of $t_{n-1}$

    Consider a sample of size $n$ from a standard normal distribution. A familiar inference problem is the test of the hypothesis that the population mean is zero. The test statistic usually used is the t statistic:
        $t_{n-1} = \frac{\bar{x}_n}{s_n/\sqrt{n}}$,
    where
        $s_n^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}{n - 1}$.
    The exact distribution of the random variable $t_{n-1}$ is t with $n - 1$ degrees of freedom. The density is different for every $n$:
        $f(t_{n-1}) = \frac{\Gamma(n/2)}{\Gamma[(n-1)/2]}\,[(n-1)\pi]^{-1/2}\left[1 + \frac{t_{n-1}^2}{n-1}\right]^{-n/2}$,    (D-12)
    as is the cdf, $F_{n-1}(t) = \int_{-\infty}^{t} f_{n-1}(x)\,dx$. This distribution has mean zero and variance $(n-1)/(n-3)$. As $n$ grows to infinity, $t_{n-1}$ converges to the standard normal, which is written
        $t_{n-1} \xrightarrow{d} N[0, 1]$.
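    The convergence in Example D.4 can be seen directly from the densities. The sketch below evaluates the t density in (D-12), written in terms of the degrees of freedom, and compares it with the standard normal density on a grid. The function names, the grid from -4 to 4, and the chosen sample sizes are assumptions made for this illustration.

```python
import math

def t_pdf(t, df):
    """Density of Student's t with df degrees of freedom, computed via log-gamma."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def normal_pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)

for n in (5, 30, 1000):
    df = n - 1
    # Largest gap between the two densities on a grid from -4 to 4.
    gap = max(abs(t_pdf(k / 10, df) - normal_pdf(k / 10)) for k in range(-40, 41))
    print(f"n={n:5d}  variance={(n - 1) / (n - 3):.4f}  max density gap={gap:.5f}")
```

    Both the variance $(n-1)/(n-3)$ and the density gap shrink toward their limiting values (one and zero) as $n$ grows.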

    DEFINITION D.11 Limiting Mean and Variance
    The limiting mean and variance of a random variable are the mean and variance of the limiting distribution, assuming that the limiting distribution and its moments exist.

    For the random variable with $t[n]$ distribution, the exact mean and variance are zero and $n/(n-2)$, whereas the limiting mean and variance are zero and one. The example might suggest that the limiting mean and variance are zero and one; that is, that the moments of the limiting distribution are the ordinary limits of the moments of the finite sample distributions. This situation is almost always true, but it need not be. It is possible to construct examples in which the exact moments do not even exist, even though the moments of the limiting distribution are well defined. [3] Even in such cases, we can usually derive the mean and variance of the limiting distribution. Limiting distributions, like probability limits, can greatly simplify the analysis of a problem. Some results that combine the two concepts are as follows. [4]

    THEOREM D.16 Rules for Limiting Distributions
    1. If $x_n \xrightarrow{d} x$ and $\text{plim}\, y_n = c$, then
        $x_n y_n \xrightarrow{d} cx$,    (D-13)
    which means that the limiting distribution of $x_n y_n$ is the distribution of $cx$. Also,
        $x_n + y_n \xrightarrow{d} x + c$,    (D-14)
        $x_n/y_n \xrightarrow{d} x/c$, if $c \ne 0$.    (D-15)
    2. If $x_n \xrightarrow{d} x$ and $g(x_n)$ is a continuous function, then
        $g(x_n) \xrightarrow{d} g(x)$.    (D-16)
    This result is analogous to the Slutsky theorem for probability limits. For an example, consider the $t_n$ random variable discussed earlier. The exact distribution of $t_n^2$ is $F[1, n]$. But as $n \to \infty$, $t_n$ converges to a standard normal variable. According to this result, the limiting distribution of $t_n^2$ will be that of the square of a standard normal, which is chi-squared with one degree of freedom. We conclude, therefore, that
        $F[1, n] \xrightarrow{d} \text{chi-squared}[1]$.    (D-17)
    We encountered this result in our earlier discussion of limiting forms of the standard normal family of distributions.
    3. If $y_n$ has a limiting distribution and $\text{plim}(x_n - y_n) = 0$, then $x_n$ has the same limiting distribution as $y_n$.

    [3] See, for example, Maddala (1977a, p. 150).
    [4] For proofs and further discussion, see, for example, Greenberg and Webster (1983).

    The third result in Theorem D.16 combines convergence in distribution and in probability. The second result can be extended to vectors and matrices.

    Example D.5 The F Distribution

    Suppose that $\mathbf{t}_{1,n}$ and $\mathbf{t}_{2,n}$ are a $K \times 1$ and an $M \times 1$ random vector of variables whose components are independent, with each distributed as t with $n$ degrees of freedom. Then, as we saw in the preceding, for any component in either random vector, the limiting distribution is standard normal, so for the entire vector, $\mathbf{t}_{j,n} \xrightarrow{d} \mathbf{z}_j$, a vector of independent standard normally distributed variables. The results so far show that
        $\frac{(\mathbf{t}_{1,n}'\mathbf{t}_{1,n})/K}{(\mathbf{t}_{2,n}'\mathbf{t}_{2,n})/M} \xrightarrow{d} F[K, M]$.

    Finally, a specific case of result 2 in Theorem D.16 produces a tool known as the Cramer-Wold device.

    THEOREM D.17 Cramer-Wold Device
    If $\mathbf{x}_n \xrightarrow{d} \mathbf{x}$, then $\mathbf{c}'\mathbf{x}_n \xrightarrow{d} \mathbf{c}'\mathbf{x}$ for all conformable vectors $\mathbf{c}$ with real valued elements.

    By allowing $\mathbf{c}$ to be a vector with just a one in a particular position and zeros elsewhere, we see that convergence in distribution of a random vector $\mathbf{x}_n$ to $\mathbf{x}$ does imply that each component does likewise.

    D.2.6 CENTRAL LIMIT THEOREMS

    We are ultimately interested in finding a way to describe the statistical properties of estimators when their exact distributions are unknown. The concepts of consistency and convergence in probability are important. But the theory of limiting distributions given earlier is not yet adequate. We rarely deal with estimators that are not consistent for something, though perhaps not always the parameter we are trying to estimate. As such, if $\text{plim}\, \hat{\theta}_n = \theta$, then $\hat{\theta}_n \xrightarrow{d} \theta$. That is, the limiting distribution of $\hat{\theta}_n$ is a spike. This is not very informative, nor is it at all what we have in mind when we speak of the statistical properties of an estimator. (To endow our finite sample estimator $\hat{\theta}_n$ with the zero sampling variance of the spike at $\theta$ would be optimistic in the extreme.)

    As an intermediate step, then, to a more reasonable description of the statistical properties of an estimator, we use a stabilizing transformation of the random variable to one that does have a well defined limiting distribution. To jump to the most common application, whereas $\text{plim}\, \hat{\theta}_n = \theta$, we often find that
        $z_n = \sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} f(z)$,
    where $f(z)$ is a well defined distribution with a mean and a positive variance. An estimator which has this property is said to be root-n consistent. The single most important theorem in econometrics provides an application of this proposition. A basic form of the theorem is as follows.

    THEOREM D.18 Lindeberg-Levy Central Limit Theorem (Univariate)
    If $x_1, \ldots, x_n$ are a random sample from a probability distribution with finite mean $\mu$ and finite variance $\sigma^2$ and $\bar{x}_n = (1/n)\sum_{i=1}^{n} x_i$, then
        $\sqrt{n}(\bar{x}_n - \mu) \xrightarrow{d} N[0, \sigma^2]$.
    A proof appears in Rao (1973, p. 127).

    The result is quite remarkable as it holds regardless of the form of the parent distribution. For a striking example, return to Figure C.2. The distribution from which the data were drawn in that figure does not even remotely resemble a normal distribution. In samples of only four observations the force of the central limit theorem is clearly visible in the sampling distribution of the means. The sampling experiment in Example D.6 shows the effect in a systematic demonstration of the result. The Lindeberg-Levy theorem is one of several forms of this extremely powerful result. For our purposes, an important extension allows us to relax the assumption of equal variances. The Lindeberg-Feller form of the central limit theorem is the centerpiece of most of our analysis in econometrics.

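    THEOREM D.19 Lindeberg-Feller Central Limit Theorem (with Unequal Variances)
    Suppose that $\{x_i\}$, $i = 1, \ldots, n$, is a sequence of independent random variables with finite means $\mu_i$ and finite positive variances $\sigma_i^2$. Let
        $\bar{\mu}_n = \frac{1}{n}(\mu_1 + \mu_2 + \cdots + \mu_n)$  and  $\bar{\sigma}_n^2 = \frac{1}{n}(\sigma_1^2 + \sigma_2^2 + \cdots + \sigma_n^2)$.
    If no single term dominates this average variance, which we could state as $\lim_{n\to\infty} \max(\sigma_i)/(n\bar{\sigma}_n) = 0$, and if the average variance converges to a finite constant, $\bar{\sigma}^2 = \lim_{n\to\infty} \bar{\sigma}_n^2$, then
        $\sqrt{n}(\bar{x}_n - \bar{\mu}_n) \xrightarrow{d} N[0, \bar{\sigma}^2]$.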

    FIGURE D.2 The Exponential Distribution. [Plot: density of the exponential distribution with mean 1.5; vertical axis Density, horizontal axis from 0 to 10.]

    In practical terms, the theorem states that sums of random variables, regardless of their form, will tend to be normally distributed. The result is yet more remarkable in that it does not require the variables in the sum to come from the same underlying distribution. It requires, essentially, only that the mean be a mixture of many random variables, none of which is large compared with their sum. Since nearly all the estimators we construct in econometrics fall under the purview of the central limit theorem, it is obviously an important result.

    Example D.6 The Lindeberg-Levy Central Limit Theorem

    We'll use a sampling experiment to demonstrate the operation of the central limit theorem. Consider random sampling from the exponential distribution with mean 1.5 (this is the setting used in Example C.4). The density is shown in Figure D.2. We've drawn 1,000 samples of 3, 6, and 20 observations from this population and computed the sample means for each. For each mean, we then computed $z_{in} = \sqrt{n}(\bar{x}_{in} - \mu)$, where $i = 1, \ldots, 1{,}000$ and $n$ is 3, 6, or 20. The three rows of figures show histograms of the observed samples of sample means and kernel density estimates of the underlying distributions for the three samples of transformed means.

    Proof of the Lindeberg-Feller theorem requires some quite intricate mathematics [see Loeve (1977), for example] that are well beyond the scope of our work here. We do note an important consideration in this theorem. The result rests on a condition known as the Lindeberg condition. The sample mean computed in the theorem is a mixture of random variables from possibly different distributions. The Lindeberg condition, in words, states that the contribution of the tail areas of these underlying distributions to the variance of the sum must be negligible in the limit. The condition formalizes the assumption in Theorem D.19 that the average variance be positive and not be dominated by any single term. [For an intuitively crafted mathematical discussion of this condition, see White (2001, pp. 117-118).] The condition is essentially impossible to verify in practice, so it is useful to have a simpler version of the theorem which encompasses it.

    FIGURE D.3 The Central Limit Theorem. [Three rows of panels, for n = 3, 6, and 20. Left panels: Histogram for Variable Z3, Z6, Z20 (vertical axis Frequency, horizontal axis from -4.000 to 4.000). Right panels: Kernel Density Estimate for Z3, Z6, Z20 (vertical axis Density).]
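    The sampling experiment of Example D.6 can be reproduced in a few lines. What follows is a sketch of that kind of experiment, not the code used for the figures: the seed, the function names, and the use of sample skewness as a summary (in place of the histograms and kernel density estimates) are assumptions made here. The transformed means are $z_{in} = \sqrt{n}(\bar{x}_{in} - \mu)$ with $\mu = 1.5$, so their variance should be near $\sigma^2 = 2.25$ for every $n$, while the skewness inherited from the exponential parent should fade as $n$ grows.

```python
import math
import random

MU = 1.5   # mean of the exponential population; its variance is MU**2 = 2.25

def transformed_means(n, reps=1000, seed=0):
    """Draw `reps` samples of size n and return z = sqrt(n) * (xbar - MU)."""
    rng = random.Random(seed)
    z = []
    for _ in range(reps):
        # expovariate takes the rate, which is 1 / mean.
        xbar = sum(rng.expovariate(1.0 / MU) for _ in range(n)) / n
        z.append(math.sqrt(n) * (xbar - MU))
    return z

def moments(z):
    """Return the mean, variance, and skewness of a list of numbers."""
    m = sum(z) / len(z)
    v = sum((a - m) ** 2 for a in z) / len(z)
    skew = sum((a - m) ** 3 for a in z) / len(z) / v ** 1.5
    return m, v, skew

for n in (3, 6, 20):
    m, v, skew = moments(transformed_means(n))
    print(f"n={n:2d}  mean={m:6.3f}  variance={v:5.3f}  skewness={skew:5.3f}")
```

    For means of exponential draws the population skewness is $2/\sqrt{n}$, so the simulated skewness should fall from a little over one at $n = 3$ toward roughly 0.45 at $n = 20$, mirroring the way the histograms become more symmetric.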


    THEOREM D.20 Liapounov Central Limit Theorem
    Suppose that $\{x_i\}$ is a sequence of independent random variables with finite means $\mu_i$ and finite positive variances $\sigma_i^2$ such that $E[|x_i - \mu_i|^{2+\delta}]$ is finite for some $\delta > 0$. If $\bar{\sigma}_n$ is positive and finite for all $n$ sufficiently large, then
        $\sqrt{n}(\bar{x}_n - \bar{\mu}_n)/\bar{\sigma}_n \xrightarrow{d} N[0, 1]$.

    This version of the central limit theorem requires only that moments slightly larger than two be finite. Note the distinction between the laws of large numbers in Theorems D.5 and D.6 and the central limit theorems. Neither asserts that sample means tend to normality. Sample means (that is, the distributions of them) converge to spikes at the true mean. It is the transformation of the mean, $\sqrt{n}(\bar{x}_n - \mu)/\sigma$, that converges to standard normality. To see this at work, if you have access to the necessary software, you might try reproducing Example D.6 using the raw means, $\bar{x}_{in}$. What do you expect to observe?

    For later purposes, we will require multivariate versions of these theorems. Proofs of the following may be found, for example, in Greenberg and Webster (1983) or Rao (1973) and references cited there.

    THEOREM D.18A Multivariate Lindeberg-Levy Central Limit Theorem
    If $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are a random sample from a multivariate distribution with finite mean vector $\boldsymbol{\mu}$ and finite positive definite covariance matrix $\mathbf{Q}$, then
        $\sqrt{n}(\bar{\mathbf{x}}_n - \boldsymbol{\mu}) \xrightarrow{d} N[\mathbf{0}, \mathbf{Q}]$,
    where
        $\bar{\mathbf{x}}_n = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i$.

    In order to get from D.18 to D.18A (and D.19 to D.19A), we need to add a step. Theorem D.18 applies to the individual elements of the vector. A vector has a multivariate normal distribution if the individual elements are normally distributed and if every linear combination is normally distributed. We can use Theorem D.18 (D.19) for the individual terms and Theorem D.17 to establish that linear combinations behave likewise. This establishes the extensions.

    The extension of the Lindeberg-Feller theorem to unequal covariance matrices requires some intricate mathematics. The following is an informal statement of the relevant conditions. Further discussion and references appear in Fomby, Hill, and Johnson (1984) and Greenberg and Webster (1983).


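    THEOREM D.19A Multivariate Lindeberg-Feller Central Limit Theorem
    Suppose that $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are a sample of random vectors such that $E[\mathbf{x}_i] = \boldsymbol{\mu}_i$, $\text{Var}[\mathbf{x}_i] = \mathbf{Q}_i$, and all mixed third moments of the multivariate distribution are finite. Let
        $\bar{\boldsymbol{\mu}}_n = \frac{1}{n}\sum_{i=1}^{n} \boldsymbol{\mu}_i$,    $\bar{\mathbf{Q}}_n = \frac{1}{n}\sum_{i=1}^{n} \mathbf{Q}_i$.
    We assume that
        $\lim_{n\to\infty} \bar{\mathbf{Q}}_n = \mathbf{Q}$,
    where $\mathbf{Q}$ is a finite, positive definite matrix, and that for every $i$,
        $\lim_{n\to\infty} (n\bar{\mathbf{Q}}_n)^{-1}\mathbf{Q}_i = \lim_{n\to\infty} \left(\sum_{i=1}^{n} \mathbf{Q}_i\right)^{-1}\mathbf{Q}_i = \mathbf{0}$.
    We allow the means of the random vectors to differ, although in the cases that we will analyze, they will generally be identical. The second assumption states that individual components of the sum must be finite and diminish in significance. There is also an implicit assumption that the sum of matrices is nonsingular. Since the limiting matrix is nonsingular, the assumption must hold for large enough $n$, which is all that concerns us here. With these in place, the result is
        $\sqrt{n}(\bar{\mathbf{x}}_n - \bar{\boldsymbol{\mu}}_n) \xrightarrow{d} N[\mathbf{0}, \mathbf{Q}]$.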

    D.2.7 THE DELTA METHOD

    At several points in Appendix C, we used a linear Taylor series approximation to analyze the distribution and moments of a random variable. We are now able to justify this usage. We complete the development of Theorem D.12 (probability limit of a function of a random variable), Theorem D.16 (2) (limiting distribution of a function of a random variable), and the central limit theorems, with a useful result that is known as "the delta method." For a single random variable (sample mean or otherwise), we have the following theorem.

    THEOREM D.21 Limiting Normal Distribution of a Function
    If $\sqrt{n}(z_n - \mu) \xrightarrow{d} N[0, \sigma^2]$ and if $g(z_n)$ is a continuous and differentiable function not involving $n$, then
        $\sqrt{n}[g(z_n) - g(\mu)] \xrightarrow{d} N[0, \{g'(\mu)\}^2\sigma^2]$.    (D-18)

    Notice that the mean and variance of the limiting distribution are the mean and variance of the linear Taylor series approximation:
        $g(z_n) \approx g(\mu) + g'(\mu)(z_n - \mu)$.
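    A short simulation shows the delta method at work. This is a minimal sketch under assumptions chosen for the illustration, none of which come from the text: $z_n$ is the mean of $n = 400$ draws from $N[2, 1]$, the function is $g(z) = z^2$ (so $g'(\mu) = 2\mu = 4$), and the experiment is repeated 2,000 times with a fixed seed. The simulated variance of $\sqrt{n}[g(z_n) - g(\mu)]$ should be close to $\{g'(\mu)\}^2\sigma^2 = 16$.

```python
import math
import random

def delta_method_check(n=400, reps=2000, mu=2.0, sigma=1.0, seed=1):
    """Simulate sqrt(n) * (g(zbar) - g(mu)) for g(z) = z**2; return its sample variance."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        zbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        draws.append(math.sqrt(n) * (zbar ** 2 - mu ** 2))
    mean = sum(draws) / reps
    return sum((d - mean) ** 2 for d in draws) / (reps - 1)

simulated = delta_method_check()
predicted = (2 * 2.0) ** 2 * 1.0 ** 2   # {g'(mu)}^2 * sigma^2 with g'(mu) = 2 * mu
print(f"simulated variance = {simulated:.2f}, delta-method variance = {predicted:.2f}")
```

    The small remaining gap comes from simulation noise and from the second-order term of the Taylor expansion, which the linear approximation drops.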

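    The multivariate version of this theorem will be used at many points in the text.

    THEOREM D.21A Limiting Normal Distribution of a Set of Functions
    If $\mathbf{z}_n$ is a $K \times 1$ sequence of vector-valued random variables such that $\sqrt{n}(\mathbf{z}_n - \boldsymbol{\mu}) \xrightarrow{d} N[\mathbf{0}, \boldsymbol{\Sigma}]$ and if $\mathbf{c}(\mathbf{z}_n)$ is a set of $J$ continuous functions of $\mathbf{z}_n$ not involving $n$, then
        $\sqrt{n}[\mathbf{c}(\mathbf{z}_n) - \mathbf{c}(\boldsymbol{\mu})] \xrightarrow{d} N[\mathbf{0}, \mathbf{C}(\boldsymbol{\mu})\boldsymbol{\Sigma}\mathbf{C}(\boldsymbol{\mu})']$,    (D-19)
    where $\mathbf{C}(\boldsymbol{\mu})$ is the $J \times K$ matrix $\partial\mathbf{c}(\boldsymbol{\mu})/\partial\boldsymbol{\mu}'$. The $j$th row of $\mathbf{C}(\boldsymbol{\mu})$ is the vector of partial derivatives of the $j$th function with respect to $\boldsymbol{\mu}'$.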

    D.3 ASYMPTOTIC DISTRIBUTIONS

    The theory of limiting distributions is only a means to an end. We are interested in the behavior of the estimators themselves. The limiting distributions obtained through the central limit theorem all involve unknown parameters, generally the ones we are trying to estimate. Moreover, our samples are always finite. Thus, we depart from the limiting distributions to derive the asymptotic distributions of the estimators.

    DEFINITION D.12 Asymptotic Distribution
    An asymptotic distribution is a distribution that is used to approximate the true finite sample distribution of a random variable. [5]

    By far the most common means of formulating an asymptotic distribution (at least by econometricians) is to construct it from the known limiting distribution of a function of the random variable. If
        $\sqrt{n}[(\bar{x}_n - \mu)/\sigma] \xrightarrow{d} N[0, 1]$,
    then approximately, or asymptotically, $\bar{x}_n \sim N[\mu, \sigma^2/n]$, which we write as
        $\bar{x} \overset{a}{\sim} N[\mu, \sigma^2/n]$.
    The statement "$\bar{x}_n$ is asymptotically normally distributed with mean $\mu$ and variance $\sigma^2/n$" says only that this normal distribution provides an approximation to the true distribution, not that the true distribution is exactly normal.

    [5] We depart from some other treatments [e.g., White (2001), Hayashi (2000, p. 90)] at this point, because they make no distinction between an asymptotic distribution and the limiting distribution, although the treatments are largely along the lines discussed here. In the interest of maintaining consistency of the discussion, we prefer to retain the sharp distinction and derive the asymptotic distribution of an estimator, $t$, by first obtaining the limiting distribution of $\sqrt{n}(t - \theta)$. By our construction, the limiting distribution of $t$ is degenerate, whereas the asymptotic distribution of $\sqrt{n}(t - \theta)$ is not useful.

    FIGURE D.4 True Versus Asymptotic Distribution. [Plot: f(x) against x from 0 to 3.0, vertical axis from 0 to 1.75; two curves labeled Exact distribution and Asymptotic distribution.]
    Example D.7 Asymptotic Distribution of the Mean of an Exponential Sample

    In sampling from an exponential distribution with parameter $\theta$, the exact distribution of $\bar{x}_n$ is that of $\theta/(2n)$ times a chi-squared variable with $2n$ degrees of freedom. The asymptotic distribution is $N[\theta, \theta^2/n]$. The exact and asymptotic distributions are shown in Figure D.4 for the case of $\theta = 1$ and $n = 16$.

    Extending the definition, suppose that $\hat{\boldsymbol{\theta}}_n$ is an estimator of the parameter vector $\boldsymbol{\theta}$. The asymptotic distribution of the vector $\hat{\boldsymbol{\theta}}_n$ is obtained from the limiting distribution:
        $\sqrt{n}(\hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta}) \xrightarrow{d} N[\mathbf{0}, \mathbf{V}]$    (D-20)
    implies that
        $\hat{\boldsymbol{\theta}}_n \overset{a}{\sim} N\left[\boldsymbol{\theta}, \frac{1}{n}\mathbf{V}\right]$.    (D-21)
    This notation is read "$\hat{\boldsymbol{\theta}}_n$ is asymptotically normally distributed, with mean vector $\boldsymbol{\theta}$ and covariance matrix $(1/n)\mathbf{V}$." The covariance matrix of the asymptotic distribution is the asymptotic covariance matrix and is denoted
        $\text{Asy. Var}[\hat{\boldsymbol{\theta}}_n] = \frac{1}{n}\mathbf{V}$.
    Note, once again, the logic used to reach the result: (D-20) holds exactly as $n \to \infty$. We assume that it holds approximately for finite $n$, which leads to (D-21).
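    Returning to Example D.7, the exact and asymptotic densities can be compared numerically. The sketch below uses the fact that $\theta/(2n)$ times a chi-squared variable with $2n$ degrees of freedom is a gamma variable with shape $n$ and scale $\theta/n$; the function names and the evaluation points are assumptions made for this illustration.

```python
import math

def exact_pdf(x, n, theta=1.0):
    """Density of the mean of n exponential draws with mean theta (gamma, shape n)."""
    if x <= 0:
        return 0.0
    rate = n / theta
    log_f = n * math.log(rate) + (n - 1) * math.log(x) - rate * x - math.lgamma(n)
    return math.exp(log_f)

def asymptotic_pdf(x, n, theta=1.0):
    """Density of the N[theta, theta^2 / n] approximation."""
    sd = theta / math.sqrt(n)
    return math.exp(-0.5 * ((x - theta) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# The case shown in Figure D.4: theta = 1 and n = 16.
for x in (0.5, 0.75, 1.0, 1.25, 1.5):
    print(f"x={x:4.2f}  exact={exact_pdf(x, 16):.4f}  asymptotic={asymptotic_pdf(x, 16):.4f}")
```

    The two densities nearly agree at the center but differ in the tails: the exact density is skewed to the right, which the symmetric normal approximation cannot capture.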

    DEFINITION D.13 Asymptotic Normality and Asymptotic Efficiency
    An estimator $\hat{\boldsymbol{\theta}}_n$ is asymptotically normal if (D-20) holds. The estimator is asymptotically efficient if the covariance matrix of any other consistent, asymptotically normally distributed estimator exceeds $(1/n)\mathbf{V}$ by a nonnegative definite matrix.

    For most estimation problems, these are the criteria used to choose an estimator.
    Example D.8 Asymptotic Inefficiency of the Median in Normal Sampling

    In sampling from a normal distribution with mean $\mu$ and variance $\sigma^2$, both the mean $\bar{x}_n$ and the median $M_n$ of the sample are consistent estimators of $\mu$. Since the limiting distributions of both estimators are spikes at $\mu$, they can only be compared on the basis of their asymptotic properties. The necessary results are
        $\bar{x}_n \overset{a}{\sim} N[\mu, \sigma^2/n]$  and  $M_n \overset{a}{\sim} N[\mu, (\pi/2)\sigma^2/n]$.    (D-22)
    Therefore, the mean is more efficient by a factor of $\pi/2$. (But see Examples E.1 and E.2 for a finite sample result.)
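    The factor $\pi/2$ in Example D.8 can be checked by simulation. The following sketch is an illustration under assumptions chosen here: standard normal data, samples of size 101, 4,000 replications, a fixed seed, and a hypothetical helper name `mean_vs_median`. It reports the ratio of the sampling variance of the median to that of the mean.

```python
import random
import statistics

def mean_vs_median(n=101, reps=4000, seed=2):
    """Return Var[median] / Var[mean] across simulated standard normal samples."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(sum(x) / n)
        medians.append(statistics.median(x))
    return statistics.pvariance(medians) / statistics.pvariance(means)

ratio = mean_vs_median()
print(f"Var[median] / Var[mean] = {ratio:.3f}  (asymptotic value pi/2 is about 1.571)")
```

    With a sample size this large the simulated ratio should sit close to $\pi/2$; the asymptotic result says nothing about how the comparison behaves in very small samples.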
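    D.3.1 ASYMPTOTIC DISTRIBUTION OF A NONLINEAR FUNCTION

    Theorems D.12 and D.14 for functions of a random variable have counterparts in asymptotic distributions.

    THEOREM D.22 Asymptotic Distribution of a Nonlinear Function
    If $\sqrt{n}(\hat{\theta}_n - \theta) \xrightarrow{d} N[0, \sigma^2]$ and if $g(\theta)$ is a continuous function not involving $n$, then
        $g(\hat{\theta}_n) \overset{a}{\sim} N[g(\theta), (1/n)\{g'(\theta)\}^2\sigma^2]$.
    If $\hat{\boldsymbol{\theta}}_n$ is a vector of parameter estimators such that $\hat{\boldsymbol{\theta}}_n \overset{a}{\sim} N[\boldsymbol{\theta}, (1/n)\mathbf{V}]$ and if $\mathbf{c}(\boldsymbol{\theta})$ is a set of $J$ continuous functions not involving $n$, then
        $\mathbf{c}(\hat{\boldsymbol{\theta}}_n) \overset{a}{\sim} N[\mathbf{c}(\boldsymbol{\theta}), (1/n)\mathbf{C}(\boldsymbol{\theta})\mathbf{V}\mathbf{C}(\boldsymbol{\theta})']$,
    where $\mathbf{C}(\boldsymbol{\theta}) = \partial\mathbf{c}(\boldsymbol{\theta})/\partial\boldsymbol{\theta}'$.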

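    Example D.9 Asymptotic Distribution of a Function of Two Estimators

    Suppose that $b_n$ and $t_n$ are estimators of parameters $\beta$ and $\theta$ such that
        $\begin{bmatrix} b_n \\ t_n \end{bmatrix} \overset{a}{\sim} N\left[\begin{pmatrix} \beta \\ \theta \end{pmatrix}, \begin{pmatrix} \sigma_{\beta\beta} & \sigma_{\beta\theta} \\ \sigma_{\theta\beta} & \sigma_{\theta\theta} \end{pmatrix}\right]$.
    Find the asymptotic distribution of $c_n = b_n/(1 - t_n)$. Let $\gamma = \beta/(1 - \theta)$. By the Slutsky theorem, $c_n$ is consistent for $\gamma$. We shall require
        $\frac{\partial\gamma}{\partial\beta} = \frac{1}{1 - \theta} = \gamma_\beta$,    $\frac{\partial\gamma}{\partial\theta} = \frac{\beta}{(1 - \theta)^2} = \gamma_\theta$.
    Let $\boldsymbol{\Sigma}$ be the $2 \times 2$ asymptotic covariance matrix given previously. Then the asymptotic variance of $c_n$ is
        $\text{Asy. Var}[c_n] = (\gamma_\beta \;\; \gamma_\theta)\,\boldsymbol{\Sigma}\begin{pmatrix} \gamma_\beta \\ \gamma_\theta \end{pmatrix} = \sigma_{\beta\beta}\gamma_\beta^2 + \sigma_{\theta\theta}\gamma_\theta^2 + 2\sigma_{\beta\theta}\gamma_\beta\gamma_\theta$,
    which is the variance of the linear Taylor series approximation:
        $\hat{\gamma}_n \approx \gamma + \gamma_\beta(b_n - \beta) + \gamma_\theta(t_n - \theta)$.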
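    The calculation in Example D.9 translates directly into code. The function below is a sketch of that one formula; its name, argument names, and the numerical inputs are hypothetical and were picked only to make the arithmetic easy to follow.

```python
def asy_var_ratio(beta, theta, s_bb, s_tt, s_bt):
    """Delta-method asymptotic variance of c = b / (1 - t).

    s_bb, s_tt, s_bt are the elements of the asymptotic covariance matrix
    of (b, t); beta and theta are the points at which derivatives are taken.
    """
    g_b = 1.0 / (1.0 - theta)          # d gamma / d beta
    g_t = beta / (1.0 - theta) ** 2    # d gamma / d theta
    return s_bb * g_b ** 2 + s_tt * g_t ** 2 + 2.0 * s_bt * g_b * g_t

# Hypothetical inputs: beta = 1, theta = 0.5, so the derivatives are 2 and 4.
print(asy_var_ratio(beta=1.0, theta=0.5, s_bb=0.04, s_tt=0.01, s_bt=0.0))
print(asy_var_ratio(beta=1.0, theta=0.5, s_bb=0.04, s_tt=0.01, s_bt=0.005))
```

    In practice $\beta$ and $\theta$ are unknown, so the derivatives would be evaluated at the estimates $b_n$ and $t_n$. Note how the variance grows rapidly as $\theta$ approaches one, since both derivatives have $1 - \theta$ in the denominator.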

    D.3.2 ASYMPTOTIC EXPECTATIONS

    The asymptotic mean and variance of a random variable are usually the mean and variance of the asymptotic distribution. Thus, for an estimator with the limiting distribution defined in
        $\sqrt{n}(\hat{\boldsymbol{\theta}}_n - \boldsymbol{\theta}) \xrightarrow{d} N[\mathbf{0}, \mathbf{V}]$,
    the asymptotic expectation is $\boldsymbol{\theta}$ and the asymptotic variance is $(1/n)\mathbf{V}$. This statement implies, among other things, that the estimator is "asymptotically unbiased."

    At the risk of clouding the issue a bit, it is necessary to reconsider one aspect of the previous description. We have deliberately avoided the use of consistency even though, in most instances, that is what we have in mind. The description thus far might suggest that consistency and asymptotic unbiasedness are the same. Unfortunately (because it is a source of some confusion), they are not. They are if the estimator is consistent and asymptotically normally distributed, or CAN. They may differ in other settings, however. There are at least three possible definitions of asymptotic unbiasedness:
    1. The mean of the limiting distribution of $\sqrt{n}(\hat{\theta}_n - \theta)$ is 0.
    2. $\lim_{n\to\infty} E[\hat{\theta}_n] = \theta$.    (D-23)
    3. $\text{plim}\, \hat{\theta}_n = \theta$.
    In most cases encountered in practice, the estimator in hand will have all three properties, so there is no ambiguity. It is not difficult to construct cases in which the left-hand sides of all three definitions are different, however. [6] There is no general agreement among authors as to the precise meaning of asymptotic unbiasedness, perhaps because the term is misleading at the outset; asymptotic refers to an approximation, whereas unbiasedness is an exact result. [7] Nonetheless, the majority view seems to be that (2) is the proper definition of asymptotic unbiasedness. [8] Note, though, that this definition relies on quantities that are generally unknown and that may not exist.

    A similar problem arises in the definition of the asymptotic variance of an estimator. One common definition is
        $\text{Asy. Var}[\hat{\theta}_n] = \frac{1}{n}\lim_{n\to\infty} E\left[\left\{\sqrt{n}\left(\hat{\theta}_n - \lim_{n\to\infty} E[\hat{\theta}_n]\right)\right\}^2\right]$. [9]    (D-24)
    This result is a leading term approximation, and it will be sufficient for nearly all applications. Note, however, that like definition 2 of asymptotic unbiasedness, it relies on unknown and possibly nonexistent quantities.

    [6] See, for example, Maddala (1977a, p. 150).
    [7] See, for example, Theil (1971, p. 377).
    [8] Many studies of estimators analyze the "asymptotic bias" of, say, $\hat{\theta}_n$ as an estimator of a parameter $\theta$. In most cases, the quantity of interest is actually $\text{plim}[\hat{\theta}_n - \theta]$. See, for example, Greene (1980b) and another example in Johnston (1984, p. 312).
    [9] Kmenta (1986, p. 165).
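    Example D.10 Asymptotic Moments of the Sample Variance

    The exact expected value and variance of the variance estimator
        $m_2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$    (D-25)
    are
        $E[m_2] = \frac{(n - 1)\sigma^2}{n}$    (D-26)
    and
        $\text{Var}[m_2] = \frac{\mu_4 - \sigma^4}{n} - \frac{2(\mu_4 - 2\sigma^4)}{n^2} + \frac{\mu_4 - 3\sigma^4}{n^3}$,    (D-27)
    where $\mu_4 = E[(x - \mu)^4]$. [See Goldberger (1964, pp. 97-99).] The leading term approximation would be
        $\text{Asy. Var}[m_2] = \frac{1}{n}(\mu_4 - \sigma^4)$.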
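    To see how quickly the leading term in Example D.10 takes over, the exact variance in (D-27) can be compared with $(\mu_4 - \sigma^4)/n$ for a few sample sizes. The sketch below assumes a standard normal population, for which $\sigma^2 = 1$ and $\mu_4 = 3$; the function names are hypothetical.

```python
def var_m2_exact(n, mu4, sigma2):
    """Exact variance of m2 = (1/n) * sum (x_i - xbar)^2, following (D-27)."""
    s4 = sigma2 ** 2
    return (mu4 - s4) / n - 2 * (mu4 - 2 * s4) / n ** 2 + (mu4 - 3 * s4) / n ** 3

def var_m2_leading(n, mu4, sigma2):
    """Leading-term (asymptotic) variance, (mu4 - sigma^4) / n."""
    return (mu4 - sigma2 ** 2) / n

# Standard normal population (an assumption of this sketch): sigma^2 = 1, mu4 = 3.
for n in (5, 50, 500):
    exact = var_m2_exact(n, 3.0, 1.0)
    lead = var_m2_leading(n, 3.0, 1.0)
    print(f"n={n:3d}  exact={exact:.6f}  leading={lead:.6f}  ratio={exact / lead:.4f}")
```

    In the normal case the third term of (D-27) vanishes and the exact variance reduces to $2\sigma^4(n-1)/n^2$, so the ratio of exact to leading term is $(n-1)/n$, which approaches one.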

    D.4 SEQUENCES AND THE ORDER OF A SEQUENCE

    This section has been concerned with sequences of constants, denoted, for example, $c_n$, and random variables, such as $x_n$, that are indexed by a sample size, $n$. An important characteristic of a sequence is the rate at which it converges (or diverges). For example, as we have seen, the mean of a random sample of $n$ observations from a distribution with finite mean, $\mu$, and finite variance, $\sigma^2$, is itself a random variable with variance $\bar{\sigma}_n^2 = \sigma^2/n$. We see that as long as $\sigma^2$ is a finite constant, $\bar{\sigma}_n^2$ is a sequence of constants that converges to zero. Another example is the random variable $x_{(1),n}$, the minimum value in a random sample of $n$ observations from the exponential distribution with mean $1/\theta$ defined in Example C.4. It turns out that $x_{(1),n}$ has variance $1/(n\theta)^2$. Clearly, this variance also converges to zero, but, intuition suggests, faster than $\sigma^2/n$ does. On the other hand, the sum of the integers from one to $n$, $S_n = n(n + 1)/2$, obviously diverges as $n \to \infty$, albeit faster (one might expect) than the log of the likelihood function for the exponential distribution in Example 4.6, which is $\log L(\theta) = n(\log\theta - \theta\bar{x}_n)$. As a final example, consider the downward bias of the maximum likelihood estimator of the variance of the normal distribution, $c_n = (n - 1)/n$, which is a constant that converges to one. (See Example C.5.)

    We will define the rate at which a sequence converges or diverges in terms of the order of the sequence.

    DEFINITION D.14 Order $n^\delta$
    A sequence $c_n$ is of order $n^\delta$, denoted $O(n^\delta)$, if and only if $\text{plim}(1/n^\delta)c_n$ is a finite nonzero constant.

    DEFINITION D.15 Order less than $n^\delta$
    A sequence $c_n$ is of order less than $n^\delta$, denoted $o(n^\delta)$, if and only if $\text{plim}(1/n^\delta)c_n$ equals zero.

    Thus, in our examples, $\bar{\sigma}_n^2$ is $O(n^{-1})$, $\text{Var}[x_{(1),n}]$ is $O(n^{-2})$ and $o(n^{-1})$, $S_n$ is $O(n^2)$ ($\delta$ equals $+2$ in this case), $\log L(\theta)$ is $O(n)$ ($\delta$ equals $+1$), and $c_n$ is $O(1)$ ($\delta = 0$). Important particular cases that we will encounter repeatedly in our work are sequences for which $\delta = 1$ or $-1$.

    The notion of order of a sequence is often of interest in econometrics in the context of the variance of an estimator. Thus, we see in Section C.3 that an important element of our strategy for forming an asymptotic distribution is that the variance of the limiting distribution of $\sqrt{n}(\bar{x}_n - \mu)/\sigma$ is $O(1)$. In Example D.10 the variance of $m_2$ is the sum of three terms that are $O(n^{-1})$, $O(n^{-2})$, and $O(n^{-3})$. The sum is $O(n^{-1})$, because $n\,\text{Var}[m_2]$ converges to $\mu_4 - \sigma^4$, the numerator of the first, or leading term, whereas the second and third terms converge to zero. This term is also the dominant term of the sequence. Finally, consider the two divergent examples in the preceding list. $S_n$ is simply a deterministic function of $n$ that explodes. However, $\log L(\theta) = n\log\theta - \theta\sum_i x_i$ is the sum of a constant that is $O(n)$ and a random variable with variance equal to $n$. The random variable "diverges" in the sense that its variance grows without bound as $n$ increases.

    APPENDIX E

    COMPUTATION AND OPTIMIZATION
    E.1 INTRODUCTION
    The computation of empirical estimates by econometricians involves using digital computers and software written either by the researchers themselves or by others.[1] It is also a surprisingly balanced mix of art and science. It is important for software users to be aware of how results are obtained, not only to understand routine computations, but also to be able to explain the occasional strange and contradictory results that do arise. This appendix will describe some of the basic elements of computing and a number of tools that are used by econometricians.[2] Sections E.2 and E.3 present issues that arise in generation of artificial data using Monte Carlo methods. Section E.4 describes bootstrapping, which is a method often used for estimating variances when analytical expressions cannot be obtained. Section E.5 then describes some techniques for computing certain integrals and derivatives that are recurrent in econometric applications. Section E.6 presents methods of optimization of functions. Some examples are also given in Section E.6.

    [1] It is one of the interesting aspects of the development of econometric methodology that the adoption of certain classes of techniques has proceeded in discrete jumps with the development of software. Noteworthy examples include the appearance, both around 1970, of G. K. Jöreskog's LISREL [Jöreskog and Sörbom (1981)] program, which spawned a still-growing industry in linear structural modeling, and TSP [Hall (1982)], which was among the first computer programs to accept symbolic representations of econometric models and which provided a significant advance in econometric practice with its LSQ procedure for systems of equations.

    [2] This discussion is not intended to teach the reader how to write computer programs. For those who expect to do so, there are whole libraries of useful sources. Three very useful works are Kennedy and Gentle (1980), Abramovitz and Stegun (1971), and especially Press et al. (1986). The third of these provides a wealth of expertly written programs and a large amount of information about how to do computation efficiently and accurately. A recent survey of many areas of computation is Judd (1998).

    E.2 DATA INPUT AND GENERATION

    The data used in an econometric study can be broadly characterized as either real or simulated. "Real" data consist of actual measurements on some physical phenomenon such as the level of activity of an economy or the behavior of real consumers. For present purposes, the defining characteristic of such data is that they are generated outside the context of the empirical study and are gathered for the purpose of measuring some aspect of their real-world counterpart, such as an elasticity of some aspect of consumer behavior. The alternative is simulated data, produced by the analyst with a random number generator, usually for the purpose of studying the behavior of econometric estimators for which the statistical properties are unknown or impossible to derive. This section will consider a few aspects of the manipulation of data with a computer.
    E.2.1 GENERATING PSEUDO-RANDOM NUMBERS

    Monte Carlo methods and Monte Carlo studies of estimators are enjoying a flowering in the econometrics literature. In these studies, data are generated internally in the computer using pseudo-random number generators. These computer programs generate sequences of values that appear to be strings of draws from a specified probability distribution. There are many types of random number generators, but most take advantage of the inherent inaccuracy of the digital representation of real numbers. The method of generation is usually by the following steps:

    0. Set a seed.
    1. Update the seed by $\text{seed}_j = \text{seed}_{j-1} \times s$ value.
    2. $x_j = \text{seed}_j \times x$ value.
    3. Transform $x_j$ if necessary, then move $x_j$ to desired place in memory.
    4. Return to Step 1, or exit if no additional values are needed.

    Random number generators produce sequences of values that resemble strings of random draws from the specified distribution. In fact, the sequence of values produced by the preceding method is not truly random at all; it is a deterministic Markov chain of values. The set of 32 bits in the random value only appear random when subjected to certain tests. [See Press et al. (1986).] Since the series is, in fact, deterministic, at any point that a generator produces a value it has produced before, it must thereafter replicate the entire sequence. Since modern digital computers typically use 32-bit words to represent numbers, it follows that the longest string of values that this kind of generator can produce is $2^{32} - 1$ (about 4.3 billion). This length is the period of a random number generator. (A generator with a shorter period than this would be inefficient, since it is possible to achieve this period with some fairly simple algorithms.) Some improvements in the periodicity of a generator can be achieved by the method of shuffling. By this method, a set of, say, 128 values is maintained in an array. The random draw is used to select one of these 128 positions, from which the draw is taken, and then the value in the array is replaced with a draw from the generator. The period of the generator can also be increased by combining several generators. [See L'Ecuyer (1998) and Greene (2001).]

    The deterministic nature of pseudo-random number generators is both a flaw and a virtue. Since many Monte Carlo studies require billions of draws, the finite period of any generator represents a nontrivial consideration. On the other hand, being able to reproduce a sequence of values just by resetting the seed to its initial value allows the researcher to replicate a study.[3] The seed itself can be a problem. It is known that certain seeds in particular generators will produce shorter series or series that do not pass randomness tests. For example, congruential generators of the sort discussed above should be started from odd seeds.
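The seed-update steps and the replication property described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the multiplier 16807 and modulus 2^31 - 1 are well-known textbook choices for a multiplicative congruential generator, not values taken from this appendix, and the class name is hypothetical.

```python
class MultiplicativeCongruential:
    """Toy generator: seed_j = (A * seed_{j-1}) mod M, x_j = seed_j / M."""

    A = 16807        # multiplier (plays the role of the "s value"); assumed choice
    M = 2**31 - 1    # modulus; assumed choice

    def __init__(self, seed):
        if not 0 < seed < self.M:
            raise ValueError("seed must lie strictly between 0 and M")
        self.seed = seed                              # Step 0: set a seed

    def draw(self):
        self.seed = (self.A * self.seed) % self.M     # Step 1: update the seed
        return self.seed / self.M                     # Step 2: scale into (0, 1)


gen = MultiplicativeCongruential(seed=12345)
first = [gen.draw() for _ in range(5)]

# Resetting the seed to its initial value reproduces the sequence exactly.
gen = MultiplicativeCongruential(seed=12345)
again = [gen.draw() for _ in range(5)]
```

Because the update is deterministic, `first` and `again` are identical, which is the replication property noted in the text. A production study would normally rely on a library generator with a much longer period rather than a sketch like this one.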
    E.2.2 SAMPLING FROM A STANDARD UNIFORM POPULATION

    When sampling from a standard uniform, $U[0, 1]$ population, the sequence is a kind of difference equation, since given the initial seed, $x_j$ is ultimately a function of $x_{j-1}$. In most cases, the result at step 2 is a pseudodraw from the continuous uniform distribution in the range zero to one, which can then be transformed to a draw from another distribution by using the fundamental probability transformation.
    E.2.3 SAMPLING FROM CONTINUOUS DISTRIBUTIONS

    As soon as the sequence of $U[0, 1]$ values is obtained, there are several ways to transform them to a sample from the desired distribution. A common approach is to use the fundamental probability transform. For continuous distributions, this is done by treating the draw, $F$, as if $F$ were $F(x)$, where $F$ is the cdf of $x$. For example, if we desire draws from the exponential distribution with known $\theta$, then $F(x) = 1 - \exp(-\theta x)$. The inverse transform is $x = (-1/\theta)\ln(1 - F)$. For example, for a draw of 0.4 with $\theta = 5$, the associated $x$ would be 0.1022. One of the most common applications is the draws from the standard normal distribution, which is complicated because there is no closed form for $\Phi^{-1}(F)$. There are several ways to proceed. One is to approximate the inverse function. One well-known approximation is given in Abramovitz and Stegun (1971):

    $$\Phi^{-1}(F) = x \approx T - \frac{c_0 + c_1 T + c_2 T^2}{1 + d_1 T + d_2 T^2 + d_3 T^3},$$

    where $T = [\ln(1/H^2)]^{1/2}$ and $H = F$ if $F < 0.5$ and $1 - F$ otherwise, so that $H$ is always the smaller tail probability. The sign is then reversed if $F < 0.5$. A second method is to transform the $U[0, 1]$ values directly to a standard normal value. The Box-Muller (1958) method is $z = (-2\ln x_1)^{1/2}\cos(2\pi x_2)$, where $x_1$ and $x_2$ are two independent $U[0, 1]$ draws. A second $N[0, 1]$ draw can be obtained from the same two values by replacing cos with sin in the transformation. The Marsaglia-Bray (1964) generator, $z_i = x_i\,(-(2/v)\ln v)^{1/2}$, where $x_i = 2w_i - 1$, $w_i$ is a random draw from $U[0, 1]$, and $v = x_1^2 + x_2^2$, $i = 1, 2$, is often used as well. (The pair of draws must be rejected and redrawn if $v \geq 1$.)

    Sequences of draws from the standard normal distribution can be transformed easily into draws from other distributions by making use of the results in Section B.4. The square of a standard normal has chi-squared[1], and the sum of $K$ independent chi-squared[1] variables is chi-squared[$K$]. From this relationship, it is possible to produce samples from the chi-squared, $t$, $F$, and beta distributions. A related problem is obtaining draws from the truncated normal distribution. An obviously inefficient (albeit effective) method of drawing values from the truncated normal $[\mu, \sigma^2]$ distribution in the range $[L, U]$ is simply to draw $F$ from the $U[0, 1]$ distribution and transform it first to a standard normal variate as discussed previously and then to the $N[\mu, \sigma^2]$ variate by using $x = \mu + \sigma\Phi^{-1}(F)$. Finally, the value $x$ is retained if it falls in the range $[L, U]$ and discarded otherwise. This method will require, on average, $1/\{\Phi[(U - \mu)/\sigma] - \Phi[(L - \mu)/\sigma]\}$ draws per observation, which could be substantial. A direct transformation that requires only one draw is as follows. Let $P_j = \Phi[(j - \mu)/\sigma]$, $j = L, U$. Then

    $$x = \mu + \sigma\,\Phi^{-1}[P_L + F \times (P_U - P_L)]. \qquad \text{(E-1)}$$

    [3] Current trends in the econometrics literature dictate that readers of empirical studies be able to replicate applied work. In Monte Carlo studies, at least in principle, data can be replicated efficiently merely by providing the random number generator and the seed.
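Three of the transformations in this section (the exponential inverse transform, the Box-Muller method, and the one-draw truncated normal method in (E-1)) can be sketched as follows. The function names are hypothetical, and the inverse normal cdf is taken from Python's `statistics.NormalDist` rather than from the rational approximation quoted in the text.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal; provides cdf() and inv_cdf()


def exponential_draw(F, theta):
    """Inverse transform for F(x) = 1 - exp(-theta * x)."""
    return -math.log(1.0 - F) / theta


def box_muller(x1, x2):
    """Two N[0,1] draws from two independent uniform draws (x1 must be > 0)."""
    radius = math.sqrt(-2.0 * math.log(x1))
    angle = 2.0 * math.pi * x2
    return radius * math.cos(angle), radius * math.sin(angle)


def truncated_normal_draw(F, mu, sigma, L, U):
    """One-draw method (E-1): x = mu + sigma * Phi^{-1}[P_L + F (P_U - P_L)]."""
    P_L = STD_NORMAL.cdf((L - mu) / sigma)
    P_U = STD_NORMAL.cdf((U - mu) / sigma)
    return mu + sigma * STD_NORMAL.inv_cdf(P_L + F * (P_U - P_L))


rng = random.Random(2002)
x_exp = exponential_draw(0.4, 5.0)                      # the text's example draw
z1, z2 = box_muller(1.0 - rng.random(), rng.random())   # 1 - u lies in (0, 1]
x_trunc = truncated_normal_draw(rng.random(), mu=1.0, sigma=2.0, L=0.0, U=3.0)
```

The exponential example reproduces the value of about 0.1022 quoted in the text. Note that `inv_cdf` requires an argument strictly between zero and one, so this sketch assumes truncation points for which $P_L > 0$.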

    E.2.4 SAMPLING FROM A MULTIVARIATE NORMAL POPULATION

    A common application involves draws from a multivariate normal distribution with specified mean $\mu$ and covariance matrix $\Sigma$. To sample from this $K$-variate distribution, we begin with a draw, $z$, from the $K$-variate standard normal distribution just by stacking $K$ independent draws from the univariate standard normal distribution. Let $T$ be the square root of $\Sigma$ such that $TT' = \Sigma$.[4] The desired draw is then just $x = \mu + Tz$. A draw from a Wishart distribution of order $K$ (which is a multivariate generalization of the chi-squared distribution) can be produced by computing $X'M^0X$, where each row of $X$ is a draw from the multivariate normal distribution. Note that the Wishart is a matrix-variate random variable and that a sample of $M$ draws from the Wishart distribution ultimately requires $M \times N \times K$ draws from the standard normal distribution, however generated.
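A minimal sketch of the draw $x = \mu + Tz$ follows, assuming $T$ is obtained from a Cholesky decomposition. The helper names and the example covariance matrix are illustrative assumptions, and the small pure-Python factorization stands in for a library routine.

```python
import math
import random


def cholesky(sigma):
    """Lower-triangular T with T T' = sigma (sigma symmetric positive definite)."""
    K = len(sigma)
    T = [[0.0] * K for _ in range(K)]
    for i in range(K):
        for j in range(i + 1):
            s = sum(T[i][k] * T[j][k] for k in range(j))
            if i == j:
                T[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                T[i][j] = (sigma[i][j] - s) / T[j][j]
    return T


def mvn_draw(mu, T, rng):
    """x = mu + T z, where z stacks K independent N[0,1] draws."""
    K = len(mu)
    z = [rng.gauss(0.0, 1.0) for _ in range(K)]
    return [mu[i] + sum(T[i][k] * z[k] for k in range(i + 1)) for i in range(K)]


sigma = [[4.0, 2.0], [2.0, 3.0]]     # illustrative covariance matrix
T = cholesky(sigma)
rng = random.Random(7)
sample = [mvn_draw([1.0, -1.0], T, rng) for _ in range(5)]
```

For this covariance matrix the factor is $T = [[2, 0], [1, \sqrt{2}]]$, which can be checked by multiplying out $TT'$.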
    E.2.5 SAMPLING FROM A DISCRETE POPULATION

    Discrete distributions, such as the Poisson, present a different problem. There is no obvious inverse transformation for most of these. One inefficient, albeit unfortunately unavoidable, method for some distributions is to draw the $F$ and then search sequentially for the discrete value that has cdf equal to or greater than $F$. This procedure makes intuitive sense, but it can involve a lot of computation. The rejection method described by Press et al. (1986, pp. 203-209) will be more efficient (although not more accurate) for some distributions.
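The sequential search described above might look like the following for the Poisson case. The function name is hypothetical, and the stopping guard on an underflowed probability is an added safeguard rather than part of the method in the text.

```python
import math


def poisson_draw(F, lam):
    """Smallest integer k whose Poisson(lam) cdf is at least F, for F in [0, 1)."""
    k = 0
    prob = math.exp(-lam)               # Prob[X = 0]
    cdf = prob
    while cdf < F and prob > 0.0:       # stop if the tail probability underflows
        k += 1
        prob *= lam / k                 # Prob[X = k] from Prob[X = k - 1]
        cdf += prob
    return k


draws = [poisson_draw(F, 1.0) for F in (0.0, 0.3, 0.5, 0.9)]
```

With mean 1, the cdf at 0, 1, and 2 is roughly 0.368, 0.736, and 0.920, so the four uniform values above map to 0, 0, 1, and 2. The cost of the search grows with the size of the value drawn, which is the inefficiency the text refers to.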
    E.2.6 THE GIBBS SAMPLER

    The following problem is pervasive in Bayesian statistics and econometrics, although it has many applications in classical problems as well. (See Chapter 16 for an application.) We are given a joint density $f(x, y_1, y_2, \ldots, y_K)$. We are interested in studying the characteristics, such as the mean, of the marginal distribution

    $$f(x) = \int_{y_K} \cdots \int_{y_1} f(x, y_1, y_2, \ldots, y_K)\, dy_1 \cdots dy_K.$$

    The direct approach, actually doing the integration to obtain the marginal density, may be infeasible, or at least complicated enough to seem so. But the Gibbs sampler, a technique that has begun to enjoy a surge of activity in the econometrics literature, allows one to generate random draws from the marginal density $f(x)$ without having to compute it.[5][6] The theory is presented in Casella and George (1992), among others. We will briefly sketch the mechanics of the technique and examine an application to a bivariate distribution. Consider a two-variable case, $f(x, y)$, in which $f(x \mid y)$ and $f(y \mid x)$ are known. A "Gibbs sequence" of draws, $y_0, x_0, y_1, x_1, y_2, \ldots, y_M, x_M$, is generated as follows. First, $y_0$ is specified "manually." Then $x_0$ is obtained as a random draw from the population $f(x \mid y_0)$. Then $y_1$ is drawn from $f(y \mid x_0)$, and so on. The iteration is, generically, as follows:

    1. Draw $x_j$ from $f(x \mid y_j)$.
    2. Draw $y_{j+1}$ from $f(y \mid x_j)$.
    3. Exit or return to step 1.

    [4] In practice, this is usually done with a Cholesky decomposition in which $T$ is a lower triangular matrix. See Section B.7.11.

    [5] A very readable introduction to the technique, on which we have based most of this discussion, is Casella and George (1992).

    [6] The technique lends itself naturally to Bayesian applications, which is where most of the applications are to be found. See, for example, Albert and Chib (1993a,b), Chib (1992), Chib and Greenberg (1996), and Carlin and Chib (1995). There are classical applications as well, as surveyed in Tanner (1993) and Gelfand and Smith (1990).

    Note that the values in the sequence are not independent; they are a Markov chain. If enough iterations are completed, the final observation, $x_M$, is a draw from $f(x)$, and likewise for $y_M$.[7] Characteristics of the marginal distributions, such as the means, variances, or values of the densities, can then be studied just by using corresponding averages of the functions of the observations.
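As an illustration of the iteration, consider a case where both conditionals are known in closed form. The sketch below assumes a standard bivariate normal with correlation $\rho$, for which $x \mid y \sim N[\rho y, 1 - \rho^2]$ and symmetrically for $y \mid x$. This distribution, the starting value, and the number of discarded early draws are all illustrative choices, not taken from the text.

```python
import math
import random


def gibbs_bivariate_normal(rho, M, y0, rng):
    """Gibbs sequence for a standard bivariate normal with correlation rho."""
    sd = math.sqrt(1.0 - rho * rho)    # conditional standard deviation
    xs, ys = [], []
    y = y0                             # y_0 is specified "manually"
    for _ in range(M):
        x = rng.gauss(rho * y, sd)     # Step 1: draw x_j from f(x | y_j)
        xs.append(x)
        ys.append(y)
        y = rng.gauss(rho * x, sd)     # Step 2: draw y_{j+1} from f(y | x_j)
    return xs, ys


rng = random.Random(1992)
xs, ys = gibbs_bivariate_normal(rho=0.5, M=20000, y0=5.0, rng=rng)
kept = xs[1000:]                       # discard early draws (assumed burn-in)
mean_x = sum(kept) / len(kept)
var_x = sum((x - mean_x) ** 2 for x in kept) / len(kept)
```

The marginal distribution of $x$ here is $N[0, 1]$, so `mean_x` and `var_x` should be close to 0 and 1 even though the chain was started far away at $y_0 = 5$ and the marginal density was never computed.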

    E.3 MONTE CARLO STUDIES

    Simulated data generated by the methods of the previous section have various uses in econometrics. One of the more common applications is the derivation of the properties of estimators or in obtaining comparisons of the properties of estimators. For example, in time-series settings, most of the known results for characterizing the sampling distributions of estimators are asymptotic, large-sample results. But the typical time series is not very long, and descriptions that rely on $T$, the number of observations, going to infinity may not be very accurate. Exact, finite sample properties are usually intractable, however, which leaves the analyst with only the choice of learning about the behavior of the estimators experimentally. In the typical application, one would either compare the properties of two or more estimators while holding the sampling conditions fixed or study how the properties of an estimator are affected by changing conditions such as the sample size or the value of an underlying parameter.
    Example E.1 Monte Carlo Study of the Mean Versus the Median

    In Example D.7, we compared the asymptotic distributions of the sample mean and the sample median in random sampling from the normal distribution. The basic result is that both estimators are consistent, but the mean is asymptotically more efficient by a factor of

    $$\frac{\text{Asy. Var}[\text{Median}]}{\text{Asy. Var}[\text{Mean}]} = \frac{\pi}{2} = 1.5708.$$

    This result is useful, but it does not tell which is the better estimator in small samples, nor does it suggest how the estimators would behave in some other distribution. It is known that the mean is affected by outlying observations whereas the median is not. The effect is averaged out in large samples, but the small sample behavior might be very different. To investigate the issue, we constructed the following experiment: We sampled 500 observations from the $t$ distribution with $d$ degrees of freedom by sampling $d + 1$ values from the standard normal distribution and then computing

    $$t_{ir} = \frac{z_{ir,d+1}}{\sqrt{\frac{1}{d}\sum_{l=1}^{d} z_{ir,l}^2}}, \quad i = 1, \ldots, 500, \quad r = 1, \ldots, 100.$$

    The $t$ distribution with a low value of $d$ was chosen because it has very thick tails and because large, outlying values have high probability. For each value of $d$, we generated $R = 100$ replications. For each of the 100 replications, we obtained the mean and median. Since both are unbiased, we compared the mean squared errors around the true expectations using

    $$M_d = \frac{(1/R)\sum_{r=1}^{R}(\text{median}_r - 0)^2}{(1/R)\sum_{r=1}^{R}(\bar{x}_r - 0)^2}.$$

    We obtained ratios of 0.6761, 1.2779, and 1.3765 for $d = 3$, 6, and 10, respectively. (You might want to repeat this experiment with different degrees of freedom.) These results agree with what intuition would suggest. As the degrees of freedom parameter increases, which brings the distribution closer to the normal distribution, the sample mean becomes more efficient; the ratio should approach its limiting value of 1.5708 as $d$ increases. What might be surprising is the apparent overwhelming advantage of the median when the distribution is very nonnormal.

    The preceding is a very small, straightforward application of the technique. In a typical study, there are many more parameters to be varied and more dimensions upon which the results are to be studied. One of the practical problems in this setting is how to organize the results. There is a tendency in Monte Carlo work to proliferate tables indiscriminately. It is incumbent on the analyst to collect the results in a fashion that is useful to the reader. For example, this requires some judgment on how finely one should vary the parameters of interest. One useful possibility that will often mimic the thought process of the reader is to collect the results of bivariate tables in carefully designed contour plots.

    There are any number of situations in which Monte Carlo simulation offers the only method of learning about finite sample properties of estimators. Still, there are a number of problems with Monte Carlo studies. First, to achieve any level of generality, the number of parameters that must be varied, and hence the amount of information that must be distilled, can become enormous. Second, they are limited by the design of the experiments, so the results they produce are rarely generalizable. For our example, we may have learned something about the $t$ distribution. But the results that would apply in other distributions remain to be described. And, unfortunately, real data will rarely conform to any specific distribution, so no matter how many other distributions we analyze, our results would still only be suggestive. In more general terms, this problem of specificity [Hendry (1984)] limits most Monte Carlo studies to quite narrow ranges of applicability. There are very few that have proved general enough to have provided a widely cited result.[8]

    [7] Determining when to stop the sequence is an interesting and yet unsolved problem. See Casella and George (1992, pp. 172-173), Raftery and Lewis (1992), Roberts (1992), and Zellner and Min (1995).
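A scaled-down version of the experiment in Example E.1 can be sketched as follows. To keep the run short, this sketch assumes 100 observations per sample and 400 replications rather than the 500 and 100 used in the example, so its ratios will not match the reported figures. The function names and the seed are hypothetical.

```python
import math
import random
import statistics


def t_draw(d, rng):
    """t[d] variate: one N[0,1] draw over the root mean square of d others."""
    z = rng.gauss(0.0, 1.0)
    chi_squared = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
    return z / math.sqrt(chi_squared / d)


def mse_ratio(d, n, R, rng):
    """M_d: MSE of the sample median over MSE of the sample mean (truth is 0)."""
    median_sq = mean_sq = 0.0
    for _ in range(R):
        sample = [t_draw(d, rng) for _ in range(n)]
        median_sq += statistics.median(sample) ** 2
        mean_sq += (sum(sample) / n) ** 2
    return median_sq / mean_sq          # the 1/R factors cancel in the ratio


rng = random.Random(50240)
ratios = {d: mse_ratio(d, n=100, R=400, rng=rng) for d in (3, 10)}
```

A ratio below one favors the median and a ratio above one favors the mean, so one would expect `ratios[3]` to fall below `ratios[10]`, in line with the pattern reported in the example.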

    E.4 BOOTSTRAPPING AND THE JACKKNIFE

    The technique of bootstrapping is used to obtain a description of the sampling properties of empirical estimators using the sample data themselves, rather than broad theoretical results.[9] Suppose that $\hat{\theta}_n$ is an estimate of a parameter vector $\theta$ based on a sample $X = (x_1, \ldots, x_n)$. An approximation to the statistical properties of $\hat{\theta}_n$ can be obtained by studying a sample of bootstrap estimators $\hat{\theta}(b)_m$, $b = 1, \ldots, B$, obtained by sampling $m$ observations, with replacement, from $X$ and recomputing $\hat{\theta}$ with each sample (in the usual case, $m = n$). After a total of $B$ times, the desired sampling characteristic is computed from

    $$\hat{\Theta} = [\hat{\theta}(1)_m, \hat{\theta}(2)_m, \ldots, \hat{\theta}(B)_m].$$

    For example, if it were known that the estimator were consistent and if $n$ were reasonably large, then one might approximate the asymptotic covariance matrix of the estimator $\hat{\theta}$ by using

    $$\text{Est. Asy. Var}[\hat{\theta}] = \frac{1}{B}\sum_{b=1}^{B} [\hat{\theta}(b)_m - \hat{\theta}_n][\hat{\theta}(b)_m - \hat{\theta}_n]'.$$

    [8] Two that have withstood the test of time are Griliches and Rao (1969) and Kmenta and Gilbert (1968).

    [9] See Efron (1979) and Efron and Tibshirani (1993).


    This technique was developed by Efron (1979) and has been appearing with increasing frequency in the applied econometrics literature. [See, for example, Veall (1987, 1992), Vinod (1993), and Vinod and Raj (1994).] An application of this technique to the least absolute deviations in the linear model is shown in the example below and in Chapter 5, and to a model of binary choice in Section 21.5.3.
    Example E.2 Bootstrapping the Variance of the Median

    As discussed earlier, there are few cases in which an exact expression for the sampling variance of the median is known. In the previous example, we examined the case of the median of a sample of 500 observations from the $t$ distribution with 10 degrees of freedom. This is one of those cases in which there is no exact formula for the asymptotic variance of the median. However, we can use the bootstrap technique to estimate one empirically. (You might want to replicate this experiment. It is revisited in the exercises.) To demonstrate, consider the same data as used in the preceding example. We have a sample of 500 observations, for which we have computed the median, 0.00786. We drew 100 samples of 500 with replacement from this sample and recomputed the median with each of these samples. The empirical square root of the mean squared deviation around this estimate of 0.00786 was 0.056. In contrast, consider the same calculation for the mean. The sample mean is 0.07247. The sample standard deviation is 1.08469, so the standard error of the mean is 0.04657. (The bootstrap estimate of the standard error of the mean was 0.052.) This agrees with our expectation in that the sample mean should generally be a more efficient estimator of the mean of the distribution in a large sample.

    There is another approach we might take in this situation. Consider the regression model $y_i = \alpha + \varepsilon_i$, where $\varepsilon_i$ has a symmetric distribution with finite variance. As discussed in Chapter 8, the least absolute deviations estimator of the coefficient in this model is an estimator of the median (which equals the mean) of the distribution. So, this presents another estimator. Once again, the bootstrap estimator must be used to estimate the asymptotic variance of the estimator. Using the same data, we fit this regression model using the LAD estimator. The coefficient estimate is 0.05397 with a bootstrap estimated standard error of 0.05872. The estimated standard error agrees with the earlier one. The difference in the estimated coefficient stems from the different computations: the regression estimate is the solution to a linear programming problem, while the earlier estimate is the actual sample median.

    The jackknife estimator is similar to the bootstrap estimator. Efron and Tibshirani argue that it is an approximation to the bootstrap. The jackknife procedure is carried out by redoing the estimation $n$ times, $i = 1, \ldots, n$, in each case leaving out the $i$th observation. The remaining computations are the same as for the bootstrap. The comparison of the two procedures is inconclusive, but Efron and Tibshirani suggest that by several criteria, the bootstrap is likely to be preferable. For a large sample, the simple advantage of the bootstrap estimator in terms of the amount of computation is likely to be compelling.
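The bootstrap calculation in Example E.2 can be sketched as follows. The data here are stand-in standard normal draws, not the sample used in the example, so the numbers produced will differ from those quoted. The function name and seed are hypothetical.

```python
import math
import random
import statistics


def bootstrap_se(data, estimator, B, rng):
    """Root mean squared deviation of B bootstrap estimates around the full-sample estimate."""
    n = len(data)
    full_sample_estimate = estimator(data)
    total = 0.0
    for _ in range(B):
        resample = [data[rng.randrange(n)] for _ in range(n)]   # n draws with replacement
        total += (estimator(resample) - full_sample_estimate) ** 2
    return math.sqrt(total / B)


rng = random.Random(1979)
data = [rng.gauss(0.0, 1.0) for _ in range(500)]    # stand-in data (assumption)
se_median = bootstrap_se(data, statistics.median, B=100, rng=rng)
se_mean = bootstrap_se(data, statistics.mean, B=100, rng=rng)
analytic_se_mean = statistics.stdev(data) / math.sqrt(len(data))
```

For the mean, the bootstrap figure can be checked against the usual analytic standard error, as is done in the example. For the median no such check is available, which is the reason for bootstrapping it in the first place.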

    E.5 COMPUTATION IN ECONOMETRICS

    The preceding showed how a number is translated from a symbol on a page to a physical entity in a computer that can be manipulated as part of a statistical study This section will discuss some aspects of data manipulation

    E.5.1 COMPUTING INTEGRALS

    One advantage of computers is their ability rapidly to compute approximations to complex functions such as logs and exponents. The basic functions, such as these, trigonometric functions, and so forth, are standard parts of the libraries of programs that accompany all scientific computing installations.[10] But one of the very common applications that often requires some high-level creativity by econometricians is the evaluation of integrals that do not have simple closed forms and that do not typically exist in "system libraries." We will consider several of these in this section. We will not go into detail on the nuts and bolts of how to compute integrals with a computer; rather, we will turn directly to the most common applications in econometrics.
    E.5.2 THE STANDARD NORMAL CUMULATIVE DISTRIBUTION FUNCTION

    The standard normal cumulative distribution function (cdf) is ubiquitous in econometric models. Yet this most homely of applications must be computed by approximation. There are a number of ways to do so.[11] Recall that what we desire is

    $$\Phi(x) = \int_{-\infty}^{x} \phi(t)\,dt, \quad \text{where } \phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.$$

    One way to proceed is to use a Taylor series:

    $$\Phi(x) \approx \sum_{i=0}^{M} \frac{1}{i!}\, \frac{d^i \Phi(x_0)}{dx_0^i}\, (x - x_0)^i.$$

    The normal cdf has some advantages for this approach. First, the derivatives are simple and not integrals. Second, the function is analytic; as $M \to \infty$, the approximation converges to the true value. Third, the derivatives have a simple form, which we have met before; they are the Hermite polynomials, and they can be computed by a simple recursion. The 0th term in the expansion above is $\Phi(x)$ evaluated at the expansion point. The first derivative of the cdf is the pdf, so the terms from 2 onward are the derivatives of $\phi(x)$, once again evaluated at $x_0$. The derivatives of the standard normal pdf obey the recursion

    $$\phi^{i}/\phi(x) = -x\,\phi^{i-1}/\phi(x) - (i - 1)\,\phi^{i-2}/\phi(x),$$

    where $\phi^{i}$ is $d^i\phi(x)/dx^i$. The zero and one terms in the sequence are one and $-x$. The next term is $x^2 - 1$, followed by $3x - x^3$ and $x^4 - 6x^2 + 3$, and so on. The approximation can be made more accurate by adding terms. Consider using a fifth-order Taylor series approximation around the point $x = 0$, where $\Phi(0) = 0.5$ and $\phi(0) = 0.3989423$. Evaluating the derivatives at 0 and assembling the terms produces the approximation

    $$\Phi(x) \approx \tfrac{1}{2} + 0.3989423\,[x - x^3/6 + x^5/40].$$

    [Some of the terms (every other one, in fact) will conveniently drop out.] Figure E.1 shows the actual values ($F$) and approximate values ($FA$) over the range $-2$ to 2. The figure shows two important points. First, the approximation is remarkably good over most of the range. Second, as is usually true for Taylor series approximations, the quality of the approximation deteriorates as one gets far from the expansion point.

    [10] Of course, at some level, these must have been programmed as approximations by someone.

    [11] Many system libraries provide a related function, the error function, $\text{erf}(x) = (2/\sqrt{\pi})\int_0^x e^{-t^2}\,dt$. If this is available, then the normal cdf can be obtained from $\Phi(x) = \tfrac{1}{2} + \tfrac{1}{2}\,\text{erf}(x/\sqrt{2})$, $x \geq 0$, and $\Phi(x) = 1 - \Phi(-x)$, $x \leq 0$.


    FIGURE E.1 Approximation to Normal cdf. [Figure omitted: actual values (F) and approximate values (FA) plotted against x over the range -2.00 to 2.00.]
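The comparison summarized in Figure E.1 can be reproduced numerically with a short sketch. The "exact" values are obtained from the error function available in Python's `math` module; the function names are hypothetical.

```python
import math


def normal_cdf_exact(x):
    """Phi(x) from the error function: 1/2 + (1/2) erf(x / sqrt(2))."""
    return 0.5 + 0.5 * math.erf(x / math.sqrt(2.0))


def normal_cdf_taylor(x):
    """Fifth-order Taylor series around 0: 1/2 + phi(0) [x - x^3/6 + x^5/40]."""
    return 0.5 + 0.3989423 * (x - x**3 / 6.0 + x**5 / 40.0)


for x in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(f"{x:4.1f}  F = {normal_cdf_exact(x):.5f}  FA = {normal_cdf_taylor(x):.5f}")
```

Near the expansion point the two columns agree to several digits, while at $x = 2$ the Taylor value exceeds one, which illustrates the deterioration far from the expansion point.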

    Unfortunately, it is the tail areas of the standard normal distribution that are usually of interest, so the preceding is likely to be problematic. An alternative approach that is used much more often is a polynomial approximation reported by Abramovitz and Stegun (1971, p. 932):

    $$\Phi(-|x|) = \phi(x)\sum_{i=1}^{5} a_i t^i + \varepsilon(x), \quad \text{where } t = 1/[1 + a_0|x|].$$

    (The complement is taken if $x$ is positive.) The error of approximation is less than $\pm 7.5 \times 10^{-8}$ for all $x$. (Note that the error exceeds the function value at $|x| > 5.7$, so this is the operational limit of this approximation.)
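A sketch of the polynomial approximation follows. The text does not list the constants $a_0, \ldots, a_5$; the values below are the ones commonly quoted for this formula and are an assumption here, so they should be checked against the cited reference before being relied upon.

```python
import math

# Commonly quoted constants for this approximation (assumed, not from the text).
A0 = 0.2316419
A = (0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)


def normal_cdf_poly(x):
    """Phi(-|x|) ~ phi(x) * sum_i a_i t^i with t = 1 / (1 + a0 |x|)."""
    t = 1.0 / (1.0 + A0 * abs(x))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    lower_tail = pdf * sum(a * t ** (i + 1) for i, a in enumerate(A))
    return 1.0 - lower_tail if x > 0 else lower_tail   # complement for positive x


def normal_cdf_erf(x):
    """Reference value from the error function."""
    return 0.5 + 0.5 * math.erf(x / math.sqrt(2.0))


worst = max(abs(normal_cdf_poly(x / 10.0) - normal_cdf_erf(x / 10.0))
            for x in range(-50, 51))
```

Over a grid from $-5$ to 5, the largest discrepancy `worst` stays within the error bound quoted in the text.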
    E.5.3 THE GAMMA AND RELATED FUNCTIONS

    The standard normal cdf is probably the most common application of numerical integration of a function in econometrics. Another very common application is the class of gamma functions. For positive constant $P$, the gamma function is

    $$\Gamma(P) = \int_0^\infty t^{P-1} e^{-t}\,dt.$$

    The gamma function obeys the recursion $\Gamma(P) = (P - 1)\Gamma(P - 1)$, so for integer values of $P$, $\Gamma(P) = (P - 1)!$. This result suggests that the gamma function can be viewed as a generalization of the factorial function for noninteger values. Another convenient value is $\Gamma(\tfrac{1}{2}) = \sqrt{\pi}$. By making a change of variable, it can be shown that for positive constants $a$, $c$, and $P$,

    $$\int_0^\infty t^{P-1} e^{-at^c}\,dt = \int_0^\infty t^{-(P+1)} e^{-a/t^c}\,dt = \frac{1}{c}\,a^{-P/c}\,\Gamma\!\left(\frac{P}{c}\right).$$

    As a generalization of the factorial function, the gamma function will usually overflow for the sorts of values of $P$ that normally appear in applications. The log of the function should normally be used instead. The function $\ln\Gamma(P)$ can be approximated remarkably accurately with only a handful of terms and is very easy to program. A number of approximations appear in the literature; they are generally modifications of Stirling's approximation to the factorial function $P! \approx (2\pi P)^{1/2} P^P e^{-P}$, so

    $$\ln\Gamma(P) \approx (P - 0.5)\ln P - P + 0.5\ln(2\pi) + C + \varepsilon,$$

    where $C$ is the correction term [see, e.g., Abramovitz and Stegun (1971, p. 257), Press et al. (1986, p. 157), or Rao (1973, p. 59)] and $\varepsilon$ is the approximation error.[12] The derivatives of the gamma function are

    $$\frac{d^r\,\Gamma(P)}{dP^r} = \int_0^\infty (\ln t)^r\, t^{P-1} e^{-t}\,dt.$$

    The first two derivatives of $\ln\Gamma(P)$ are denoted $\Psi(P) = \Gamma'/\Gamma$ and $\Psi'(P) = (\Gamma\Gamma'' - \Gamma'^2)/\Gamma^2$ and are known as the digamma and trigamma functions.[13] The beta function, denoted $\beta(a, b)$,

    $$\beta(a, b) = \int_0^1 t^{a-1}(1 - t)^{b-1}\,dt = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)},$$

    is related.
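The log-gamma approximation can be sketched as follows, using the four-term correction given in the accompanying footnote (shift small arguments up to at least 18, then subtract the log of the product that was introduced). The function name is hypothetical, and the library function `math.lgamma` is used only as a reference value.

```python
import math


def ln_gamma(P):
    """ln Gamma(P) by Stirling's formula plus a four-term correction, for P > 0."""
    if P > 18:
        z, q = P, 0.0
    else:
        J = 18 - int(P)                                  # shift so that z = P + J >= 18
        z = P + J
        q = sum(math.log(P + k) for k in range(J))       # ln[P (P+1) ... (P+J-1)]
    C = 1 / (12 * z) - 1 / (360 * z**3) + 1 / (1260 * z**5) - 1 / (1680 * z**7) - q
    return (z - 0.5) * math.log(z) - z + 0.5 * math.log(2.0 * math.pi) + C


checks = {P: ln_gamma(P) - math.lgamma(P) for P in (0.5, 5.0, 25.3, 150.0)}
```

The shift relies on the recursion $\Gamma(P + J) = P(P + 1)\cdots(P + J - 1)\,\Gamma(P)$. Working in logs avoids the overflow problem noted in the text: $\Gamma(150)$ itself is about $10^{260}$, but its log is a modest number.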
    E.5.4 APPROXIMATING INTEGRALS BY QUADRATURE

    The digamma and trigamma functions, and the gamma function for noninteger values of $P$ and values that are not integers plus $\tfrac{1}{2}$, do not exist in closed form and must be approximated. Most other applications will also involve integrals for which no simple computing function exists. The simplest approach to approximating

    $$F(x) = \int_{L(x)}^{U(x)} f(t)\,dt$$

    is likely to be a variant of Simpson's rule, or the trapezoid rule. For example, one approximation [see Press et al. (1986, p. 108)] is

    $$F(x) \approx \Delta\left[\tfrac{1}{3}f_1 + \tfrac{4}{3}f_2 + \tfrac{2}{3}f_3 + \tfrac{4}{3}f_4 + \cdots + \tfrac{2}{3}f_{N-2} + \tfrac{4}{3}f_{N-1} + \tfrac{1}{3}f_N\right],$$

    where $f_j$ is the function evaluated at $N$ equally spaced points in $[L, U]$ including the endpoints and $\Delta = (U - L)/(N - 1)$. There are a number of problems with this method, most notably that it is difficult to obtain satisfactory accuracy with a moderate number of points.

    Gaussian quadrature is a popular method of computing integrals. The general approach is to use an approximation of the form

    $$\int_L^U W(x) f(x)\,dx \approx \sum_{j=1}^{M} w_j f(a_j),$$

    where $W(x)$ is viewed as a "weighting" function for integrating $f(x)$, $w_j$ is the quadrature weight, and $a_j$ is the quadrature abscissa. Different weights and abscissas have been derived for several weighting functions. Two weighting functions common in econometrics are

    $$W(x) = x^c e^{-x}, \quad x \in [0, \infty),$$

    for which the computation is called Gauss-Laguerre quadrature, and

    $$W(x) = e^{-x^2}, \quad x \in (-\infty, \infty),$$

    for which the computation is called Gauss-Hermite quadrature. The theory for deriving weights and abscissas is given in Press et al. (1986, pp. 121-125). Tables of weights and abscissas for many values of $M$ are given by Abramovitz and Stegun (1971).

    [12] For example, one widely used formula is $C = z^{-1}/12 - z^{-3}/360 + z^{-5}/1260 - z^{-7}/1680 - q$, where $z = P$ and $q = 0$ if $P > 18$, or $z = P + J$ and $q = \ln[P(P + 1)(P + 2)\cdots(P + J - 1)]$, where $J = 18 - \text{INT}(P)$, if not. Note, in the approximation, we write $\Gamma(P) = (P!)/P$ + a correction.

    [13] Tables of specific values for the gamma, digamma, and trigamma functions appear in Abramovitz and Stegun (1971). Most contemporary econometric programs have built-in functions for these common integrals, so the tables are not generally needed.
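Both rules can be sketched briefly. The Simpson's rule function follows the weights shown above and assumes $N$ is odd. The Gauss-Hermite example uses the three-point rule, whose abscissas ($0, \pm\sqrt{3/2}$) and weights ($2\sqrt{\pi}/3, \sqrt{\pi}/6$) are standard tabulated values rather than values given in the text. The function names are hypothetical.

```python
import math


def simpson(f, L, U, N):
    """Weights 1/3, 4/3, 2/3, ..., 4/3, 1/3 on N equally spaced points (N odd)."""
    if N < 3 or N % 2 == 0:
        raise ValueError("N must be an odd integer of at least 3")
    delta = (U - L) / (N - 1)
    total = f(L) + f(U)
    for j in range(1, N - 1):
        total += (4.0 if j % 2 == 1 else 2.0) * f(L + j * delta)
    return delta * total / 3.0


# Three-point Gauss-Hermite rule for W(x) = exp(-x^2): (abscissa, weight) pairs.
GH3 = ((-math.sqrt(1.5), math.sqrt(math.pi) / 6.0),
       (0.0, 2.0 * math.sqrt(math.pi) / 3.0),
       (math.sqrt(1.5), math.sqrt(math.pi) / 6.0))


def gauss_hermite3(f):
    """Approximate the integral of exp(-x^2) f(x) over the real line."""
    return sum(w * f(a) for a, w in GH3)


def normal_pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


area = simpson(normal_pdf, -1.96, 1.96, 101)
# E[z^2] for z ~ N[0,1], after the change of variable z = sqrt(2) x.
second_moment = gauss_hermite3(lambda x: 2.0 * x * x) / math.sqrt(math.pi)
```

Simpson's rule needs 101 function evaluations to recover the familiar 0.95 area, whereas the three-point Gauss-Hermite rule returns the second moment of the standard normal exactly (up to rounding), since an $M$-point rule is exact for polynomials up to degree $2M - 1$.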
    E.5.5 MONTE CARLO INTEGRATION

    The quadrature methods have proved very useful in empirical research and are surprisingly accurate even for a small number of points There are integrals that defy treatment in this form however Recent work has brought many advances in techniques for evaluating complex integrals by using Monte Carlo methods rather than direct numerical approximations
    Example E 3 Fractional Moments of the Truncated Normal Distribution


    The following function appeared in Greene's (1990) study of the stochastic frontier model:
    $$h(r, \varepsilon) = \frac{\int_0^\infty z^r \, \frac{1}{\sigma}\phi\left[\frac{z - (-\varepsilon - \theta\sigma^2)}{\sigma}\right] dz}{\int_0^\infty \frac{1}{\sigma}\phi\left[\frac{z - (-\varepsilon - \theta\sigma^2)}{\sigma}\right] dz}.$$

    If we let $\mu = -(\varepsilon + \theta\sigma^2)$, we can show that the denominator is just $1 - \Phi(-\mu/\sigma)$, which is a value from the standard normal cdf. But the numerator is complex. It does not fit into a form that lends itself to Gauss-Laguerre integration because the exponential function involves both $z$ and $z^2$. An alternative form that has potential is obtained by making the change of variable to $w = (z - \mu)/\sigma$, which produces the right weighting function. But now the range of integration is not $-\infty$ to $+\infty$; it is $-\mu/\sigma$ to $+\infty$. There is another approach. Suppose that $z$ is a random variable with $N[\mu, \sigma^2]$ distribution. Then the density of the truncated normal (at zero) distribution for $z$ is
    $$f(z \mid z > 0) = \frac{f(z)}{\operatorname{Prob}[z > 0]} = \frac{\frac{1}{\sigma}\phi\left(\frac{z - \mu}{\sigma}\right)}{\int_0^\infty \frac{1}{\sigma}\phi\left(\frac{z - \mu}{\sigma}\right) dz}.$$

    This result is exactly the weighting function that appears in $h(r, \varepsilon)$, and the function being weighted is $z^r$. Therefore, $h(r, \varepsilon)$ is the expected value of $z^r$ given that $z$ is greater than zero. That is, $h(r, \varepsilon)$ is a possibly fractional moment (we do not restrict $r$ to integer values) from the truncated (at zero) normal distribution when the untruncated variable has mean $-(\varepsilon + \theta\sigma^2)$ and variance $\sigma^2$. Now that we have identified the function, how do we compute it? We have already concluded that the familiar quadrature methods will not suffice. And no one has previously derived closed forms for the fractional moments of the normal distribution, truncated or not. But if we can draw a random sample of observations from this truncated normal distribution, $\{z_i\}$, then the sample mean of $w_i = z_i^r$ will converge in probability (mean square) to its population counterpart. The remaining detail is to establish that this expectation is finite, which it is for the truncated normal distribution [see Amemiya (1973)]. Since we showed earlier how to draw observations from a truncated normal distribution, this remaining step is simple.
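The computation described in this example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the draws are obtained by inverting the normal cdf with the standard library's `statistics.NormalDist`, and the function names, the seed, the sample size, and the parameter values (mean 0.5, standard deviation 1, r = 0.5) are all hypothetical choices made for the sketch.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def draw_truncated_normal(mu, sigma, rng):
    """One draw from N(mu, sigma^2) truncated to z > 0, by inverting the cdf."""
    p_low = STD_NORMAL.cdf((0.0 - mu) / sigma)   # Prob[z <= 0] before truncation
    u = rng.random()                             # U[0, 1) draw
    p = p_low + u * (1.0 - p_low)                # uniform on the admissible part of the cdf
    p = min(max(p, 1e-15), 1.0 - 1e-15)          # keep inv_cdf away from 0 and 1
    z = mu + sigma * STD_NORMAL.inv_cdf(p)
    return max(z, 0.0)                           # guard against rounding just below zero


def fractional_moment(r, mu, sigma, n=20000, seed=12345):
    """Sample mean of z_i ** r over n truncated-normal draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += draw_truncated_normal(mu, sigma, rng) ** r
    return total / n


# Hypothetical parameter values, chosen only for illustration.
h_half = fractional_moment(0.5, mu=0.5, sigma=1.0)
```

For r = 1 the result can be checked against the known mean of the truncated normal distribution, which gives a simple way to validate the sampler before using it for fractional r.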

    The preceding is a fairly straightforward application of Monte Carlo integration. In certain cases, an integral can be approximated by computing the sample average of a set of function values. The approach taken here was to interpret the integral as an expected value. We then had to establish that the mean we were computing was finite. Our basic statistical result for the behavior of sample means implies that with a large enough sample, we can approximate the integral as closely as we like. The general approach is widely applicable in Bayesian econometrics and has begun to appear in classical statistics and econometrics as well.14 For direct application in a straightforward class of problems, we consider the general computation
    $$F(x) = \int_L^U f(x)\, g(x)\, dx,$$
    where $g(x)$ is a continuous function in the range $[L, U]$. (We could achieve greater generality by allowing more complicated functions, but for current purposes, we limit ourselves to straightforward cases.) Now, suppose that $g(x)$ is nonnegative in the entire range $[L, U]$. To normalize the weighting function, we suppose, as well, that
    $$K = \int_L^U g(x)\, dx$$
    is a known constant. Then $h(x) = g(x)/K$ is a probability density function in the range because it satisfies the axioms of probability.15 Let
    $$H(x) = \int_L^x h(t)\, dt.$$
    Then $H(L) = 0$, $H(U) = 1$, $H'(x) = h(x) \ge 0$, and so on. Then
    $$\int_L^U f(x)\, g(x)\, dx = K \int_L^U f(x)\, \frac{g(x)}{K}\, dx = K\, E_{h(x)}[f(x)],$$
    where we use the notation $E_{h(x)}[f(x)]$ to denote the expected value of the function $f(x)$ when $x$ is drawn from the population with probability density function $h(x)$. We assume that this expected value is a finite constant. This set of results defines the computation. We now assume that we are able to draw (pseudo)random samples from the population $h(x)$. Since $K$ is a known constant and the means of random samples are unbiased and consistent estimators of their population counterparts, the sample mean of the functions,
    $$\hat{F}(x) = K \times \frac{1}{n} \sum_{i=1}^{n} f(x_i^h),$$
    where $x_i^h$ is a random draw from $h(\cdot)$, is a consistent estimator of the integral. [The claim is based on the Corollary to Theorem D.4, as the integral is equal to the expected value of the function $f(x)$.] Suppose that the problem is well defined as above but that it is not possible to draw random samples from the population $h(\cdot)$. If there is another probability density function that resembles $h(\cdot)$, say, $I(x)$, then there may be an alternative strategy. We can rewrite our computation in the
    14. See Geweke (1986, 1988, 1989) for discussion and applications. A number of other references are given in Poirier (1995, p. 654).

    15. In many applications, K will already be part of the desired integral.

    form
    $$F(x) = K \int_L^U f(x)\, h(x)\, dx = K \int_L^U \left[\frac{f(x)\, h(x)}{I(x)}\right] I(x)\, dx.$$

    Then we can interpret our integral as the expected value of $f(x)h(x)/I(x)$ when the population has density $I(x)$. The new density, $I(x)$, is called an importance function. The same strategy works if certain fairly benign conditions are imposed on the importance function. [See Geweke (1989).] The range of variation is an important consideration, for example. If the range of $x$ is, say, $(-\infty, +\infty)$, and we choose an importance function that is nonzero only in the range $(0, +\infty)$, then our strategy is likely to come to some difficulties.
    Example E.4 Mean of a Lognormal Distribution

    Consider computing the mean of a lognormal distribution. If $x \sim N[0, 1]$, then $e^x$ is distributed as lognormal and has density
    $$g(x) = \frac{1}{x\sqrt{2\pi}}\, e^{-\frac{1}{2}(\ln x)^2}, \quad x > 0$$
    (see Section B.4.4). The expected value is $\int x\, g(x)\, dx$, so $f(x) = x$. Suppose that we did not know how to draw a random sample from this lognormal distribution (by just exponentiating our draws from the standard normal distribution) or that the true mean is $\exp(0.5) = 1.649$. Consider using a $\chi^2[1]$ as an importance function, instead. This chi-squared distribution is a gamma distribution with parameters $P = \lambda = \frac{1}{2}$ [see (3-39)], so
    $$I(x) = \frac{(1/2)^{1/2}}{\Gamma(1/2)}\, x^{-1/2} e^{-\frac{1}{2}x}, \quad x \ge 0.$$
    After a bit of manipulation, we find that
    $$\frac{f(x)\, g(x)}{I(x)} = q(x) = e^{\frac{1}{2}[x - (\ln x)^2]}\, x^{1/2}.$$
    Therefore, to estimate the mean of this lognormal distribution, we can draw a random sample of values $x_i$ from the $\chi^2[1]$ distribution, which we can do by squaring the draws in a sample from the standard normal distribution, then computing the average of the sample of values, $q(x_i)$. We carried out this experiment with 1000 draws from a standard normal distribution. The mean of our sample was 1.6974, compared with a true mean of 1.649, so the error was less than 3 percent.
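The experiment in this example can be sketched as follows. This is a minimal illustration, not a reproduction of the reported run: the function names and the seed are assumptions, so the number obtained will differ from the 1.6974 reported above.

```python
import math
import random


def q(x):
    """q(x) = f(x) g(x) / I(x) for the lognormal mean with a chi-squared[1] importance function."""
    return math.sqrt(x) * math.exp(0.5 * (x - math.log(x) ** 2))


def lognormal_mean_by_importance_sampling(n=1000, seed=2002):
    """Average q(x_i) over n chi-squared[1] draws, each obtained by squaring a standard normal draw."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0) ** 2
        if x > 0.0:            # q(x) tends to 0 as x tends to 0; skip an exact zero to avoid log(0)
            total += q(x)
    return total / n


estimate = lognormal_mean_by_importance_sampling()
```

One point worth noting when experimenting with this sketch: q(x) grows without bound as x increases, because the chi-squared tail is thinner than the lognormal tail, so individual large draws can move the average noticeably. This is an instance of the caution in the text that the importance function must be chosen with some care.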
    E.5.6 MULTIVARIATE NORMAL PROBABILITIES AND SIMULATED MOMENTS

    The computation of bivariate normal probabilities requires a large amount of computing effort. Quadrature methods have been developed for trivariate probabilities as well, but the amount of computing effort needed at this level is enormous. For integrals of level greater than three, satisfactory (in terms of speed and accuracy) direct approximations remain to be developed. Our work thus far does suggest an alternative approach. Suppose that $\mathbf{x}$ has a K-variate normal distribution with mean vector $\mathbf{0}$ and covariance matrix $\Sigma$. (No generality is sacrificed by the assumption of a zero mean, since we could just subtract a nonzero mean from the random vector wherever it appears in any result.) We wish to compute the K-variate probability, $\operatorname{Prob}[a_1 < x_1 < b_1, a_2 < x_2 < b_2, \ldots, a_K < x_K < b_K]$. Our Monte Carlo integration technique is well suited for this well-defined problem. As a first approach, consider sampling R observations, $\mathbf{x}_r$, $r = 1, \ldots, R$,

    from this multivariate normal distribution, using the method described in Section E.2. Now, define
    $$d_r = \mathbf{1}[a_1 < x_{r1} < b_1,\ a_2 < x_{r2} < b_2,\ \ldots,\ a_K < x_{rK} < b_K].$$
    (That is, $d_r = 1$ if the condition is true and 0 otherwise.) Based on our earlier results, it follows that
    $$\operatorname{plim} \bar{d} = \operatorname{plim} \frac{1}{R} \sum_{r=1}^{R} d_r = \operatorname{Prob}[a_1 < x_1 < b_1,\ a_2 < x_2 < b_2,\ \ldots,\ a_K < x_K < b_K].$$ 16

    This method is valid in principle, but in practice it has proved to be unsatisfactory for several reasons. For large-order problems, it requires an enormous number of draws from the distribution to give reasonable accuracy. Also, even with large numbers of draws, it appears to be problematic when the desired tail area is very small. Nonetheless, the idea is sound, and recent research has built on this idea to produce some quite accurate and efficient simulation methods for this computation. A survey of the methods is given in McFadden and Ruud (1994).17 Among the simulation methods examined in the survey, the GHK smooth recursive simulator appears to be the most accurate.18 The method is surprisingly simple. The general approach uses
    $$\operatorname{Prob}[a_1 < x_1 < b_1,\ a_2 < x_2 < b_2,\ \ldots,\ a_K < x_K < b_K] \approx \frac{1}{R} \sum_{r=1}^{R} \prod_{k=1}^{K} Q_{rk},$$
    where $Q_{rk}$ are easily computed univariate probabilities. The probabilities $Q_{rk}$ are computed according to the following recursion: We first factor $\Sigma$ using the Cholesky factorization $\Sigma = LL'$, where $L$ is a lower triangular matrix (see Section 2.7.11). The elements of $L$ are $l_{km}$, where $l_{km} = 0$ if $m > k$. Then we begin the recursion with
    $$Q_{r1} = \Phi(b_1 / l_{11}) - \Phi(a_1 / l_{11}).$$

    Note that $l_{11} = \sqrt{\sigma_{11}}$, the standard deviation of $x_1$, so this is just the marginal probability, $\operatorname{Prob}[a_1 < x_1 < b_1]$. Now, generate a random observation $\varepsilon_{r1}$ from the truncated standard normal distribution in the range
    $$A_{r1} \text{ to } B_{r1} = a_1 / l_{11} \text{ to } b_1 / l_{11}.$$
    (Note, again, that the range is standardized by $l_{11}$.) The draw can be obtained from a U[0, 1] observation using (5-1). For steps $k = 2, \ldots, K$, compute
    $$A_{rk} = \left[a_k - \sum_{m=1}^{k-1} l_{km}\, \varepsilon_{rm}\right] \Big/ l_{kk},$$
    $$B_{rk} = \left[b_k - \sum_{m=1}^{k-1} l_{km}\, \varepsilon_{rm}\right] \Big/ l_{kk}.$$
    Then
    $$Q_{rk} = \Phi(B_{rk}) - \Phi(A_{rk}),$$

    16. This method was suggested by Lerman and Manski (1981).

    17. A symposium on the topic of simulation methods appears in Review of Economics and Statistics, Vol. 76, November 1994. See, especially, McFadden and Ruud (1994), Stern (1994), Geweke, Keane, and Runkle (1994), and Breslaw (1994). See, as well, Gourieroux and Monfort (1996).

    18. See Geweke (1989), Hajivassiliou (1990), and Keane (1994). Details on the properties of the simulator are given in Borsch-Supan and Hajivassiliou (1990).

    and, in preparation for the next step in the recursion, we generate a random draw from the truncated standard normal distribution in the range $A_{rk}$ to $B_{rk}$. This process is replicated R times, and the estimated probability is the sample average of the simulated probabilities. The GHK simulator has been found to be impressively fast and accurate for fairly moderate numbers of replications. Its main usage has been in computing functions and derivatives for maximum likelihood estimation of models that involve multivariate normal integrals. We will revisit this in the context of the method of simulated moments when we examine the probit model in Chapter 21.
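The GHK recursion described above can be sketched compactly in Python. The following is a minimal illustration under stated assumptions, not a production implementation: it uses NumPy's `numpy.linalg.cholesky` for the factorization and the standard library's `statistics.NormalDist` for the normal cdf and its inverse; the function name `ghk_probability`, the number of replications, the seed, and the example covariance matrix are hypothetical choices.

```python
import random
from statistics import NormalDist

import numpy as np

STD_NORMAL = NormalDist()


def ghk_probability(a, b, cov, R=2000, seed=7):
    """Simulate Prob[a_k < x_k < b_k for all k] for x ~ N(0, cov) by the GHK recursion."""
    rng = random.Random(seed)
    L = np.linalg.cholesky(np.asarray(cov, dtype=float))   # cov = L L', L lower triangular
    K = len(a)
    total = 0.0
    for _ in range(R):
        eps = [0.0] * K
        prob = 1.0
        for k in range(K):
            shift = sum(L[k, m] * eps[m] for m in range(k))
            A = (a[k] - shift) / L[k, k]
            B = (b[k] - shift) / L[k, k]
            pA, pB = STD_NORMAL.cdf(A), STD_NORMAL.cdf(B)
            prob *= pB - pA                                  # Q_rk
            if k < K - 1:
                # Truncated standard normal draw in (A, B) from a U[0, 1) draw.
                p = pA + rng.random() * (pB - pA)
                p = min(max(p, 1e-15), 1.0 - 1e-15)
                eps[k] = STD_NORMAL.inv_cdf(p)
        total += prob
    return total / R


# Illustrative case: Prob[x1 < 0, x2 < 0] with unit variances and correlation 0.5.
neg_inf = float("-inf")
p_orthant = ghk_probability([neg_inf, neg_inf], [0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]])
```

For the bivariate orthant probability used here, a closed form (1/4 plus arcsin of the correlation over 2 pi) is available, which makes it a convenient check on the simulator.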
    E.5.7 COMPUTING DERIVATIVES

    For certain functions, the programming of derivatives may be quite difficult. Numeric approximations can be used, although it should be borne in mind that analytic derivatives obtained by formally differentiating the functions involved are to be preferred. First derivatives can be approximated by using
    $$\frac{\partial F(\theta)}{\partial \theta_i} \approx \frac{F(\cdots \theta_i + \varepsilon \cdots) - F(\cdots \theta_i - \varepsilon \cdots)}{2\varepsilon}.$$
    The choice of $\varepsilon$ is a remaining problem. Extensive discussion may be found in Quandt (1983).

    There are three drawbacks to this means of computing derivatives compared with using the analytic derivatives. A possible major consideration is that it may substantially increase the amount of computation needed to obtain a function and its gradient. In particular, $K + 1$ function evaluations (the criterion and K derivatives) are replaced with $2K + 1$ functions. The latter may be more burdensome than the former, depending on the complexity of the partial derivatives compared with the function itself. The comparison will depend on the application. But in most settings, careful programming that avoids superfluous or redundant calculation can make the advantage of the analytic derivatives substantial. Second, the choice of $\varepsilon$ can be problematic. If it is chosen too large, then the approximation will be inaccurate. If it is chosen too small, then there may be insufficient variation in the function to produce a good estimate of the derivative. A compromise that is likely to be effective is to compute $\varepsilon_i$ separately for each parameter, as in
    $$\varepsilon_i = \operatorname{Max}[\alpha|\theta_i|, \gamma]$$
    [see Goldfeld and Quandt (1971)]. The values $\alpha$ and $\gamma$ should be relatively small, such as $10^{-5}$. Third, although numeric derivatives computed in this fashion are likely to be reasonably accurate, in a sum of a large number of terms, say several thousand, enough approximation error can accumulate to cause the numerical derivatives to differ significantly from their analytic counterparts.

    Second derivatives can also be computed numerically. In addition to the preceding problems, however, it is generally not possible to ensure negative definiteness of a Hessian computed in this manner. Unless the choice of $\varepsilon$ is made extremely carefully, an indefinite matrix is a possibility. In general, the use of numeric derivatives should be avoided if the analytic derivatives are available.
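The central-difference approximation with a parameter-specific step can be sketched as follows. The function name `numerical_gradient` and the quadratic test function are assumptions made for illustration; the defaults for the two small constants follow the order of magnitude suggested in the text.

```python
def numerical_gradient(F, theta, alpha=1e-5, gamma=1e-5):
    """Central-difference gradient of F at theta, with step eps_i = max(alpha * |theta_i|, gamma)."""
    grad = []
    for i, t in enumerate(theta):
        eps = max(alpha * abs(t), gamma)
        up = list(theta)
        down = list(theta)
        up[i] = t + eps
        down[i] = t - eps
        # Two function evaluations per parameter: 2K in total for the gradient.
        grad.append((F(up) - F(down)) / (2.0 * eps))
    return grad


def example_function(theta):
    """Illustrative criterion (assumed for this sketch) with a known analytic gradient."""
    return -(theta[0] - 1.0) ** 2 - 2.0 * (theta[1] + 0.5) ** 2 + theta[0] * theta[1]


approx = numerical_gradient(example_function, [2.0, 3.0])
analytic = [-2.0 * (2.0 - 1.0) + 3.0, -4.0 * (3.0 + 0.5) + 2.0]
```

Comparing `approx` with `analytic` on a function whose derivatives are known is a common way to check hand-coded analytic derivatives before they are used in an optimization.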

    E.6 OPTIMIZATION

    Nonlinear optimization (e.g., maximizing log-likelihood functions) is an intriguing practical problem. Theory provides few hard and fast rules, and there are relatively few cases in which it is obvious how to proceed. This section introduces some of the terminology and underlying theory

    of nonlinear optimization.19 We begin with a general discussion on how to search for a solution to a nonlinear optimization problem and describe some specific commonly used methods. We then consider some practical problems that arise in optimization. An example is given in the final section.

    Consider maximizing the quadratic function
    $$F(\theta) = a + b'\theta - \tfrac{1}{2}\theta' C \theta, \tag{E-2}$$
    where $C$ is a positive definite matrix. The first-order condition for a maximum is
    $$\frac{\partial F(\theta)}{\partial \theta} = b - C\theta = 0. \tag{E-3}$$
    This set of linear equations has the unique solution
    $$\theta = C^{-1} b.$$

    This is a linear optimization problem. Note that it has a closed-form solution; for any $a$, $b$, and $C$, the solution can be computed directly.20 In the more typical situation,
    $$\frac{\partial F(\theta)}{\partial \theta} = 0 \tag{E-4}$$

    is a set of nonlinear equations that cannot be solved explicitly for $\theta$.21 The techniques considered in this section provide systematic means of searching for a solution. We now consider the general problem of maximizing a function of several variables:
    $$\text{maximize}_{\theta}\ F(\theta), \tag{E-5}$$

    where $F(\theta)$ may be a log-likelihood or some other function. Minimization of $F(\theta)$ is handled by maximizing $-F(\theta)$. Two special cases are
    $$F(\theta) = \sum_{i=1}^{n} f_i(\theta), \tag{E-6}$$

    which is typical for maximum likelihood problems, and the least squares problem,22
    $$f_i(\theta) = -(y_i - f(x_i, \theta))^2. \tag{E-7}$$

    We will treat the nonlinear least squares problem in detail in Chapter 9. An obvious way to search for the $\theta$ that maximizes $F(\theta)$ is by trial and error. If $\theta$ has only a single element and it is known approximately where the optimum will be found, then a grid search will be a feasible strategy. An example is a common time-series problem in which a one-dimensional search for a correlation coefficient is made in the interval $(-1, 1)$. The grid search can proceed in the obvious fashion, that is, $\ldots, -0.1, 0, 0.1, 0.2, \ldots$, then $\hat{\theta}_{\max} - 0.1$ to $\hat{\theta}_{\max} + 0.1$ in increments of 0.01, and so on, until the desired precision is achieved.23 If $\theta$ contains more than one parameter, then a
    19. There are numerous excellent references that offer a more complete exposition. Among these are Quandt (1983), Bazzara and Shetty (1979), and Fletcher (1980).

    20. Notice that the constant a is irrelevant to the solution. Many maximum likelihood problems are presented with the preface "neglecting an irrelevant constant." For example, the log-likelihood for the normal linear regression model contains a term $-(n/2)\ln(2\pi)$ that can be discarded.

    21. See, for example, the normal equations for the nonlinear least squares estimators of Chapter 9.

    22. Least squares is, of course, a minimizing problem. The negative of the criterion is used to maintain consistency with the general formulation.

    23. There are more efficient methods of carrying out a one-dimensional search, for example, the golden section method. See Press et al. (1986, Chap. 10).

    grid search is likely to be extremely costly, particularly if little is known about the parameter vector at the outset. Nonetheless, relatively efficient methods have been devised. Quandt (1983) and Fletcher (1980) contain further details. There are also systematic, derivative-free methods of searching for a function optimum that resemble in some respects the algorithms that we will examine in the next section. The downhill simplex (and other simplex) methods24 have been found to be very fast and effective for some problems. A recent entry in the econometric literature is the method of simulated annealing.25 These derivative-free methods, particularly the latter, are often very effective in problems with many variables in the objective function, but they usually require far more function evaluations than the methods based on derivatives that are considered below. Since the problems typically analyzed in econometrics involve relatively few parameters but often quite complex functions involving large numbers of terms in a summation, on balance, the gradient methods are usually going to be preferable.26
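The one-dimensional grid search with successive refinement described above can be sketched as follows. The function name, the number of grid points, the number of refinement rounds, and the criterion function (with its optimum placed at 0.37) are all hypothetical choices made for illustration.

```python
def refine_grid_search(F, lower, upper, points=21, rounds=4):
    """Maximize F on [lower, upper] by repeatedly searching a finer grid around the best point."""
    lo, hi = lower, upper
    best = lo
    for _ in range(rounds):
        step = (hi - lo) / (points - 1)
        grid = [lo + j * step for j in range(points)]
        best = max(grid, key=F)
        # Narrow the interval to the neighbors of the best grid point, staying inside the bounds.
        lo = max(lower, best - step)
        hi = min(upper, best + step)
    return best


def criterion(rho):
    """Illustrative single-peaked criterion in a correlation-like parameter (an assumption)."""
    return -(rho - 0.37) ** 2


rho_hat = refine_grid_search(criterion, -0.99, 0.99)
```

The refinement step relies on the criterion having a single peak within the bracketing interval; with several local optima, a coarse first grid can settle on the wrong region.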
    E.6.1 ALGORITHMS

    A more effective means of solving most nonlinear maximization problems is by an iterative algorithm: Beginning from initial value $\theta_0$, at entry to iteration t, if $\theta_t$ is not the optimal value for $\theta$, compute direction vector $\Delta_t$, step size $\lambda_t$, then
    $$\theta_{t+1} = \theta_t + \lambda_t \Delta_t. \tag{E-8}$$

    Figure E.2 illustrates the structure of an iteration for a hypothetical function of two variables. The direction vector $\Delta_t$ is shown in the figure with $\theta_t$. The dashed line is the set of points $\theta_t + \lambda_t\Delta_t$. Different values of $\lambda_t$ lead to different contours; for this $\theta_t$ and $\Delta_t$, the best value of $\lambda_t$ is about 0.5. Notice in Figure E.2 that for a given direction vector $\Delta_t$ and current parameter vector $\theta_t$, a secondary optimization is required to find the best $\lambda_t$. Translating from Figure E.2, we obtain the form of this problem as shown in Figure E.3. This subsidiary search is called a line search, as we search along the line $\theta_t + \lambda_t\Delta_t$ for the optimal value of $F(\cdot)$. The formal solution to the line search problem would be the $\lambda_t$ that satisfies
    $$\frac{\partial F(\theta_t + \lambda_t\Delta_t)}{\partial \lambda_t} = g(\theta_t + \lambda_t\Delta_t)'\Delta_t = 0, \tag{E-9}$$

    where $g$ is the vector of partial derivatives of $F(\cdot)$ evaluated at $\theta_t + \lambda_t\Delta_t$. In general, this problem will also be a nonlinear one. In most cases, adding a formal search for $\lambda_t$ will be too expensive, as well as unnecessary. Some approximate or ad hoc method will usually be chosen. It is worth emphasizing that finding the $\lambda_t$ that maximizes $F(\theta_t + \lambda_t\Delta_t)$ at a given iteration does not generally lead to the overall solution in that iteration. This situation is clear in Figure E.3, where the optimal value of $\lambda_t$ leads to $F(\cdot) = 2.0$, at which point we reenter the iteration.
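The structure of the iteration in (E-8), with an ad hoc rather than a formal line search, can be sketched as follows. Step halving is used here only as one simple example of an approximate method; it is an assumption of this sketch, as are the function names, the tolerances, and the quadratic test problem.

```python
import numpy as np


def iterate(F, gradient, direction, theta0, max_iter=200, tol=1e-6):
    """Generic iteration theta_{t+1} = theta_t + lambda_t * Delta_t with a step-halving line search."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = gradient(theta)
        if np.linalg.norm(g) < tol:          # first-order conditions satisfied to tolerance
            break
        delta = direction(theta, g)          # direction vector Delta_t
        lam = 1.0
        # Ad hoc line search: halve the step until the function increases.
        while F(theta + lam * delta) <= F(theta) and lam > 1e-12:
            lam *= 0.5
        if lam <= 1e-12:                     # no improving step found along this direction
            break
        theta = theta + lam * delta
    return theta


# Illustrative quadratic problem F = b'theta - 0.5 theta'C theta (values assumed for the sketch).
b = np.array([1.0, 2.0])
C = np.array([[2.0, 0.5], [0.5, 1.0]])
F = lambda th: b @ th - 0.5 * th @ C @ th
gradient = lambda th: b - C @ th
theta_hat = iterate(F, gradient, lambda th, g: g, [0.0, 0.0])
```

Here the direction is simply the gradient; other choices of direction can be supplied through the `direction` argument without changing the loop.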
    E.6.2 GRADIENT METHODS

    The most commonly used algorithms are gradient methods, in which
    $$\Delta_t = W_t g_t, \tag{E-10}$$

    24. See Nelder and Mead (1965) and Press et al. (1986).

    25. See Goffe, Ferrier, and Rodgers (1994) and Press et al. (1986, pp. 326-334).

    26. Goffe, Ferrier, and Rodgers (1994) did find that the method of simulated annealing was quite adept at finding the best among multiple solutions. This problem is frequent for derivative-based methods, because they usually have no method of distinguishing between a local optimum and a global one.

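    FIGURE E.2 Iteration. [Contour plot of F over the $(\theta_1, \theta_2)$ plane, with contour levels from 1.8 to 2.3, showing $\theta_t$ and the direction vector $\Delta_t$.]

    FIGURE E.3 Line Search. [Plot of $F(\theta_t + \lambda_t\Delta_t)$, on a vertical scale from 1.9 to 2.0, against $\lambda_t$ from 0 to 2.]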

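    where $W_t$ is a positive definite matrix and $g_t$ is the gradient of $F(\theta_t)$:
    $$g_t = g(\theta_t) = \frac{\partial F(\theta_t)}{\partial \theta_t}. \tag{E-11}$$

    These methods are motivated partly by the following. Consider a linear Taylor series approximation to $F(\theta_t + \lambda_t\Delta_t)$ around $\lambda_t = 0$:
    $$F(\theta_t + \lambda_t\Delta_t) \approx F(\theta_t) + \lambda_t\, g(\theta_t)'\Delta_t. \tag{E-12}$$

    Let $F(\theta_t + \lambda_t\Delta_t)$ equal $F_{t+1}$. Then
    $$F_{t+1} - F_t \approx \lambda_t\, g_t'\Delta_t.$$
    If $\Delta_t = W_t g_t$, then
    $$F_{t+1} - F_t \approx \lambda_t\, g_t' W_t g_t.$$

    If $g_t$ is not $\mathbf{0}$ and $\lambda_t$ is small enough, then $F_{t+1} - F_t$ must be positive. Thus, if $F(\theta)$ is not already at its maximum, then we can always find a step size such that a gradient-type iteration will lead to an increase in the function. (Recall that $W_t$ is assumed to be positive definite.) In the following, we will omit the iteration index t, except where it is necessary to distinguish one vector from another. The following are some commonly used algorithms.27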

    Steepest Ascent. The simplest algorithm to employ is the steepest ascent method, which uses
    $$W = I \quad \text{so that} \quad \Delta = g. \tag{E-13}$$

    As its name implies, the direction is the one of greatest increase of $F(\cdot)$. Another virtue is that the line search has a straightforward solution; at least near the maximum, the optimal $\lambda$ is
    $$\lambda = \frac{-g'g}{g'Hg}, \tag{E-14}$$
    where
    $$H = \frac{\partial^2 F(\theta)}{\partial\theta\, \partial\theta'}.$$
    Therefore, the steepest ascent iteration is
    $$\theta_{t+1} = \theta_t - \frac{g_t'g_t}{g_t'H_t g_t}\, g_t. \tag{E-15}$$

    Computation of the second derivatives matrix may be extremely burdensome. Also, if $H_t$ is not negative definite, which is likely if $\theta_t$ is far from the maximum, the iteration may diverge. A systematic line search can bypass this problem. This algorithm usually converges very slowly, however, so other techniques are usually used.
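The steepest ascent iteration in (E-15) can be sketched directly. The function name, the stopping rule, and the quadratic test problem are assumptions made for illustration; for a quadratic function the step in (E-14) is the exact line-search solution at every iteration.

```python
import numpy as np


def steepest_ascent(gradient, hessian, theta0, max_iter=500, tol=1e-10):
    """Steepest ascent with the step size lambda = -g'g / (g'Hg) from (E-14)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = gradient(theta)
        if np.linalg.norm(g) < tol:
            break
        H = hessian(theta)
        lam = -(g @ g) / (g @ H @ g)    # positive when H is negative definite
        theta = theta + lam * g         # W = I, so the direction is the gradient itself
    return theta


# Illustrative quadratic problem (values assumed): F = b'theta - 0.5 theta'C theta.
b = np.array([1.0, 2.0])
C = np.array([[2.0, 0.5], [0.5, 1.0]])
theta_hat = steepest_ascent(lambda th: b - C @ th, lambda th: -C, [0.0, 0.0])
```

Even on this small, well-conditioned problem the method takes a few dozen iterations to drive the gradient to zero, which illustrates the slow convergence noted in the text.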

    Newton's Method. The template for most gradient methods in common use is Newton's method. The basis for Newton's method is a linear Taylor series approximation. Expanding the first-order conditions, $\partial F(\theta)/\partial\theta = 0$,
    27. A more extensive catalog may be found in Judge et al. (1985, Appendix B). Those mentioned here are some of the more commonly used ones and are chosen primarily because they illustrate many of the important aspects of nonlinear optimization.

    equation by equation, in a linear Taylor series around an arbitrary $\theta^0$ yields
    $$\frac{\partial F(\theta)}{\partial \theta} \approx g^0 + H^0(\theta - \theta^0) = 0, \tag{E-16}$$

    where the superscript indicates that the term is evaluated at $\theta^0$. Solving for $\theta$ and then equating $\theta$ to $\theta_{t+1}$ and $\theta^0$ to $\theta_t$, we obtain the iteration
    $$\theta_{t+1} = \theta_t - H_t^{-1} g_t. \tag{E-17}$$
    Thus, for Newton's method,
    $$W = -H^{-1}, \quad \Delta = -H^{-1}g, \quad \lambda = 1. \tag{E-18}$$

    Newton's method will converge very rapidly in many problems. If the function is quadratic, then this method will reach the optimum in one iteration from any starting point. If the criterion function is globally concave, as it is in a number of problems that we shall examine in this text, then it is probably the best algorithm available. This method is very well suited to maximum likelihood estimation.
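A minimal sketch of the Newton iteration in (E-17) follows. The function name, the stopping rule, and both test problems (a quadratic and a simple globally concave function) are assumptions chosen to illustrate the two claims in the paragraph above.

```python
import numpy as np


def newton(gradient, hessian, theta0, max_iter=50, tol=1e-10):
    """Newton's method: theta_{t+1} = theta_t - H_t^{-1} g_t. Returns (theta, iterations used)."""
    theta = np.asarray(theta0, dtype=float)
    for it in range(max_iter):
        g = gradient(theta)
        if np.linalg.norm(g) < tol:
            return theta, it
        # Solve H * step = g rather than forming the inverse explicitly.
        theta = theta - np.linalg.solve(hessian(theta), g)
    return theta, max_iter


# Quadratic case (values assumed): one iteration suffices from any starting point.
b = np.array([1.0, 2.0])
C = np.array([[2.0, 0.5], [0.5, 1.0]])
theta_quad, iters_quad = newton(lambda th: b - C @ th, lambda th: -C, [5.0, -3.0])

# Globally concave, non-quadratic case: F = sum_j (c_j * theta_j - exp(theta_j)), maximized at log(c).
c = np.array([2.0, 0.5])
theta_conc, iters_conc = newton(lambda th: c - np.exp(th),
                                lambda th: -np.diag(np.exp(th)),
                                [0.0, 0.0])
```

In the quadratic case the returned iteration count is 1; in the concave case convergence takes a handful of iterations from this starting point.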

    Alternatives to Newton's Method. Newton's method is very effective in some settings, but it can perform very poorly in others. If the function is not approximately quadratic or if the current estimate is very far from the maximum, then it can cause wide swings in the estimates and even fail to converge at all. A number of algorithms have been devised to improve upon Newton's method. An obvious one is to include a line search at each iteration rather than use $\lambda = 1$. Two problems remain, however. At points distant from the optimum, the second derivatives matrix may not be negative definite, and, in any event, the computational burden of computing $H$ may be excessive. The quadratic hill-climbing method proposed by Goldfeld, Quandt, and Trotter (1966) deals directly with the first of these problems. In any iteration, if $H$ is not negative definite, then it is replaced with
    $$H_\alpha = H - \alpha I, \tag{E-19}$$

    where $\alpha$ is a positive number chosen large enough to ensure the negative definiteness of $H_\alpha$. Another suggestion is that of Greenstadt (1967), which uses, at every iteration,
    $$H_\pi = -\sum_{i=1}^{n} |\pi_i|\, c_i c_i', \tag{E-20}$$

    where $\pi_i$ is the ith characteristic root of $H$ and $c_i$ is its associated characteristic vector. Other proposals have been made to ensure the negative definiteness of the required matrix at each iteration.28 The computational complexity of these methods remains a problem, however.
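The two replacements for an indefinite Hessian, (E-19) and (E-20), can be sketched with NumPy's symmetric eigenvalue routines `numpy.linalg.eigvalsh` and `numpy.linalg.eigh`. The rule used below for choosing the shift (largest characteristic root plus a small margin) is an assumption of this sketch, not the rule of Goldfeld, Quandt, and Trotter; the function names and the example matrix are likewise hypothetical.

```python
import numpy as np


def shifted_hessian(H, margin=1e-3):
    """H - alpha * I as in (E-19), with alpha just large enough to make the result negative definite."""
    largest_root = np.linalg.eigvalsh(H).max()
    alpha = max(0.0, largest_root + margin)       # assumed rule for choosing alpha
    return H - alpha * np.eye(H.shape[0])


def greenstadt_hessian(H):
    """-sum_i |pi_i| c_i c_i' as in (E-20), from the characteristic roots and vectors of H."""
    roots, vectors = np.linalg.eigh(H)            # columns of `vectors` are the c_i
    return -(vectors * np.abs(roots)) @ vectors.T


# Illustrative indefinite matrix (one positive and one negative characteristic root).
H = np.array([[1.0, 2.0], [2.0, -3.0]])
H_alpha = shifted_hessian(H)
H_pi = greenstadt_hessian(H)
```

Both constructions require the characteristic roots of H, which is one source of the computational burden mentioned in the text.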

    Quasi-Newton Methods: Davidon-Fletcher-Powell. A very effective class of algorithms has been developed that eliminates second derivatives altogether and has excellent convergence properties, even for ill-behaved problems. These are the quasi-Newton methods, which form
    $$W_{t+1} = W_t + E_t,$$
    where $E_t$ is a positive definite matrix.29 As long as $W_0$ is positive definite ($I$ is commonly used), $W_t$ will be positive definite at every iteration. In the Davidon-Fletcher-Powell (DFP) method,
    28. See, for example, Goldfeld and Quandt (1971).

    29. See Fletcher (1980).

    after a sufficient number of iterations, $W_{t+1}$ will be an approximation to $-H^{-1}$. Let
    $$\delta_t = \lambda_t\Delta_t \quad \text{and} \quad \gamma_t = g(\theta_{t+1}) - g(\theta_t). \tag{E-21}$$

    The DFP variable metric algorithm uses
    $$W_{t+1} = W_t + \frac{\delta_t\delta_t'}{\delta_t'\gamma_t} + \frac{W_t\gamma_t\gamma_t'W_t}{\gamma_t'W_t\gamma_t}. \tag{E-22}$$

    Notice that in the DFP algorithm, the change in the first derivative vector is used in $W$; an estimate of the inverse of the second derivatives matrix is being accumulated. The variable metric algorithms are those that update $W$ at each iteration while preserving its definiteness. For the DFP method, the accumulation of $W_{t+1}$ is of the form
    $$W_{t+1} = W_t + aa' + bb' = W_t + [a \;\; b][a \;\; b]'.$$

    The two-column matrix $[a \;\; b]$ will have rank two; hence, DFP is called a rank two update or rank two correction. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is a rank three correction that subtracts $v\,dd'$ from the DFP update, where $v = (\gamma_t'W_t\gamma_t)$ and
    $$d_t = \left(\frac{1}{\delta_t'\gamma_t}\right)\delta_t - \left(\frac{1}{\gamma_t'W_t\gamma_t}\right)W_t\gamma_t.$$

    There is some evidence that this method is more efficient than DFP. Other methods, such as Broyden's method, involve a rank one correction instead. Any method that is of the form
    $$W_{t+1} = W_t + QQ'$$
    will preserve the definiteness of $W$, regardless of the number of columns in $Q$. The DFP and BFGS algorithms are extremely effective and are among the most widely used of the gradient methods. An important practical consideration to keep in mind is that although $W_t$ accumulates an estimate of the negative inverse of the second derivatives matrix for both algorithms, in maximum likelihood problems it rarely converges to a very good estimate of the covariance matrix of the estimator and should generally not be used as one.
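A variable metric iteration of the DFP type can be sketched as follows. To keep the signs unambiguous, this sketch writes the update in the form usually stated for minimization and applies it to the negative of the criterion, using y = -(g_{t+1} - g_t), so that the accumulated matrix W stays positive definite while F is being maximized. That convention, the step-halving line search, the function names, and the test problem are all assumptions of the sketch, and the signs may therefore look different from the display above.

```python
import numpy as np


def dfp_maximize(F, gradient, theta0, max_iter=200, tol=1e-6):
    """Maximize F with a DFP-type variable metric update; returns (theta, W)."""
    theta = np.asarray(theta0, dtype=float)
    W = np.eye(theta.size)                     # W_0 = I
    g = gradient(theta)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        direction = W @ g                      # gradient-method direction, as in (E-10)
        lam = 1.0
        while F(theta + lam * direction) <= F(theta) and lam > 1e-12:
            lam *= 0.5                         # ad hoc step-halving line search
        if lam <= 1e-12:
            break
        step = lam * direction                 # delta_t
        g_new = gradient(theta + step)
        y = -(g_new - g)                       # minus gamma_t; step @ y > 0 for concave F
        if step @ y > 1e-14:
            Wy = W @ y
            W = W + np.outer(step, step) / (step @ y) - np.outer(Wy, Wy) / (y @ Wy)
        theta, g = theta + step, g_new
    return theta, W


# Illustrative quadratic problem (values assumed): F = b'theta - 0.5 theta'C theta.
b = np.array([1.0, 2.0])
C = np.array([[2.0, 0.5], [0.5, 1.0]])
theta_hat, W_final = dfp_maximize(lambda th: b @ th - 0.5 * th @ C @ th,
                                  lambda th: b - C @ th,
                                  [0.0, 0.0])
```

Only first derivatives are evaluated; the matrix W is built up from successive changes in the parameter vector and the gradient. Consistent with the caution in the text, `W_final` should not be relied on as a covariance matrix estimate.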
    E.6.3 ASPECTS OF MAXIMUM LIKELIHOOD ESTIMATION

    Newton's method is often used for maximum likelihood problems. For solving a maximum likelihood problem, the method of scoring replaces $H$ with
    $$\bar{H} = E[H(\theta)], \tag{E-23}$$

    the negative inverse of which will be recognized as the asymptotic covariance matrix of the maximum likelihood estimator. There is some evidence that where it can be used, this method performs better than Newton's method. The exact form of the expectation of the Hessian of the log-likelihood is rarely known, however.30 Newton's method, which uses actual instead of expected second derivatives, is generally used instead.

    One-Step Estimation. A convenient variant of Newton's method is the one-step maximum likelihood estimator. It has been shown that if $\theta^0$ is any consistent initial estimator of $\theta$ and $H^*$ is $H$, $\bar{H}$, or any other asymptotically equivalent estimator of $\operatorname{Var}[g(\hat{\theta}_{MLE})]$, then
    $$\theta^1 = \theta^0 - (H^*)^{-1} g^0 \tag{E-24}$$

    30. Amemiya (1981) provides a number of examples.

    is an estimator of $\theta$ that has the same asymptotic properties as the maximum likelihood estimator.31 (Note that it is not the maximum likelihood estimator. As such, for example, it should not be used as the basis for likelihood ratio tests.)

    Covariance Matrix Estimation. In computing maximum likelihood estimators, a commonly used method of estimating $H$ simultaneously simplifies the calculation of $W$ and solves the occasional problem of indefiniteness of the Hessian. The method of Berndt et al. (1974) replaces $W$ with
    $$\hat{W} = \left[\sum_{i=1}^{n} g_i g_i'\right]^{-1} = (G'G)^{-1}, \tag{E-25}$$

    where
    $$g_i = \frac{\partial \ln f(y_i \mid x_i, \theta)}{\partial \theta}. \tag{E-26}$$

    Then, $G$ is the $n \times K$ matrix with ith row equal to $g_i'$. Although $\hat{W}$ and other suggested estimators of $(-H)^{-1}$ are asymptotically equivalent, $\hat{W}$ has the additional virtues that it is always nonnegative definite, and it is only necessary to differentiate the log-likelihood once to compute it.

    The Lagrange Multiplier Statistic. The use of $\hat{W}$ as an estimator of $(-H)^{-1}$ brings another intriguing convenience in maximum likelihood estimation. When testing restrictions on parameters estimated by maximum likelihood, one approach is to use the Lagrange multiplier statistic. We will examine this test at length at various points in this book, so we need only sketch it briefly here. The logic of the LM test is as follows. The gradient $g(\theta)$ of the log-likelihood function equals $\mathbf{0}$ at the unrestricted maximum likelihood estimators (that is, at least to within the precision of the computer program in use). If $\hat{\theta}_r$ is an MLE that is computed subject to some restrictions on $\theta$, then we know that $g(\hat{\theta}_r) \ne \mathbf{0}$. The LM test is used to test whether, at $\hat{\theta}_r$, $g_r$ is significantly different from $\mathbf{0}$ or whether the deviation of $g_r$ from $\mathbf{0}$ can be viewed as sampling variation. The covariance matrix of the gradient of the log-likelihood is $-H$, so the Wald statistic for testing this hypothesis is $W = g'(-H)^{-1}g$. Now, suppose that we use $\hat{W}$ to estimate $(-H)^{-1}$. Let $G$ be the $n \times K$ matrix with ith row equal to $g_i'$, and let $\mathbf{i}$ denote an $n \times 1$ column of ones. Then the LM statistic can be computed as
    $$LM = \mathbf{i}'G(G'G)^{-1}G'\mathbf{i}.$$
    Since $\mathbf{i}'\mathbf{i} = n$,
    $$LM = n\left[\mathbf{i}'G(G'G)^{-1}G'\mathbf{i}/n\right] = nR_i^2,$$
    where $R_i^2$ is the uncentered $R^2$ in a regression of a column of ones on the derivatives of the log-likelihood function.
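The two computations just described, the outer-product estimator in (E-25) and the LM statistic as n times an uncentered R-squared, can be sketched with NumPy. The setting used to generate the scores (a normal sample with unit variance, testing that the mean is zero, so that each score at the restricted estimate is simply the observation itself) and the data values are assumptions made for illustration.

```python
import numpy as np


def bhhh_weight(G):
    """(G'G)^{-1} as in (E-25); G is n x K with ith row equal to the ith observation's score."""
    return np.linalg.inv(G.T @ G)


def lm_statistic(G):
    """LM = i'G (G'G)^{-1} G'i, with the scores evaluated at the restricted estimates."""
    ones = np.ones(G.shape[0])
    score_sum = G.T @ ones
    return float(score_sum @ np.linalg.solve(G.T @ G, score_sum))


def n_times_uncentered_r2(G):
    """n times the uncentered R-squared from regressing a column of ones on the columns of G."""
    ones = np.ones(G.shape[0])
    coef, *_ = np.linalg.lstsq(G, ones, rcond=None)
    fitted = G @ coef
    return float(G.shape[0] * (fitted @ fitted) / (ones @ ones))


# Hypothetical data: y_i ~ N(mu, 1), restriction mu = 0, so the score for observation i is y_i.
y = np.array([0.3, -0.1, 0.8, 1.2, 0.4, -0.5, 0.9, 0.6])
G = y.reshape(-1, 1)
lm = lm_statistic(G)
lm_from_regression = n_times_uncentered_r2(G)
```

The two routes give the same number, which is the point of the identity in the text: the statistic can be obtained from an auxiliary regression without any second derivatives.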

    The Concentrated Log-Likelihood. Many problems in maximum likelihood estimation can be formulated in terms of a partitioning of the parameter vector $\theta = [\theta_1, \theta_2]$ such that at the solution to the optimization problem, $\hat{\theta}_{2,ML}$ can be written as an explicit function of $\hat{\theta}_{1,ML}$. When the solution to the likelihood equation for $\theta_2$ produces
    $$\hat{\theta}_{2,ML} = t(\hat{\theta}_{1,ML}),$$
    then, if it is convenient, we may "concentrate" the log-likelihood function by writing
    $$F^*(\theta_1, \theta_2) = F[\theta_1, t(\theta_1)] = F_c(\theta_1).$$
    31. See, for example, Rao (1973).

    The unrestricted solution to the problem $\operatorname{Max}_{\theta_1} F_c(\theta_1)$ provides the full solution to the optimization problem. Once the optimizing value of $\theta_1$ is obtained, the optimizing value of $\theta_2$ is simply $t(\hat{\theta}_{1,ML})$. Note that $F^*(\theta_1, \theta_2)$ is a subset of the set of values of the log-likelihood function, namely those values at which the second parameter vector satisfies the first-order conditions.32
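As a minimal sketch of concentrating a log-likelihood, consider a normal sample with unknown mean and variance, where the likelihood equation for the variance gives it explicitly as the mean squared deviation around the mean parameter. The choice of this model, the data values, the function names, and the simple interval-shrinking search used for the one-dimensional maximization are all assumptions made for illustration.

```python
import math


def sigma2_given_mu(y, mu):
    """t(theta_1): the variance that solves its likelihood equation for a given mean."""
    return sum((v - mu) ** 2 for v in y) / len(y)


def concentrated_loglik(y, mu):
    """F_c(mu): the normal log-likelihood with the variance replaced by sigma2_given_mu(y, mu)."""
    n = len(y)
    return -0.5 * n * (math.log(2.0 * math.pi) + math.log(sigma2_given_mu(y, mu)) + 1.0)


def maximize_1d(F, lower, upper, tol=1e-8):
    """Shrink [lower, upper] around the maximum of a single-peaked function F."""
    while upper - lower > tol:
        third = (upper - lower) / 3.0
        m1, m2 = lower + third, upper - third
        if F(m1) < F(m2):
            lower = m1
        else:
            upper = m2
    return 0.5 * (lower + upper)


# Hypothetical sample.
y = [1.2, 0.7, 2.1, 1.5, 0.3, 1.9, 1.1, 1.6]
mu_hat = maximize_1d(lambda m: concentrated_loglik(y, m), min(y), max(y))
sigma2_hat = sigma2_given_mu(y, mu_hat)   # recover the second parameter from t(.)
```

The search is over one parameter only; the second parameter is recovered afterward from the explicit function, exactly as described in the paragraph above.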

    E.6.4 OPTIMIZATION WITH CONSTRAINTS

    Occasionally, some of or all the parameters of a model are constrained, for example, to be positive in the case of a variance or to be in a certain range, such as a correlation coefficient. Optimization subject to constraints is often yet another art form. The elaborate literature on the general problem provides some guidance [see, for example, Appendix B in Judge et al. (1985)], but applications still, as often as not, require some creativity on the part of the analyst. In this section, we will examine a few of the most common forms of constrained optimization as they arise in econometrics.

    Parametric constraints typically come in two forms, which may occur simultaneously in a problem. Equality constraints can be written $c(\theta) = 0$, where $c_j(\theta)$ is a continuous and differentiable function. Typical applications include linear constraints on slope vectors, such as a requirement that a set of elasticities in a log-linear model add to one; exclusion restrictions, which are often cast in the form of interesting hypotheses about whether or not a variable should appear in a model (i.e., whether a coefficient is zero or not); and equality restrictions, such as the symmetry restrictions in a translog model, which require that parameters in two different equations be equal to each other. Inequality constraints, in general, will be of the form $a_j < c_j(\theta) < b_j$, where $a_j$ and $b_j$ are known constants (either of which may be infinite). Once again, the typical application in econometrics involves a restriction on a single parameter, such as $\sigma > 0$ for a variance parameter, $-1 \le \rho \le 1$ for a correlation coefficient, or $\beta_j \ge 0$ for a particular slope coefficient in a model. We will consider the two cases separately.

    In the case of equality constraints, for practical purposes of optimization, there are usually two strategies available. One can use a Lagrangean multiplier approach. The new optimization problem is
    $$\operatorname{Max}_{\theta, \lambda}\ L(\theta, \lambda) = F(\theta) + \lambda' c(\theta).$$
    The necessary conditions for an optimum are
    $$\frac{\partial L(\theta, \lambda)}{\partial \theta} = g(\theta) + C(\theta)'\lambda = 0,$$
    $$\frac{\partial L(\theta, \lambda)}{\partial \lambda} = c(\theta) = 0,$$
    where $g(\theta)$ is the familiar gradient of $F(\theta)$ and $C(\theta)$ is a $J \times K$ matrix of derivatives with jth row equal to $\partial c_j / \partial \theta'$. The joint solution will provide the constrained optimizer, as well as the Lagrange multipliers, which are often interesting in their own right. The disadvantage of this approach is that it increases the dimensionality of the optimization problem. An alternative strategy is to eliminate some of the parameters by either imposing the constraints directly on the function or by solving out the constraints. For exclusion restrictions, which are usually of the form $\theta_j = 0$, this step usually means dropping a variable from a model. Other restrictions can often be imposed

    32 A

    formal proof that this is a valid way to proceed is given by Amemiya 1985 pp 125 127

    Greene 50240

    book

    June 28 2002

    14 40

    942

    APPENDIX E Computation and Optimization

    just by building them into the model For example in a function of 1 2 and 3 if the restriction is of the form 3 1 2 then 3 can be eliminated from the model by a direct substitution Inequality constraints are more dif cult For the general case one suggestion is to transform the constrained problem into an unconstrained one by imposing some sort of penalty function into the optimization criterion that will cause a parameter vector that violates the constraints or nearly does so to be an unattractive choice For example to force a parameter j to be nonzero one might maximize the augmented function F 1 j This approach is feasible but it has the disadvantage that because the penalty is a function of the parameters different penalty functions will lead to different solutions of the optimization problem For the most common problems in econometrics a simpler approach will usually suf ce One can often reparameterize a function so that the new parameter is unconstrained For example the method of squaring is sometimes used to force a parameter to be positive If we require j to be positive then we can de ne j 2 and substitute 2 for j wherever it appears in the model Then an unconstrained solution for is obtained An alternative reparameterization for a parameter that must be positive that is often used is j exp To force a parameter to be between zero and one we can use the function j 1 1 exp The range of is now unrestricted Experience suggests that a third less orthodox approach works very well for many problems When the constrained optimization is begun there is a starting value 0 that begins the iterations Presumably 0 obeys the restrictions If not and none can be found then the optimization process must be terminated immediately The next iterate 1 is a step away from 0 by 1 0 0 0 Suppose that 1 violates the constraints By construction we know that there is some value 1 between 0 and 1 that does not violate the constraint where between means only that a shorter step is 
taken Therefore the next value for the iteration can be 1 The logic is true at every iteration so a way to proceed is to alter the iteration so that the step length is shortened when necessary when a parameter violates the constraints
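    The reparameterization and step-shortening devices described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a general-purpose routine: the criterion $\ln\theta - 0.1\theta^2$ is borrowed from the one-parameter example in Section E.6.6, and the names `maximize_reparameterized` and `shortened_step`, the halving factor, and the feasibility test are all hypothetical choices made for the example.

```python
import math

def maximize_reparameterized(alpha, tol=1e-10, max_iter=100):
    """Maximize ln(theta) - 0.1*theta**2 subject to theta > 0 by writing
    theta = exp(alpha) and applying Newton's method to the unconstrained alpha."""
    for _ in range(max_iter):
        e2 = math.exp(2.0 * alpha)
        grad = 1.0 - 0.2 * e2   # derivative of alpha - 0.1*exp(2*alpha)
        hess = -0.4 * e2        # second derivative, negative for every alpha
        step = -grad / hess
        alpha += step
        if abs(step) < tol:
            break
    return math.exp(alpha)      # map back to the constrained parameter

def shortened_step(theta, delta, feasible, shrink=0.5, max_halvings=50):
    """Return theta + lam*delta, cutting lam until the constraint holds.
    Assumes the current theta is itself feasible."""
    lam = 1.0
    for _ in range(max_halvings):
        candidate = theta + lam * delta
        if feasible(candidate):
            return candidate
        lam *= shrink
    raise RuntimeError("no feasible step found along this direction")

# Start the reparameterized search at theta = 5, i.e., alpha = ln 5.
theta_hat = maximize_reparameterized(alpha=math.log(5.0))

# A full step from 0.5 in direction -2 would give -1.5, which violates theta > 0.
theta_next = shortened_step(theta=0.5, delta=-2.0, feasible=lambda t: t > 0.0)
```

    In the first function the positivity constraint never has to be checked, because every value of $\alpha$ maps to an admissible $\theta$. In the second, the direction is kept and only the step length is reduced, which is the sense of "between" used in the text.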
    E.6.5 SOME PRACTICAL CONSIDERATIONS

    The reasons for the good performance of many algorithms, including DFP, are unknown. Moreover, different algorithms may perform differently in given settings. Indeed, for some problems, one algorithm may fail to converge whereas another will succeed in finding a solution without great difficulty. In view of this, computer programs such as GQOPT [33] and Gauss that offer a menu of different preprogrammed algorithms can be particularly useful. It is sometimes worth the effort to try more than one algorithm on a given problem.

    Step Sizes. Except for the steepest ascent case, an optimal line search is likely to be infeasible or to require more effort than it is worth in view of the potentially large number of function evaluations required. In most cases, the choice of a step size is likely to be rather ad hoc. But within limits, the most widely used algorithms appear to be robust to inaccurate line searches. For example, one method employed by the widely used TSP computer program [34] is the method of squeezing, which tries $\lambda = 1, \tfrac{1}{2}, \tfrac{1}{4}$, and so on until an improvement in the function results. Although this approach is obviously a bit unorthodox, it appears to be quite effective when used with the Gauss-Newton method for nonlinear least squares problems. (See Chapter 9.) A somewhat more elaborate rule is suggested by Berndt et al. (1974). Choose an $\varepsilon$ between 0 and $\tfrac{1}{2}$, and then find a $\lambda$ such that

        $$\varepsilon < \frac{F(\theta + \lambda\Delta) - F(\theta)}{\lambda\, g'\Delta} < 1 - \varepsilon. \qquad \text{(E-27)}$$

    Of course, which value of $\varepsilon$ to choose is still open, so the choice of $\lambda$ remains ad hoc. Moreover, in neither of these cases is there any optimality to the choice; we merely find a $\lambda$ that leads to a function improvement. Other authors have devised relatively efficient means of searching for a step size without doing the full optimization at each iteration. [35]

    [33] Goldfeld and Quandt (1972).
    [34] Hall (1982, p. 147).
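    The two step-size rules just described can be written down directly. The sketch below is for a scalar parameter only, so that $g'\Delta$ reduces to an ordinary product; the function names, the cap on the number of halvings, and the quadratic test problem are assumptions made for the illustration and do not come from TSP or from Berndt et al.

```python
def squeeze(F, theta, delta, max_halvings=30):
    """Method of squeezing: try lam = 1, 1/2, 1/4, ... and return the first
    lam for which F(theta + lam*delta) exceeds F(theta)."""
    f0 = F(theta)
    lam = 1.0
    for _ in range(max_halvings):
        if F(theta + lam * delta) > f0:
            return lam
        lam *= 0.5
    return None  # no improving step was found along this direction

def e27_ratio(F, g, theta, delta, lam):
    """The middle term of (E-27) for a scalar parameter:
    [F(theta + lam*delta) - F(theta)] / (lam * g * delta)."""
    return (F(theta + lam * delta) - F(theta)) / (lam * g * delta)

def satisfies_e27(F, g, theta, delta, lam, eps):
    """True if eps < ratio < 1 - eps, with eps chosen between 0 and 1/2."""
    return eps < e27_ratio(F, g, theta, delta, lam) < 1.0 - eps

# Hypothetical test problem: F(theta) = -(theta - 1)^2, maximized at theta = 1,
# started at theta = 0 with a deliberately overlong direction of 8.
F = lambda t: -(t - 1.0) ** 2
theta0, g0, delta0 = 0.0, 2.0, 8.0   # g0 is F'(0) = 2

lam = squeeze(F, theta0, delta0)
ratio = e27_ratio(F, g0, theta0, delta0, lam)
ok = satisfies_e27(F, g0, theta0, delta0, lam, eps=0.25)
```

    Neither rule looks for the best $\lambda$; both stop at the first value that passes the check, which is the point made in the text about the absence of any optimality in the choice.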

    Assessing Convergence. Ideally, the iterative procedure should terminate when the gradient is zero. In practice, this step will not be possible, primarily because of accumulated rounding error in the computation of the function and its derivatives. Therefore, a number of alternative convergence criteria are used. Most of them are based on the relative changes in the function or the parameters. There is considerable variation in those used in different computer programs, and there are some pitfalls that should be avoided. A critical absolute value for the elements of the gradient or its norm will be affected by any scaling of the function, such as normalizing it by the sample size. Similarly, stopping on the basis of small absolute changes in the parameters can lead to premature convergence when the parameter vector approaches the maximizer. It is probably best to use several criteria simultaneously, such as the proportional change in both the function and the parameters. Belsley (1980) discusses a number of possible stopping rules. One that has proved useful and is immune to the scaling problem is to base convergence on $g'H^{-1}g$.

    Multiple Solutions. It is possible for a function to have several local extrema. It is difficult to know a priori whether this is true of the one at hand. But if the function is not globally concave, then it may be a good idea to attempt to maximize it from several starting points to ensure that the maximum obtained is the global one. Ideally, a starting value near the optimum can facilitate matters; in some settings, this can be obtained by using a consistent estimate of the parameter for the starting point. The method of moments, if available, is sometimes a convenient device for doing so.

    No Solution. Finally, it should be noted that in a nonlinear setting the iterative algorithm can break down, even in the absence of constraints, for at least two reasons. The first possibility is that the problem being solved may be so numerically complex as to defy solution. The second possibility, which is often neglected, is that the proposed model may simply be inappropriate for the data. In a linear setting, a low $R^2$ or some other diagnostic test may suggest that the model and data are mismatched, but as long as the full rank condition is met by the regressor matrix, a linear regression can always be computed. Nonlinear models are not so forgiving. The failure of an iterative algorithm to find a maximum of the criterion function may be a warning that the model is not appropriate for this body of data.
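    The suggestion to use several criteria simultaneously can be made concrete with a small stopping rule. The sketch below is one possible combination for a scalar parameter, not a rule taken from any particular program: the function name `converged`, the common tolerance, and the guard against division by zero are assumptions, and `abs(g * g / h)` is simply the scalar analogue of $g'H^{-1}g$.

```python
def converged(theta_old, theta_new, f_old, f_new, g, h, tol=1e-8):
    """Require small proportional changes in both the function and the
    parameter, and a small value of the quadratic form in the gradient."""
    rel_f = abs(f_new - f_old) / max(abs(f_old), 1e-12)
    rel_theta = abs(theta_new - theta_old) / max(abs(theta_old), 1e-12)
    quad = abs(g * g / h)   # scalar analogue of g' H^{-1} g
    return rel_f < tol and rel_theta < tol and quad < tol

# Illustrative inputs: one pair of iterates essentially at a maximizer,
# and one pair taken early in a search.
at_solution = converged(2.236068, 2.236068, 0.304719, 0.304719, g=4e-7, h=-0.4)
far_away = converged(5.0, 1.66667, -0.890562, 0.233048, g=0.266667, h=-0.56)
```

    Requiring all three conditions guards against stopping merely because the parameter barely moved on one iteration while the gradient is still far from zero.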
    [35] See, for example, Joreskog and Gruvaeus (1970), Powell (1964), Quandt (1983), and Hall (1982).

    E.6.6 Examples

    To illustrate the use of gradient methods, we consider several simple problems.

    E.6.6.a Function of One Parameter

    First, consider maximizing a function of a single variable, $f(\theta) = \ln(\theta) - 0.1\theta^2$. The function is shown in Figure E.4. The first and second derivatives are

        $$f'(\theta) = \frac{1}{\theta} - 0.2\theta, \qquad f''(\theta) = -\frac{1}{\theta^2} - 0.2.$$

    FIGURE E.4  Function of One Variable (Parameter). Vertical axis: function value; horizontal axis: $\theta$ from 0.50 to 5.00.

    TABLE E.1  Iterations for Newton's Method

    Iteration      θ            f             f'            f''
        0       5.00000     -0.890562     -0.800000     -0.240000
        1       1.66667      0.233048      0.266667     -0.560000
        2       2.14286      0.302956      0.038095     -0.417778
        3       2.23404      0.304718      0.000811     -0.400363
        4       2.23607      0.304719      0.0000004    -0.400000

    Equating $f'$ to zero yields the simple solution $\theta = \sqrt{5} = 2.236$. At the solution, $f'' = -0.4$, so this solution is indeed a maximum. To demonstrate the use of an iterative method, we solve this problem using Newton's method. Observe, first, that the second derivative is always negative for any admissible (positive) $\theta$. [36] Therefore, it should not matter where we start the iterations; we shall eventually find the maximum. For a single parameter, Newton's method is

        $$\theta_{t+1} = \theta_t - \frac{f'_t}{f''_t}.$$

    The sequence of values that results when 5 is used as the starting value is given in Table E.1. The path of the iterations is also shown in the table.
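    The Newton iteration for this one-parameter problem is short enough to write out in full. The following sketch uses only the function and derivatives given above; the helper name `newton_path` and the print format are arbitrary choices, and the number of iterations is fixed at four simply to mirror the length of Table E.1.

```python
import math

def f(theta):
    return math.log(theta) - 0.1 * theta ** 2

def f1(theta):
    """First derivative of f."""
    return 1.0 / theta - 0.2 * theta

def f2(theta):
    """Second derivative of f; negative for every theta > 0."""
    return -1.0 / theta ** 2 - 0.2

def newton_path(theta, n_iter=4):
    """Return rows (t, theta_t, f, f', f'') for t = 0, ..., n_iter,
    updating with theta_{t+1} = theta_t - f'(theta_t) / f''(theta_t)."""
    rows = []
    for t in range(n_iter + 1):
        rows.append((t, theta, f(theta), f1(theta), f2(theta)))
        theta = theta - f1(theta) / f2(theta)
    return rows

path = newton_path(5.0)
for t, th, value, d1, d2 in path:
    print(f"{t:d} {th:9.5f} {value:10.6f} {d1:11.7f} {d2:10.6f}")
```

    No step-size adjustment is used here, so each row is a full Newton step from the previous one.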
    [36] In this problem, an inequality restriction, $\theta > 0$, is required. As is common, however, for our first attempt we shall neglect the constraint.

    E.6.6.b Function of Two Parameters: The Gamma Distribution

    For random sampling from the gamma distribution,

        $$f(y_i, \beta, \rho) = \frac{\beta^{\rho}}{\Gamma(\rho)}\, y_i^{\rho-1} e^{-\beta y_i}.$$

    The log likelihood is

        $$\ln L(\beta,\rho) = n\rho\ln\beta - n\ln\Gamma(\rho) - \beta\sum_{i=1}^{n} y_i + (\rho - 1)\sum_{i=1}^{n}\ln y_i.$$

    (See Section 4.9.4.) It is often convenient to scale the log likelihood by the sample size. Suppose, as well, that we have a sample with $\bar{y} = 3$ and $\overline{\ln y} = 1$. Then the function to be maximized is $F(\beta,\rho) = \rho\ln\beta - \ln\Gamma(\rho) - 3\beta + \rho - 1$. The derivatives are

        $$\frac{\partial F}{\partial \beta} = \frac{\rho}{\beta} - 3, \qquad \frac{\partial F}{\partial \rho} = \ln\beta - \frac{\Gamma'}{\Gamma} + 1 = \ln\beta - \Psi(\rho) + 1,$$

        $$\frac{\partial^2 F}{\partial \beta^2} = -\frac{\rho}{\beta^2}, \qquad \frac{\partial^2 F}{\partial \rho^2} = -\frac{\Gamma\Gamma'' - \Gamma'^2}{\Gamma^2} = -\Psi'(\rho), \qquad \frac{\partial^2 F}{\partial \beta\,\partial \rho} = \frac{1}{\beta}.$$

    Finding a good set of starting values is often a difficult problem. Here we choose three starting points somewhat arbitrarily: $(\rho^0, \beta^0) = (4, 1)$, $(8, 3)$, and $(2, 7)$. The solution to the problem is $(5.233, 1.7438)$. We used Newton's method and DFP with a line search to maximize this function. [37] For Newton's method, $\lambda = 1$. The results are shown in Table E.2. The two methods were essentially the same when starting from a good starting point (trial 1), but they differed substantially when starting from a poorer one (trial 2). Note that DFP and Newton approached the solution from different directions in trial 2. The third starting point shows the value of a line search. At this starting value, the Hessian is extremely large, and the second value for the parameter vector with Newton's method is $(-47.671, -233.35)$, at which point $F$ cannot be computed and this method must be abandoned. Beginning with $H = I$ and using a line search, DFP reaches the point $(6.63, 2.03)$ at the first iteration, after which convergence occurs routinely in three more iterations. At the solution, the Hessian is

        $$\begin{bmatrix} -1.72038 & 0.191153 \\ 0.191153 & -0.210579 \end{bmatrix}.$$

    The diagonal elements of the Hessian are negative and its determinant is 0.32574, so it is negative definite. (The two characteristic roots are $-1.7442$ and $-0.18675$.) Therefore, this result is indeed the maximum of the function.

    TABLE E.2  Iterative Solutions to $\max_{\rho,\beta}\; \rho\ln\beta - \ln\Gamma(\rho) - 3\beta + \rho - 1$

                    Trial 1                       Trial 2                       Trial 3
               DFP          Newton           DFP          Newton           DFP          Newton
    Iter     ρ      β      ρ      β        ρ      β      ρ      β        ρ      β       ρ       β
      0    4.000  1.000  4.000  1.000    8.000  3.000  8.000  3.000    2.000  7.000    2.000   7.000
      1    3.981  1.345  3.812  1.203    7.117  2.518  2.640  0.615    6.663  2.027  -47.7   -233.
      2    4.005  1.324  4.795  1.577    7.144  2.372  3.203  0.931    6.195  2.075
      3    5.217  1.743  5.190  1.728    7.045  2.389  4.257  1.357    5.239  1.731
      4    5.233  1.744  5.231  1.744    5.114  1.710  5.011  1.656    5.251  1.754
      5                                  5.239  1.747  5.219  1.740    5.233  1.744
      6                                  5.233  1.744  5.233  1.744
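    The Newton iterations for this two-parameter problem can be sketched with the standard library alone. Several things here are assumptions made for the illustration: `digamma` and `trigamma` are hand-written approximations to $\Psi(\rho)$ and $\Psi'(\rho)$ (a recurrence followed by a truncated asymptotic series), the function names are hypothetical, and DFP is not implemented. Because the derivatives are approximated differently, intermediate iterates need not agree with Table E.2 digit for digit.

```python
import math

def digamma(x):
    """Approximate Psi(x) for x > 0: shift x upward with the recurrence
    Psi(x) = Psi(x + 1) - 1/x, then use an asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def trigamma(x):
    """Approximate Psi'(x) for x > 0 by the same device."""
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + 1.0 / x + 0.5 * inv2 + (inv2 / x) * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))

def newton_step(rho, beta):
    """One full Newton step (lambda = 1) for
    F = rho*ln(beta) - ln Gamma(rho) - 3*beta + rho - 1."""
    g_rho = math.log(beta) - digamma(rho) + 1.0
    g_beta = rho / beta - 3.0
    h_rr = -trigamma(rho)
    h_bb = -rho / beta ** 2
    h_rb = 1.0 / beta
    det = h_rr * h_bb - h_rb ** 2
    # Solve H d = g for the 2 x 2 system, then move to (rho, beta) - d.
    d_rho = (h_bb * g_rho - h_rb * g_beta) / det
    d_beta = (h_rr * g_beta - h_rb * g_rho) / det
    return rho - d_rho, beta - d_beta

def newton(rho, beta, tol=1e-10, max_iter=50):
    """Iterate full Newton steps; return None if an iterate leaves the
    region rho > 0, beta > 0, where F cannot be evaluated."""
    for _ in range(max_iter):
        new_rho, new_beta = newton_step(rho, beta)
        if new_rho <= 0.0 or new_beta <= 0.0:
            return None
        done = abs(new_rho - rho) + abs(new_beta - beta) < tol
        rho, beta = new_rho, new_beta
        if done:
            break
    return rho, beta

sol1 = newton(4.0, 1.0)   # trial 1 starting point
sol2 = newton(8.0, 3.0)   # trial 2 starting point
sol3 = newton(2.0, 7.0)   # trial 3 starting point
```

    Returning `None` rather than shortening the step is deliberate: it reproduces the behavior of the unit-step method at the third starting point, which is the case the text uses to motivate a line search.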
    [37] The one used is described in Joreskog and Gruvaeus (1970).

    E.6.6.c A Concentrated Log-Likelihood Function

    There is another way that the preceding problem might have been solved. The first of the necessary conditions implies that at the joint solution for $(\beta, \rho)$, $\beta$ will equal $\rho/3$. Suppose that we impose this requirement on the function we are maximizing. The concentrated (over $\beta$) log-likelihood function is then produced:

        $$F_c(\rho) = \rho\ln(\rho/3) - \ln\Gamma(\rho) - 3(\rho/3) + \rho - 1 = \rho\ln(\rho/3) - \ln\Gamma(\rho) - 1.$$

    FIGURE E.5  Concentrated Log Likelihood. Vertical axis: function value; horizontal axis: $\rho$ from 0 to 10.

    This function could be maximized by an iterative search or by a simple one-dimensional grid search. Figure E.5 shows the behavior of the function. As expected, the maximum occurs at $\rho = 5.233$. The value of $\beta$ is found as $5.23/3 = 1.743$.

    The concentrated log likelihood is a useful device in many problems. Note the interpretation of the function plotted in Figure E.5. The original function of $\rho$ and $\beta$ is a surface in three dimensions. The curve in Figure E.5 is a projection of that function; it is a plot of the function values above the line $\beta = \rho/3$. By virtue of the first-order condition, we know that one of these points will be the maximizer of the function. Therefore, we may restrict our search for the overall maximum of $F(\beta, \rho)$ to the points on this line.
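    The one-dimensional grid search mentioned above takes only a few lines. In this sketch the grid limits, the spacing of 0.001, and the names `f_concentrated`, `rho_hat`, and `beta_hat` are arbitrary choices for the illustration; `math.lgamma` supplies $\ln\Gamma(\rho)$.

```python
import math

def f_concentrated(rho):
    """F_c(rho) = rho*ln(rho/3) - ln Gamma(rho) - 1, i.e., F with beta = rho/3."""
    return rho * math.log(rho / 3.0) - math.lgamma(rho) - 1.0

# Evaluate F_c on an evenly spaced grid over (0, 10] and keep the best point.
grid = [k / 1000.0 for k in range(1, 10001)]
rho_hat = max(grid, key=f_concentrated)

# Recover beta from the first-order condition beta = rho / 3.
beta_hat = rho_hat / 3.0
```

    The search is over a single parameter, so the second parameter is never iterated on; it is computed once, at the end, from the restriction that defined the concentrated function.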

    APPENDIX F

    DATA SETS USED IN APPLICATIONS
    The following tables list the variables in the data sets used in the applications in the text. The data sets themselves can be downloaded from the website for the text.
    TABLE F1.1  Consumption and Income, 10 Yearly Observations, 1970-1979

    C = Consumption
    Y = Disposable income
    Source: Council of Economic Advisors, Economic Report of the President (Washington, D.C.: U.S. Government Printing Office, 1987).


    TABLE F2.1  Consumption and Income, 11 Yearly Observations, 1940-1950

    Year = Date
    X = Disposable income
    C = Consumption
    W = War years dummy variable, one in 1942-1945, zero other years
    Source: Economic Report of the President (Washington, D.C.: U.S. Government Printing Office, 1983).

    TABLE F2.2  The U.S. Gasoline Market, 36 Yearly Observations, 1960-1995

    G = Total U.S. gasoline consumption, computed as total expenditure divided by price index
    Pg = Price index for gasoline
    Y = Per capita disposable income
    Pnc = Price index for new cars
    Puc = Price index for used cars
    Ppt = Price index for public transportation
    Pd = Aggregate price index for consumer durables
    Pn = Aggregate price index for consumer nondurables
    Ps = Aggregate price index for consumer services
    Pop = U.S. total population in millions
    Source: Council of Economic Advisors, Economic Report of the President: 1996 (Washington, D.C.: U.S. Government Printing Office, 1996).

    TABLE F3.1  Investment, 15 Yearly Observations, 1968-1982

    Year = Date
    GNP = Nominal GNP
    Invest = Nominal investment
    CPI = Consumer price index
    Interest = Interest rate
    Source: Economic Report of the President (Washington, D.C.: U.S. Government Printing Office, 1983). CPI 1967 is 79.06. The interest rate is the average yearly discount rate at the New York Federal Reserve Bank.

    TABLE F4.1  Labor Supply Data from Mroz (1987)

    LFP = A dummy variable = 1 if woman worked in 1975, else 0
    WHRS = Wife's hours of work in 1975
    KL6 = Number of children less than 6 years old in household
    K618 = Number of children between ages 6 and 18 in household
    WA = Wife's age
    WE = Wife's educational attainment, in years
    WW = Wife's average hourly earnings, in 1975 dollars
    RPWG = Wife's wage reported at the time of the 1976 interview (not 1975 estimated wage)
    HHRS = Husband's hours worked in 1975
    HA = Husband's age
    HE = Husband's educational attainment, in years
    HW = Husband's wage, in 1975 dollars
    FAMINC = Family income, in 1975 dollars
    WMED = Wife's mother's educational attainment, in years
    WFED = Wife's father's educational attainment, in years
    UN = Unemployment rate in county of residence, in percentage points
    CIT = Dummy variable = one if live in large city (SMSA), else zero
    AX = Actual years of wife's previous labor market experience
    Source: 1976 Panel Study of Income Dynamics, Mroz (1987).


    TABLE F4.2  The Longley Data, 15 Yearly Observations, 1947-1962

    Employ = Employment (1,000s)
    Price = GNP deflator
    GNP = Nominal GNP (millions)
    Armed = Armed forces
    Year = Date
    Source: Longley (1967).

    TABLE F5.1  Macroeconomics Data Set, Quarterly, 1950I to 2000IV

    Year = Date
    Qtr = Quarter
    Realgdp = Real GDP ($bil)
    Realcons = Real consumption expenditures
    Realinvs = Real investment by private sector
    Realgovt = Real government expenditures
    Realdpi = Real disposable personal income
    CPI_U = Consumer price index
    M1 = Nominal money stock
    Tbilrate = Quarterly average of month end 90 day t-bill rate
    Unemp = Unemployment rate
    Pop = Population, mil., interpolate of year end figures using constant growth rate per quarter
    Infl = Rate of inflation (first observation is missing)
    Realint = Ex post real interest rate = Tbilrate - Infl (first observation missing)
    Source: Department of Commerce, BEA website and www.economagic.com.

    TABLE F5.2  Cost Function, 123 1970 Cross-section Firm Level Observations

    Id = Observation
    Year = 1970 for all observations
    Cost = Total cost
    Q = Total output
    Pl = Wage rate
    Sl = Cost share for labor
    Pk = Capital price index
    Sk = Cost share for capital
    Pf = Fuel price
    Sf = Cost share for fuel
    Source: Christensen and Greene (1976). Note: The file contains some extra observations. These are the holding companies. Use only the first 123 observations to replicate Christensen and Greene.

    TABLE F6.1  Production for SIC 33: Primary Metals, 27 Statewide Observations

    Obs = Observation number
    Valueadd = Value added
    Labor = Labor input
    Capital = Capital stock
    Note: Data are per establishment; labor is a measure of labor input, and capital is the gross value of plant and equipment. A scale factor used to normalize the capital figure in the original study has been omitted. Further details on construction of the data are given in Aigner et al. (1977) and in Hildebrand and Liu (1957).


    TABLE F7.1  Costs for U.S. Airlines, 90 Observations on 6 Firms for 1970-1984

    I = Airline
    T = Year
    Q = Output, in revenue passenger miles, index number
    C = Total cost, in $1,000
    PF = Fuel price
    LF = Load factor, the average capacity utilization of the fleet
    Source: These data are a subset of a larger data set provided to the author by Professor Moshe Kim. They were originally constructed by Christensen Associates of Madison, Wisconsin.

    TABLE F7.2  Solow's Technological Change Data, 41 Yearly Observations, 1909-1949

    Year = Date
    Q = Output
    K = Capital/labor ratio
    A = Index of technology
    Source: Solow (1957, p. 314). Several variables are omitted.

    TABLE F8.1  Pesaran and Hall Inflation Data

    Pa = Actual inflation
    Pe = Expected inflation
    Source: Pesaran and Hall (1988).

    TABLE F9.1  Income and Expenditure Data, 100 Cross-Section Observations

    MDR = Number of derogatory reports
    Acc = Credit card application accepted (1 = yes)
    Age = Age in years + 12ths of a year
    Income = Income, divided by 10,000
    Avgexp = Avg. monthly credit card expenditure
    Ownrent = OwnRent, individual owns (1) or rents (0) home
    Selfempl = Self employed (1 = yes, 0 = no)
    Source: Greene (1992).

    TABLE F9.2  Statewide Data on Transportation Equipment Manufacturing, 25 Observations

    State = Observation
    ValueAdd = output
    Capita = capital input
    Labor = labor input
    Nfirm = number of firms
    Source: A. Zellner and N. Revankar (1970, p. 249). Note: Value added, Capital, and Labor are in millions of 1957 dollars. Data used for regression examples are per establishment. Raw data are used for the stochastic frontier application in Chapter 16.

    TABLE F11.1  Bollerslev and Ghysels Exchange Rate Data, 1974 Daily Observations

    Y = Nominal return on Mark/Pound exchange rate, daily
    Source: Bollerslev (1986).


    TABLE F13.1  Grunfeld Investment Data, 100 Yearly Observations on 5 Firms for 1935-1954

    I = Gross investment, from Moody's Industrial Manual and annual reports of corporations
    F = Value of the firm, from Bank and Quotation Record and Moody's Industrial Manual
    C = Stock of plant and equipment, from Survey of Current Business
    Source: Moody's Industrial Manual, Survey of Current Business.

    TABLE F14.1  Manufacturing Costs, U.S. Economy, 25 Yearly Observations, 1947-1971

    Year = Date
    Cost = Cost index
    K = Capital cost share
    L = Labor cost share
    E = Energy cost share
    M = Materials cost share
    Pk = Capital price
    Pl = Labor price
    Pe = Energy price
    Pm = Materials price
    Source: Berndt and Wood (1975).

    TABLE F14.2  Cost Function, 145 U.S. Electricity Producers, 1955 Data (Nerlove)

    Firm = Observation
    Year = 1955 for all observations
    Cost = Total cost
    Output = Total output
    Pl = Wage rate
    Sl = Cost share for labor
    Pk = Capital price index
    Sk = Cost share for capital
    Pf = Fuel price
    Sf = Cost share for fuel
    Source: Nerlove (1963) and Christensen and Greene (1976). Note: The data file contains several extra observations that are aggregates of commonly owned firms. Use only the first 145 for analysis.

    TABLE F15.1  Klein's Model I, 22 Yearly Observations, 1920-1941

    Year = Date
    C = Consumption
    P = Corporate profits
    Wp = Private wage bill
    I = Investment
    K1 = Previous year's capital stock
    X = GNP
    Wg = Government wage bill
    G = Government spending
    T = Taxes
    Source: Klein (1950).


    TABLE F16.1  Bertschek and Lechner Binary Choice Data

    yit = one if firm i realized a product innovation in year t and zero if not
    xit2 = log of sales
    xit3 = relative size = ratio of employment in business unit to employment in the industry
    xit4 = ratio of industry imports to (industry sales + imports)
    xit5 = ratio of industry foreign direct investment to (industry sales + imports)
    xit6 = productivity = ratio of industry value added to industry employment
    xit7 = dummy variable indicating firm is in the raw materials sector
    xit8 = dummy variable indicating firm is in the investment goods sector
    Note: These data are proprietary. Source: Bertschek and Lechner (1998).

    TABLE F18.1  Dahlberg and Johanssen Municipal Expenditure Data

    ID = Identification
    Year = Date
    Expend = Expenditure
    Revenue = Revenue from taxes and fees
    Grants = Grants from Central Government
    Source: Dahlberg and Johanssen (2000), Journal of Applied Econometrics, data archive.

    TABLE F20.1  Bond Yield on a Moody's Aaa Rated, Monthly, 60 Monthly Observations, 1990-1994

    Date = Year.Month
    Y = Corporate bond rate in percent/year
    Source: National Income and Product Accounts, U.S. Department of Commerce, Bureau of Economic Analysis, Survey of Current Business: Business Statistics.

    TABLE F20.2  Money, Output, and Price Deflator Data, 136 Quarterly Observations, 1950-1983

    Y = Nominal GNP
    M1 = M1 measure of money stock
    P = Implicit price deflator for GNP
    Source: National Income and Product Accounts, U.S. Department of Commerce, Bureau of Economic Analysis, Survey of Current Business: Business Statistics.

    TABLE F21.1  Program Effectiveness, 32 Cross-Section Observations

    Obs = Observation
    TUCE = Test score on economics test
    PSI = Participation in program
    GRADE = Grade increase (1) or decrease (0) indicator
    Source: Spector and Mazzeo (1980).

    TABLE F21.2  Data Used to Study Travel Mode Choice, 840 Observations, on 4 Modes for 210 Individuals

    Mode = choice: Air, Train, Bus, or Car
    Ttme = terminal waiting time, 0 for car
    Invc = in vehicle cost, cost component
    Invt = travel time, in vehicle
    GC = generalized cost measure
    Hinc = household income
    Psize = party size in mode chosen
    Source: Greene and Hensher (1997).


    TABLE F21.3  Ship Accidents, 40 Observations, on 5 Types in 4 Vintages and 2 Service Periods

    Type = Ship type
    TA, TB, TC, TD, TE = Type indicators
    Y6064, Y6569, Y7074, Y7579 = Year constructed indicators
    O6064, O7579 = Years operated indicators
    Months = Measure of service amount
    Acc = Accidents
    Source: McCullagh and Nelder (1983).

    TABLE F21.4  Expenditure and Default Data, 1,319 Observations

    Cardhldr = Dummy variable, one if application for credit card accepted, zero if not
    Majordrg = Number of major derogatory reports
    Age = Age in years plus twelfths of a year
    Income = Yearly income (divided by 10,000)
    Exp_Inc = Ratio of monthly credit card expenditure to yearly income
    Avgexp = Average monthly credit card expenditure
    Ownrent = 1 if owns their home, 0 if rent
    Selfempl = 1 if self employed, 0 if not
    Depndt = 1 + number of dependents
    Inc_per = Income divided by number of dependents
    Cur_add = months living at current address
    Major = number of major credit cards held
    Active = number of active credit accounts
    Source: Greene (1992).

    TABLE F22.1  Strike Duration Data, 63 Observations in 9 Years, 1968-1976

    Year = Date
    T = Strike duration in days
    PROD = Unanticipated output
    Source: Kennan (1985).

    TABLE F22.2  Fair's (1977) Extramarital Affairs Data, 601 Cross-section Observations

    y = Number of affairs in the past year
    z1 = Sex
    z2 = Age
    z3 = Number of years married
    z4 = Children
    z5 = Religiousness
    z6 = Education
    z7 = Occupation
    z8 = Self rating of marriage
    Several variables not used are denoted X1, ..., X5.
    Source: Fair (1977) and http://fairmodel.econ.yale.edu/rayfair/pdf/1978ADAT.ZIP.


    TABLE FD.1  Observations on Income and Education, 20 Observations

    I = Observation
    Y = Income
    Source: Data are artificial.

    APPENDIX G

    STATISTICAL TABLES
    TABLE G.1  Cumulative Normal Distribution. Table Entry Is $\Phi(z) = \text{Prob}(Z \le z)$

     z     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
    0.0   .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
    0.1   .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
    0.2   .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
    0.3   .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
    0.4   .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879
    0.5   .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
    0.6   .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
    0.7   .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
    0.8   .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
    0.9   .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389
    1.0   .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
    1.1   .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
    1.2   .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
    1.3   .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
    1.4   .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319
    1.5   .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441
    1.6   .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
    1.7   .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
    1.8   .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706
    1.9   .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767
    2.0   .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
    2.1   .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
    2.2   .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890
    2.3   .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
    2.4   .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936
    2.5   .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
    2.6   .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
    2.7   .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
    2.8   .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
    2.9   .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986
    3.0   .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
    3.1   .9990  .9991  .9991  .9991  .9992  .9992  .9992  .9992  .9993  .9993
    3.2   .9993  .9993  .9994  .9994  .9994  .9994  .9994  .9995  .9995  .9995
    3.3   .9995  .9995  .9995  .9996  .9996  .9996  .9996  .9996  .9996  .9997
    3.4   .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9998


    TABLE G.2  Percentiles of the Student's t Distribution. Table Entry Is x Such that $\text{Prob}[t_n \le x] = P$

      n     .750    .900    .950    .975    .990    .995
      1    1.000   3.078   6.314  12.706  31.821  63.657
      2     .816   1.886   2.920   4.303   6.965   9.925
      3     .765   1.638   2.353   3.182   4.541   5.841
      4     .741   1.533   2.132   2.776   3.747   4.604
      5     .727   1.476   2.015   2.571   3.365   4.032
      6     .718   1.440   1.943   2.447   3.143   3.707
      7     .711   1.415   1.895   2.365   2.998   3.499
      8     .706   1.397   1.860   2.306   2.896   3.355
      9     .703   1.383   1.833   2.262   2.821   3.250
     10     .700   1.372   1.812   2.228   2.764   3.169
     11     .697   1.363   1.796   2.201   2.718   3.106
     12     .695   1.356   1.782   2.179   2.681   3.055
     13     .694   1.350   1.771   2.160   2.650   3.012
     14     .692   1.345   1.761   2.145   2.624   2.977
     15     .691   1.341   1.753   2.131   2.602   2.947
     16     .690   1.337   1.746   2.120   2.583   2.921
     17     .689   1.333   1.740   2.110   2.567   2.898
     18     .688   1.330   1.734   2.101   2.552   2.878
     19     .688   1.328   1.729   2.093   2.539   2.861
     20     .687   1.325   1.725   2.086   2.528   2.845
     21     .686   1.323   1.721   2.080   2.518   2.831
     22     .686   1.321   1.717   2.074   2.508   2.819
     23     .685   1.319   1.714   2.069   2.500   2.807
     24     .685   1.318   1.711   2.064   2.492   2.797
     25     .684   1.316   1.708   2.060   2.485   2.787
     26     .684   1.315   1.706   2.056   2.479   2.779
     27     .684   1.314   1.703   2.052   2.473   2.771
     28     .683   1.313   1.701   2.048   2.467   2.763
     29     .683   1.311   1.699   2.045   2.462   2.756
     30     .683   1.310   1.697   2.042   2.457   2.750
     35     .682   1.306   1.690   2.030   2.438   2.724
     40     .681   1.303   1.684   2.021   2.423   2.704
     45     .680   1.301   1.679   2.014   2.412   2.690
     50     .679   1.299   1.676   2.009   2.403   2.678
     60     .679   1.296   1.671   2.000   2.390   2.660
     70     .678   1.294   1.667   1.994   2.381   2.648
     80     .678   1.292   1.664   1.990   2.374   2.639
     90     .677   1.291   1.662   1.987   2.368   2.632
    100     .677   1.290   1.660   1.984   2.364   2.626
      ∞     .674   1.282   1.645   1.960   2.326   2.576


    TABLE G.3  Percentiles of the Chi-Squared Distribution. Table Entry Is c Such that $\text{Prob}[\chi^2_n \le c] = P$

     n    .005    .010   .025   .050   .100   .250   .500   .750   .900   .950   .975   .990   .995
     1   .00004  .0002   .001   .004    .02    .10    .45   1.32   2.71   3.84   5.02   6.63   7.88
     2    .01     .02    .05    .10     .21    .58   1.39   2.77   4.61   5.99   7.38   9.21  10.60
     3    .07     .11    .22    .35     .58   1.21   2.37   4.11   6.25   7.81   9.35  11.34  12.84
     4    .21     .30    .48    .71    1.06   1.92   3.36   5.39   7.78   9.49  11.14  13.28  14.86
     5    .41     .55    .83   1.15    1.61   2.67   4.35   6.63   9.24  11.07  12.83  15.09  16.75
     6    .68     .87   1.24   1.64    2.20   3.45   5.35   7.84  10.64  12.59  14.45  16.81  18.55
     7    .99    1.24   1.69   2.17    2.83   4.25   6.35   9.04  12.02  14.07  16.01  18.48  20.28
     8   1.34    1.65   2.18   2.73    3.49   5.07   7.34  10.22  13.36  15.51  17.53  20.09  21.95
     9   1.73    2.09   2.70   3.33    4.17   5.90   8.34  11.39  14.68  16.92  19.02  21.67  23.59
    10   2.16    2.56   3.25   3.94    4.87   6.74   9.34  12.55  15.99  18.31  20.48  23.21  25.19
    11   2.60    3.05   3.82   4.57    5.58   7.58  10.34  13.70  17.28  19.68  21.92  24.72  26.76
    12   3.07    3.57   4.40   5.23    6.30   8.44  11.34  14.85  18.55  21.03  23.34  26.22  28.30
    13   3.57    4.11   5.01   5.89    7.04   9.30  12.34  15.98  19.81  22.36  24.74  27.69  29.82
    14   4.07    4.66   5.63   6.57    7.79  10.17  13.34  17.12  21.06  23.68  26.12  29.14  31.32
    15   4.60    5.23   6.26   7.26    8.55  11.04  14.34  18.25  22.31  25.00  27.49  30.58  32.80
    16   5.14    5.81   6.91   7.96    9.31  11.91  15.34  19.37  23.54  26.30  28.85  32.00  34.27
    17   5.70    6.41   7.56   8.67   10.09  12.79  16.34  20.49  24.77  27.59  30.19  33.41  35.72
    18   6.26    7.01   8.23   9.39   10.86  13.68  17.34  21.60  25.99  28.87  31.53  34.81  37.16
    19   6.84    7.63   8.91  10.12   11.65  14.56  18.34  22.72  27.20  30.14  32.85  36.19  38.58
    20   7.43    8.26   9.59  10.85   12.44  15.45  19.34  23.83  28.41  31.41  34.17  37.57  40.00
    21   8.03    8.90  10.28  11.59   13.24  16.34  20.34  24.93  29.62  32.67  35.48  38.93  41.40
    22   8.64    9.54  10.98  12.34   14.04  17.24  21.34  26.04  30.81  33.92  36.78  40.29  42.80
    23   9.26   10.20  11.69  13.09   14.85  18.14  22.34  27.14  32.01  35.17  38.08  41.64  44.18
    24   9.89   10.86  12.40  13.85   15.66  19.04  23.34  28.24  33.20  36.42  39.36  42.98  45.56
    25  10.52   11.52  13.12  14.61   16.47  19.94  24.34  29.34  34.38  37.65  40.65  44.31  46.93
    30  13.79   14.95  16.79  18.49   20.60  24.48  29.34  34.80  40.26  43.77  46.98  50.89  53.67
    35  17.19   18.51  20.57  22.47   24.80  29.05  34.34  40.22  46.06  49.80  53.20  57.34  60.27
    40  20.71   22.16  24.43  26.51   29.05  33.66  39.34  45.62  51.81  55.76  59.34  63.69  66.77
    45  24.31   25.90  28.37  30.61   33.35  38.29  44.34  50.98  57.51  61.66  65.41  69.96  73.17
    50  27.99   29.71  32.36  34.76   37.69  42.94  49.33  56.33  63.17  67.50  71.42  76.15  79.49


    TABLE G.4  95th Percentiles of the F Distribution
    Table Entry is f such that Prob[F(n1, n2) ≤ f] = .95
    n1 = Degrees of Freedom for the Numerator

      n2       1       2       3       4       5       6       7       8       9
       1  161.45  199.50  215.71  224.58  230.16  233.99  236.77  238.88  240.54
       2   18.51   19.00   19.16   19.25   19.30   19.33   19.35   19.37   19.38
       3   10.13    9.55    9.28    9.12    9.01    8.94    8.89    8.85    8.81
       4    7.71    6.94    6.59    6.39    6.26    6.16    6.09    6.04    6.00
       5    6.61    5.79    5.41    5.19    5.05    4.95    4.88    4.82    4.77
       6    5.99    5.14    4.76    4.53    4.39    4.28    4.21    4.15    4.10
       7    5.59    4.74    4.35    4.12    3.97    3.87    3.79    3.73    3.68
       8    5.32    4.46    4.07    3.84    3.69    3.58    3.50    3.44    3.39
       9    5.12    4.26    3.86    3.63    3.48    3.37    3.29    3.23    3.18
      10    4.96    4.10    3.71    3.48    3.33    3.22    3.14    3.07    3.02
      15    4.54    3.68    3.29    3.06    2.90    2.79    2.71    2.64    2.59
      20    4.35    3.49    3.10    2.87    2.71    2.60    2.51    2.45    2.39
      25    4.24    3.39    2.99    2.76    2.60    2.49    2.40    2.34    2.28
      30    4.17    3.32    2.92    2.69    2.53    2.42    2.33    2.27    2.21
      40    4.08    3.23    2.84    2.61    2.45    2.34    2.25    2.18    2.12
      50    4.03    3.18    2.79    2.56    2.40    2.29    2.20    2.13    2.07
      70    3.98    3.13    2.74    2.50    2.35    2.23    2.14    2.07    2.02
     100    3.94    3.09    2.70    2.46    2.31    2.19    2.10    2.03    1.97
       ∞    3.84    3.00    2.60    2.37    2.21    2.10    2.01    1.94    1.88

      n2      10      12      15      20      30      40      50      60       ∞
       1  241.88  243.91  245.95  248.01  250.10  251.14  252.20  252.20  254.19
       2   19.40   19.41   19.43   19.45   19.46   19.47   19.48   19.48   19.49
       3    8.79    8.74    8.70    8.66    8.62    8.59    8.57    8.57    8.53
       4    5.96    5.91    5.86    5.80    5.75    5.72    5.69    5.69    5.63
       5    4.74    4.68    4.62    4.56    4.50    4.46    4.43    4.43    4.37
       6    4.06    4.00    3.94    3.87    3.81    3.77    3.74    3.74    3.67
       7    3.64    3.57    3.51    3.44    3.38    3.34    3.30    3.30    3.23
       8    3.35    3.28    3.22    3.15    3.08    3.04    3.01    3.01    2.93
       9    3.14    3.07    3.01    2.94    2.86    2.83    2.79    2.79    2.71
      10    2.98    2.91    2.85    2.77    2.70    2.66    2.62    2.62    2.54
      15    2.54    2.48    2.40    2.33    2.25    2.20    2.16    2.16    2.07
      20    2.35    2.28    2.20    2.12    2.04    1.99    1.95    1.95    1.85
      25    2.24    2.16    2.09    2.01    1.92    1.87    1.82    1.82    1.72
      30    2.16    2.09    2.01    1.93    1.84    1.79    1.74    1.74    1.63
      40    2.08    2.00    1.92    1.84    1.74    1.69    1.64    1.64    1.52
      50    2.03    1.95    1.87    1.78    1.69    1.63    1.58    1.58    1.45
      70    1.97    1.89    1.81    1.72    1.62    1.57    1.50    1.50    1.36
     100    1.93    1.85    1.77    1.68    1.57    1.52    1.45    1.45    1.30
       ∞    1.83    1.75    1.67    1.57    1.46    1.39    1.34    1.31    1.30
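
    As the denominator degrees of freedom n2 grow without bound, F(n1, n2) converges in distribution to a chi-squared variable with n1 degrees of freedom divided by n1. The limiting entries of Table G.4 for n1 = 1 to 9 should therefore equal the 95th percentiles of the chi-squared distribution divided by n1. The sketch below is an illustrative consistency check with hypothetical names; both lists are copied from the tables in this appendix, and agreement is expected only up to the rounding of two-decimal entries.

```python
# 95th percentiles of the chi-squared distribution for n1 = 1..9 degrees of freedom,
# copied from the .95 column of the chi-squared table in this appendix.
CHI2_95 = [3.84, 5.99, 7.81, 9.49, 11.07, 12.59, 14.07, 15.51, 16.92]

# Table G.4 entries for n1 = 1..9 in the limiting (largest n2) row.
F95_LIMIT_ROW = [3.84, 3.00, 2.60, 2.37, 2.21, 2.10, 2.01, 1.94, 1.88]

def limiting_f_from_chi2(chi2_percentiles):
    """Divide the chi-squared percentile for n1 degrees of freedom by n1 (n1 = 1, 2, ...)."""
    return [c / n1 for n1, c in enumerate(chi2_percentiles, start=1)]

def max_gap():
    """Largest absolute difference between chi2/n1 and the tabulated limiting F entry."""
    implied = limiting_f_from_chi2(CHI2_95)
    return max(abs(a - b) for a, b in zip(implied, F95_LIMIT_ROW))

if __name__ == "__main__":
    for n1, (a, b) in enumerate(zip(limiting_f_from_chi2(CHI2_95), F95_LIMIT_ROW), start=1):
        print(f"n1 = {n1}: chi2/n1 = {a:.4f}, table = {b:.2f}")
```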

    TABLE G.5  99th Percentiles of the F Distribution
    Table Entry is f such that Prob[F(n1, n2) ≤ f] = .99
    n1 = Degrees of Freedom for the Numerator

      n2        1        2        3        4        5        6        7        8        9
       1  4052.18  4999.50  5403.35  5624.58  5763.65  5858.99  5928.36  5981.07  6022.47
       2    98.50    99.00    99.17    99.25    99.30    99.33    99.36    99.37    99.39
       3    34.12    30.82    29.46    28.71    28.24    27.91    27.67    27.49    27.35
       4    21.20    18.00    16.69    15.98    15.52    15.21    14.98    14.80    14.66
       5    16.26    13.27    12.06    11.39    10.97    10.67    10.46    10.29    10.16
       6    13.75    10.92     9.78     9.15     8.75     8.47     8.26     8.10     7.98
       7    12.25     9.55     8.45     7.85     7.46     7.19     6.99     6.84     6.72
       8    11.26     8.65     7.59     7.01     6.63     6.37     6.18     6.03     5.91
       9    10.56     8.02     6.99     6.42     6.06     5.80     5.61     5.47     5.35
      10    10.04     7.56     6.55     5.99     5.64     5.39     5.20     5.06     4.94
      15     8.68     6.36     5.42     4.89     4.56     4.32     4.14     4.00     3.89
      20     8.10     5.85     4.94     4.43     4.10     3.87     3.70     3.56     3.46
      25     7.77     5.57     4.68     4.18     3.85     3.63     3.46     3.32     3.22
      30     7.56     5.39     4.51     4.02     3.70     3.47     3.30     3.17     3.07
      40     7.31     5.18     4.31     3.83     3.51     3.29     3.12     2.99     2.89
      50     7.17     5.06     4.20     3.72     3.41     3.19     3.02     2.89     2.78
      70     7.01     4.92     4.07     3.60     3.29     3.07     2.91     2.78     2.67
     100     6.90     4.82     3.98     3.51     3.21     2.99     2.82     2.69     2.59
       ∞     6.66     4.63     3.80     3.34     3.04     2.82     2.66     2.53     2.43

      n2       10       12       15       20       30       40       50       60        ∞
       1  6055.85  6106.32  6157.28  6208.73  6260.65  6286.78  6313.03  6313.03  6362.68
       2    99.40    99.42    99.43    99.45    99.47    99.47    99.48    99.48    99.50
       3    27.23    27.05    26.87    26.69    26.50    26.41    26.32    26.32    26.14
       4    14.55    14.37    14.20    14.02    13.84    13.75    13.65    13.65    13.47
       5    10.05     9.89     9.72     9.55     9.38     9.29     9.20     9.20     9.03
       6     7.87     7.72     7.56     7.40     7.23     7.14     7.06     7.06     6.89
       7     6.62     6.47     6.31     6.16     5.99     5.91     5.82     5.82     5.66
       8     5.81     5.67     5.52     5.36     5.20     5.12     5.03     5.03     4.87
       9     5.26     5.11     4.96     4.81     4.65     4.57     4.48     4.48     4.32
      10     4.85     4.71     4.56     4.41     4.25     4.17     4.08     4.08     3.92
      15     3.80     3.67     3.52     3.37     3.21     3.13     3.05     3.05     2.88
      20     3.37     3.23     3.09     2.94     2.78     2.69     2.61     2.61     2.43
      25     3.13     2.99     2.85     2.70     2.54     2.45     2.36     2.36     2.18
      30     2.98     2.84     2.70     2.55     2.39     2.30     2.21     2.21     2.02
      40     2.80     2.66     2.52     2.37     2.20     2.11     2.02     2.02     1.82
      50     2.70     2.56     2.42     2.27     2.10     2.01     1.91     1.91     1.70
      70     2.59     2.45     2.31     2.15     1.98     1.89     1.78     1.78     1.56
     100     2.50     2.37     2.22     2.07     1.89     1.80     1.69     1.69     1.45
       ∞     2.34     2.20     2.06     1.90     1.72     1.61     1.50     1.50     1.16

    TABLE G.6  Durbin-Watson Statistic: 5 Percent Significance Points of dL and dU

                   k = 1       k = 2       k = 3       k = 4       k = 5      k = 10      k = 15
       n    dL    dU    dL    dU    dL    dU    dL    dU    dL    dU    dL    dU    dL    dU
      15  1.08  1.36   .95  1.54   .82  1.75   .69  1.97   .56  2.21
      16  1.10  1.37   .98  1.54   .86  1.73   .74  1.93   .62  2.15   .16  3.30
      17  1.13  1.38  1.02  1.54   .90  1.71   .78  1.90   .67  2.10   .20  3.18
      18  1.16  1.39  1.05  1.53   .93  1.69   .82  1.87   .71  2.06   .24  3.07
      19  1.18  1.40  1.08  1.53   .97  1.68   .86  1.85   .75  2.02   .29  2.97
      20  1.20  1.41  1.10  1.54  1.00  1.68   .90  1.83   .79  1.99   .34  2.89   .06  3.68
      21  1.22  1.42  1.13  1.54  1.03  1.67   .93  1.81   .83  1.96   .38  2.81   .09  3.58
      22  1.24  1.43  1.15  1.54  1.05  1.66   .96  1.80   .86  1.94   .42  2.73   .12  3.55
      23  1.26  1.44  1.17  1.54  1.08  1.66   .99  1.79   .90  1.92   .47  2.67   .15  3.41
      24  1.27  1.45  1.19  1.55  1.10  1.66  1.01  1.78   .93  1.90   .51  2.61   .19  3.33
      25  1.29  1.45  1.21  1.55  1.12  1.66  1.04  1.77   .95  1.89   .54  2.57   .22  3.25
      26  1.30  1.46  1.22  1.55  1.14  1.65  1.06  1.76   .98  1.88   .58  2.51   .26  3.18
      27  1.32  1.47  1.24  1.56  1.16  1.65  1.08  1.76  1.01  1.86   .62  2.47   .29  3.11
      28  1.33  1.48  1.26  1.56  1.18  1.65  1.10  1.75  1.03  1.85   .65  2.43   .33  3.05
      29  1.34  1.48  1.27  1.56  1.20  1.65  1.12  1.74  1.05  1.84   .68  2.40   .36  2.99
      30  1.35  1.49  1.28  1.57  1.21  1.65  1.14  1.74  1.07  1.83   .71  2.36   .39  2.94
      31  1.36  1.50  1.30  1.57  1.23  1.65  1.16  1.74  1.09  1.83   .74  2.33   .43  2.99
      32  1.37  1.50  1.31  1.57  1.24  1.65  1.18  1.73  1.11  1.82   .77  2.31   .46  2.84
      33  1.38  1.51  1.32  1.58  1.26  1.65  1.19  1.73  1.13  1.81   .80  2.28   .49  2.80
      34  1.39  1.51  1.33  1.58  1.27  1.65  1.21  1.73  1.15  1.81   .82  2.26   .52  2.75
      35  1.40  1.52  1.34  1.53  1.28  1.65  1.22  1.73  1.16  1.80   .85  2.24   .55  2.72
      36  1.41  1.52  1.35  1.59  1.29  1.65  1.24  1.73  1.18  1.80   .87  2.22   .58  2.68
      37  1.42  1.53  1.36  1.59  1.31  1.66  1.25  1.72  1.19  1.80   .89  2.20   .60  2.65
      38  1.43  1.54  1.37  1.59  1.32  1.66  1.26  1.72  1.21  1.79   .91  2.18   .63  2.61
      39  1.43  1.54  1.38  1.60  1.33  1.66  1.27  1.72  1.22  1.79   .93  2.16   .65  2.59
      40  1.44  1.54  1.39  1.60  1.34  1.66  1.29  1.72  1.23  1.79   .95  2.15   .68  2.56
      45  1.48  1.57  1.43  1.62  1.38  1.67  1.34  1.72  1.29  1.78  1.04  2.09   .79  2.44
      50  1.50  1.59  1.46  1.63  1.42  1.67  1.38  1.72  1.34  1.77  1.11  2.04   .88  2.35
      55  1.53  1.60  1.49  1.64  1.45  1.68  1.41  1.72  1.38  1.77  1.17  2.01   .96  2.28
      60  1.55  1.62  1.51  1.65  1.48  1.69  1.44  1.73  1.41  1.77  1.22  1.98  1.03  2.23
      65  1.57  1.63  1.54  1.66  1.50  1.70  1.47  1.73  1.44  1.77  1.27  1.96  1.09  2.18
      70  1.58  1.64  1.55  1.67  1.52  1.70  1.49  1.74  1.46  1.77  1.30  1.95  1.14  2.15
      75  1.60  1.65  1.57  1.68  1.54  1.71  1.51  1.74  1.49  1.77  1.34  1.94  1.18  2.12
      80  1.61  1.66  1.59  1.69  1.56  1.72  1.53  1.74  1.51  1.77  1.37  1.93  1.22  2.09
      85  1.62  1.67  1.60  1.70  1.57  1.72  1.55  1.75  1.52  1.77  1.40  1.92  1.26  2.07
      90  1.63  1.68  1.61  1.70  1.59  1.73  1.57  1.75  1.54  1.78  1.42  1.91  1.29  2.06
      95  1.64  1.69  1.62  1.71  1.60  1.73  1.58  1.75  1.56  1.78  1.44  1.90  1.32  2.04
     100  1.65  1.69  1.63  1.72  1.61  1.74  1.59  1.76  1.57  1.78  1.46  1.90  1.35  2.03

    Source: Extracted from N. E. Savin and K. J. White, "The Durbin-Watson Test for Serial Correlation with Extreme Sample Sizes and Many Regressors," Econometrica, 45(8), Nov. 1977, pp. 1992-1995.
    Note: k is the number of regressors excluding the intercept.
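
    The bounds dL and dU in Table G.6 are compared with the Durbin-Watson statistic d, the sum of squared successive differences of the regression residuals divided by their sum of squares. In the usual one-sided test against positive first-order autocorrelation, the null hypothesis of no autocorrelation is rejected when d < dL, is not rejected when d > dU, and the test is inconclusive in between. The sketch below illustrates that rule; the function names and the residual series are made up for illustration, and only the bounds (n = 15, k = 1) come from the table.

```python
def durbin_watson(residuals):
    """d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

def bounds_test(d, d_lower, d_upper):
    """Bounds test of H0: no autocorrelation, against positive autocorrelation."""
    if d < d_lower:
        return "reject"
    if d > d_upper:
        return "do not reject"
    return "inconclusive"

# Hypothetical residuals (n = 15) that drift slowly, i.e. are positively autocorrelated.
RESIDUALS = [1.2, 1.0, 0.8, 0.5, 0.1, -0.3, -0.6, -0.9, -1.1,
             -0.8, -0.4, 0.0, 0.3, 0.7, 0.9]

# 5 percent bounds from Table G.6 for n = 15 and k = 1 regressor (excluding the intercept).
D_LOWER, D_UPPER = 1.08, 1.36

if __name__ == "__main__":
    d = durbin_watson(RESIDUALS)
    print(f"d = {d:.4f}: {bounds_test(d, D_LOWER, D_UPPER)}")
```

    Values of d near 2 are consistent with no first-order autocorrelation; values near 0 point to positive autocorrelation.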

    Greene 50240

    book

    June 7 2002

    22 36

    REFERENCES

    Q
    Abowd J and H Farber Job Queues and Union Status of Workers Industrial and Labor Relations Review 35 1982 pp 354 367 Abramovitz M and I Stegun Handbook of Mathematical Functions New York Dover Press 1971 Af eck Graves J and B McDonald Nonnormalities and Tests of Asset Pricing Theories Journal of Finance 44 1989 pp 889 908 A T and R Elashoff Missing Observations in Multivariate Statistics Journal of the American Statistical Association 61 1966 pp 595 604 A T and R Elashoff Missing Observations in Multivariate Statistics Journal of the American Statistical Association 62 1967 pp 10 29 Ahn S and P Schmidt Ef cient Estimation of Models for Dynamic Panel Data Journal of Econometrics 68 1 1995 pp 5 28 Aigner D MSE Dominance of Least Squares with Errors of Observation Journal of Econometrics 2 1974 pp 365 372 Aigner D K Lovell and P Schmidt Formulation and Estimation of Stochastic Frontier Production Models Journal of Econometrics 6 1977 pp 21 37 Aitchison J and J Brown The Lognormal Distribution with Special Reference to Its Uses in Economics New York Cambridge University Press 1969 Aitken A C On Least Squares and Linear Combinations of Observations Proceedings of the Royal Statistical Society 55 1935 pp 42 48 Akaike H Information Theory and an Extension of the Maximum Likelihood Principle In B Petrov and F Csake eds Second International Symposium on Information Theory Budapest Akademiai Kiado 1973 Akin J D Guilkey and R Sickles A Random Coef cient Probit Model with an Application to a Study of Migration Journal of Econometrics 11 1979 pp 233 246 Albert J and S Chib Bayesian Analysis of Binary and Polytomous Response Data Journal of the American Statistical Association 88 1993a pp 669 679 Albert J and S Chib Bayes Inference via Gibbs Sampling of Autoregressive Time Series Subject to Markov Mean and Variance Shifts Journal of Business and Economic Statistics 11 1993b pp 1 15 Aldrich J and F Nelson Linear Probability Logit and Probit Models Beverly Hills Sage 
Publications 1984 Ali M and C Giaccotto A Study of Several New and Existing Tests for Heteroscedasticity in the General Linear Model Journal of Econometrics 26 1984 pp 355 374 Allenby G and J Ginter The Effects of In Store Displays and Feature Advertising on Consideration Sets International Journal of Research in Marketing 12 1995 pp 67 80 Allison P Problems with Fixed Effects Negative Binomial Models Manuscript Department of Sociology University of Pennsylvania 2000 Almon S The Distributed Lag Between Capital Appropriations and Expenditures Econometrica 33 1965 pp 178 196 959

    Greene 50240

    book

    June 7 2002

    22 36

    960

    References

    Altonji J and R Matzkin Panel Data Estimators for Nonseparable Models with Endogenous Regressors NBER Working Paper t0267 Cambridge 2001 Alvarez R G Garrett and P Lange Government Partisanship Labor Organization and Macroeconomic Performance American Political Science Review 85 1991 pp 539 556 Amemiya T The Estimation of Variances in a Variance Components Model International Economic Review 12 1971 pp 1 13 Amemiya T Regression Analysis When the Dependent Variable Is Truncated Normal Econometrica 41 1973 pp 997 1016 Amemiya T Some Theorems in the Linear Probability Model International Economic Review 18 1977 pp 645 650 Amemiya T Qualitative Response Models A Survey Journal of Economic Literature 19 4 1981 pp 481 536 Amemiya T Tobit Models A Survey Journal of Econometrics 24 1984 pp 3 63 Amemiya T Advanced Econometrics Cambridge Harvard University Press 1985 Amemiya T and T MaCurdy Instrumental Variable Estimation of an Error Components Model Econometrica 54 1986 pp 869 881 Anderson E Asymptotic Properties of Conditional Maximum Likelihood Estimators Journal of the Royal Statistical Society Series B 32 1970 pp 283 301 Anderson G and R Blundell Estimation and Hypothesis Testing in Dynamic Singular Equation Systems Econometrica 50 1982 pp 1559 1572 Anderson R and J Thursby Con dence Intervals for Elasticity Estimators in Translog Models Review of Economics and Statistics 68 1986 pp 647 657 Anderson T The Statistical Analysis of Time Series New York John Wiley and Sons 1971 Anderson T and C Hsiao Estimation of Dynamic Models with Error Compo

    nents Journal of the American Statistical Association 76 1981 pp 598 606 Anderson T and C Hsiao Formulation and Estimation of Dynamic Models Using Panel Data Journal of Econometrics 18 1982 pp 67 82 Anderson T and H Rubin Estimation of the Parameters of a Single Equation in a Complete System of Stochastic Equations Annals of Mathematical Statistics 20 1949 pp 46 63 Anderson T and H Rubin The Asymptotic Properties of Estimators of the Parameters of a Single Equation in a Complete System of Stochastic Equations Annals of Mathematical Statistics 21 1950 pp 570 582 Andrews D A Robust Method for Multiple Linear Regression Technometrics 16 1974 pp 523 531 Andrews D Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation Econometrica 59 1991 pp 817 858 Andrews D Tests for Parameter Instability and Structural Change with Unknown Change Point Econometrica 61 1993 pp 821 856 Andrews D and R Fair Inference in Nonlinear Econometric Models with Structural Change Review of Economic Studies 55 1988 pp 615 640 Andrews D and W Ploberger Optimal Tests When a Nuisance Parameter is Present Only Under the Alternative Econometrica 62 1994 pp 1383 1414 Aneuryn Evans G and A Deaton Testing Linear versus Logarithmic Regression Models Review of Economic Studies 47 1980 pp 275 291 Angrist J Estimation of Limited Dependent Variable Models with Dummy Endogenous Regressors Simple Strategies for Empirical Practice Journal of Business and Economic Statistics 29 1 2001 pp 2 15 Arabmazar A and P Schmidt An Investigation into the Robustness of the Tobit

    Greene 50240

    book

    June 7 2002

    22 36

    References

    961

    Estimator to Nonnormality Econometrica 50 1982a pp 1055 1063 Arabmazar A and P Schmidt Further Evidence on the Robustness of the Tobit Estimator to Heteroscedasticity Journal of Econometrics 17 1982b pp 253 258 Arellano M Computing Robust Standard Errors for Within Groups Estimators Oxford Bulletin of Economics and Statistics 49 1987 pp 431 434 Arellano M A Note on the Anderson Hsiao Estimator for Panel Data Economics Letters 31 1989 pp 337 341 Arellano M Discrete Choices with Panel Data Investigaciones Economica Lecture 25 2000 Arellano M Panel Data Some Recent Developments in J Heckman and E Leamer eds Handbook of Econometrics Volume 5 North Holland Amsterdam 2001 Arellano M and S Bond Some Tests of Speci cation for Panel Data Monte Carlo Evidence and an Application to Employment Equations Review of Economics Studies 58 1991 pp 277 297 Arellano M and C Borrego Symmetrically Normalized Instrumental Variable Estimation Using Panel Data Journal of Business and Economic Statistics 17 1999 pp 36 49 Arellano M and O Bover Another Look at the Instrumental Variables Estimation of Error Components Models Journal of Econometrics 68 1 1995 pp 29 52 Arrow K H Chenery B Minhas and R Solow Capital Labor Substitution and Economic Ef ciency Review of Economics and Statistics 45 1961 pp 225 247 Ashenfelter O and J Heckman The Estimation of Income and Substitution Effects in a Model of Family Labor Supply Econometrica 42 1974 pp 73 85 Ashenfelter O and A Kreuger Estimates of the Economic Return to Schooling from a New Sample of Twins American Economic Review 84 1994 pp 1157 1173

    Att eld C Bartlett Adjustments for Systems of Linear Equations with Linear Restrictions Economics Letters 60 1998 pp 277 283 Avery R L Hansen and J Hotz Multiperiod Probit Models and Orthogonality Condition Estimation International Economic Review 24 1983 pp 21 35 Bai J Estimation of a Change Point in Multiple Regression Models Review of Economics and Statistics 79 1997 pp 551 563 Bai J Likelihood Ratio Tests for Multiple Structural Changes Journal of Econometrics 91 1999 pp 299 323 Bai J R Lumsdaine and J Stock Testing for and Dating Breaks in Integrated and Cointegrated Time Series mimeo Department of Economics MIT 1991 Bai J and P Perron Estimating and Testing Linear Models with Multiple Structural Changes Econometrica 66 1998a pp 47 78 Bai J and P Perron Testing for and Estimation of Multiple Structural Changes Econometrica 66 1998b pp 817 858 Baillie R The Asymptotic Mean Squared Error of Multistep Prediction From the Regression Model with Autoregressive Errors Journal of the American Statistical Association 74 1979 pp 175 184 Baillie R Long Memory Processes and Fractional Integration in Econometrics Journal of Econometrics 73 1 1996 pp 5 59 Balestra P and M Nerlove Pooling Cross Section and Time Series Data in the Estimation of a Dynamic Model The Demand for Natural Gas Econometrica 34 1966 pp 585 612 Baltagi B Pooling Under Misspeci cation Some Monte Carlo Evidence on the Kmenta and Error Components Techniques Econometric Theory 2 1986 pp 429 441 Baltagi B Applications of a Necessary and Suf cient Condition for OLS to be BLUE Statistics and Probability Letters 8 1989 pp 457 461

    Greene 50240

    book

    June 7 2002

    22 36

    962

    References

    Baltagi B Econometric Analysis of Panel Data New York John Wiley and Sons 1995 Baltagi G S Garvin and S Kerman Further Evidence on Seemingly Unrelated Regressions with Unequal Number of Observations Annales D Economie et de Statistique 14 1989 pp 103 115 Baltagi B and W Grif n A Generalized Error Component Model with Heteroscedastic Disturbances International Economic Review 29 1988 pp 745 753 Barnow B G Cain and A Goldberger Issues in the Analysis of Selectivity Bias In E Stromsdorfer and G Farkas eds Evaluation Studies Review Annual Vol 5 Beverly Hills Sage Publications 1981 Bartels R and D Feibig A Simple Characterization of Seemingly Unrelated Regressions Models in Which OLS is BLUE American Statistician 45 1992 pp 137 140 Barten A Maximum Likelihood Estimation of A Complete System of Demand Equations European Economic Review Fall 1 1969 pp 7 73 Bazzara M and C Shetty Nonlinear Programming Theory and Algorithms New York John Wiley and Sons 1979 Beach C and J MacKinnon A Maximum Likelihood Procedure for Regression with Autocorrelated Errors Econometrica 46 1978a pp 51 58 Beach C and J MacKinnon Full Maximum Likelihood Estimation of Second Order Autoregressive Error Models Journal of Econometrics 7 1978b pp 187 198 Beck N D Epstein and S Jackman Estimating Dynamic Time Series CrossSection Models with a Binary Dependent Variable Manuscript Department of Political Science University of California San Diego 2001 Beck N J Katz R Alvarez G Garrett and P Lange Government Partisanship Labor Organization and Macroeconomic Performance A Corrigendum American Political Science Review 87 4 1993 pp 945 948

    Beck N and J Katz What to Do and Not to Do with Time Series Cross Section Data in Comparative Politics American Political Science Review 89 1995 pp 634 647 Beggs S S Cardell and J Hausman Assessing the Potential Demand for Electric Cars Journal of Econometrics 17 1981 pp 19 20 Bekker P and T Wansbeek Identi cation in Parametric Models in Baltagi B ed A Companion to Theoretical Econometrics Blackwell Oxford 2001 Belsley D On the Ef cient Computation of the Nonlinear Full Information Maximum Likelihood Estimator Technical Report no 5 Center for Computational Research in Economics and Management Science Vol II Cambridge Mass 1980 Belsley D E Kuh and R Welsh Regression Diagnostics Identifying In uential Data and Sources of Collinearity John Wiley and Sons New York 1980 Ben Akiva M and S Lerman Discrete Choice Analysis London MIT Press 1985 Ben Porath Y Labor Force Participation Rates and Labor Supply Journal of Political Economy 81 1973 pp 697 704 Bera A and C Jarque Ef cient Tests for Normality Heteroscedasticity and Serial Independence of Regression Residuals Monte Carlo Evidence Economics Letters 7 1981 pp 313 318 Bera A and C Jarque Model Speci cation Tests A Simultaneous Approach Journal of Econometrics 20 1982 pp 59 82 Bera A C Jarque and L Lee Testing for the Normality Assumption in Limited Dependent Variable Models Mimeo Department of Economics University of Minnesota 1982 Bernard J and M Veall The Probability Distribution of Future Demand Journal of Business and Economic Statistics 5 1987 pp 417 424 Berndt E The Practice of Econometrics Reading Mass Addison Wesley 1990

    Greene 50240

    book

    June 7 2002

    22 36

    References

    963

    Berndt E and L Christensen The Translog Function and the Substitution of Equipment Structures and Labor in U S Manufacturing 1929 1968 Journal of Econometrics 1 1973 pp 81 114 Berndt E B Hall R Hall and J Hausman Estimation and Inference in Nonlinear Structural Models Annals of Economic and Social Measurement 3 4 1974 pp 653 665 Berndt E and E Savin Con ict Among Criteria for Testing Hypotheses in the Multivariate Linear Regression Model Econometrica 45 1977 pp 1263 1277 Berndt E and D Wood Technology Prices and the Derived Demand for Energy Review of Economics and Statistics 57 1975 pp 376 384 Berry S J Levinsohn and A Pakes Automobile Prices in Market Equilibrium Econometrica 63 4 1995 pp 841 890 Bertschek I and M Lechner Convenient Estimators for the Panel Probit Model Journal of Econometrics 87 2 1998 pp 329 372 Berzeg K The Error Components Model Conditions for the Existence of Maximum Likelihood Estimates Journal of Econometrics 10 1979 pp 99 102 Beyer A Modelling Money Demand in Germany Journal of Applied Econometrics 13 1 1998 pp 57 76 Bhargava A and J Sargan Estimating Dynamic Random Effects Models from Panel Data Covering Short Periods Econometrica 51 1983 pp 221 236 Bhat C A Heteroscedastic Extreme Value Model of Intercity Mode Choice Working paper Department of Civil Engineering University of Massachusetts Amherst 1995 Transportation Research 30 1 pp 16 29 Bhat C Accommodating Variations in Responsiveness to Level of Service Measures in Travel Mode Choice Modeling Department of Civil Engineering University of Massachusetts Amherst 1996 Bhat C Quasi Random Maximum Simulated Likelihood Estimation of the Mixed

    Multinomial Logit Model Manuscript Department of Civil Engineering University of Texas Austin 1999 Bickel P and K Doksum Mathematical Statistics San Francisco Holden Day 2000 Billingsley P Probability and Measure New York John Wiley and Sons 1979 Binkley J The Effect of Variable Correlation on the Ef ciency of Seemingly Unrelated Regression in a Two Equation Model Journal of the American Statistical Association 77 1982 pp 890 895 Binkley J and C Nelson A Note on the Ef ciency of Seemingly Unrelated Regression American Statistician 42 1988 pp 137 139 Birkes D and Y Dodge Alternative Methods of Regression New York John Wiley and Sons 1993 Black F Capital Market Equilibrium with Restricted Borrowing Journal of Business 44 1972 pp 444 454 Blanchard O and D Quah The Dynamic Effects of Aggregate Demand and Supply Disturbances American Economic Review 79 1989 pp 655 673 Blundell R ed Speci cation Testing in Limited and Discrete Dependent Variable Models Journal of Econometrics 34 1 2 1987 pp 1 274 Blundell R and S Bond Initial Conditions and Moment Restrictions in Dynamic Panel Data Models Journal of Econometrics 87 1998 pp 115 143 Blundell R F Laisney and M Lechner Alternative Interpretations of Hours Information in an Econometric Model of Labour Supply Empirical Economics 18 1993 pp 393 415 Bockstael N I Strand K McConnell and F Arsanjani Sample Selection Bias in the Estimation of Recreation Demand Functions An Application to Sport Fishing Land Economics 66 1990 pp 40 49 Bollerslev T Generalized Autoregressive Conditional Heteroscedasticity Journal of Econometrics 31 1986 pp 307 327

    Greene 50240

    book

    June 7 2002

    22 36

    964

    References

    Bollerslev T R Chou and K Kroner ARCH Modeling in Finance Journal of Econometrics 52 1992 pp 5 59 Bollerslev T and E Ghysels Periodic Autoregressive Conditional Heteroscedasticity Journal of Business and Economic Statistics 14 1996 pp 139 151 Boot J and G deWitt Investment Demand An Empirical Contribution to the Aggregation Problem International Economic Review 1 1960 pp 3 30 Borsch Supan A and V Hajivassiliou Smooth Unbiased Multivariate Probability Simulators for Maximum Likelihood Estimation of Limited Dependent Variable Models Journal of Econometrics 58 3 1990 pp 347 368 Boskin M A Conditional Logit Model of Occupational Choice Journal of Political Economy 82 1974 pp 389 398 Bover O and M Arellano Estimating Dynamic Limited Dependent Variable Models from Panel Data Investigaciones Economicas Econometrics Special Issue 21 1997 pp 141 165 Box G and D Cox An Analysis of Transformations Journal of the Royal Statistical Society 1964 Series B 1964 pp 211 264 Box G and G Jenkins Time Series Analysis Forecasting and Control 2nd ed San Francisco Holden Day 1984 Box G and M Muller A Note on the Generation of Random Normal Deviates Annals of Mathematical Statistics 29 1958 pp 610 611 Box G and D Pierce Distribution of Residual Autocorrelations in Autoregressive Moving Average Time Series Models Journal of the American Statistical Association 65 1970 pp 1509 1526 Boyes W D Hoffman and S Low An Econometric Analysis of the Bank Credit Scoring Problem Journal of Econometrics 40 1989 pp 3 14 Brannas K Explanatory Variables in the AR 1 Count Data Model Working Paper No 381 Department of Economics University of Umea Sweden 1995

    Brannas K and P Johanssen Panel Data Regressions for Counts Manuscript Department of Economics University of Umea Sweden 1994 Breslaw J Evaluation of Multivariate Normal Probabilities Using a Low Variance Simulator Review of Economics and Statistics 76 1994 pp 673 682 Breusch T Testing for Autocorrelation in Dynamic Linear Models Australian Economic Papers 17 1978 pp 334 355 Breusch T and A Pagan A Simple Test for Heteroscedasticity and Random Coef cient Variation Econometrica 47 1979 pp 1287 1294 Breusch T and A Pagan The LM Test and Its Applications to Model Speci cation in Econometrics Review of Economic Studies 47 1980 pp 239 254 Brock W and S Durlauf Discrete Choice with Social Interactions Working paper 2007 Department of Economics University of Wisconsin Madison 2001 Brown B J Durbin and J Evans Techniques for Testing the Constancy of Regression Relationships Over Time Journal of the Royal Statistical Society Series B 37 1975 pp 149 172 Brown B and M Walker Stochastic Speci cation in Random Production Models of Cost Minimizing Firms Journal of Econometrics 66 1995 pp 175 205 Brown C and R Mof tt The Effect of Ignoring Heteroscedasticity on Estimates of the Tobit Model Mimeo University of Maryland Department of Economics June 1982 Brundy J and D Jorgenson Consistent and Ef cient Estimation of Systems of Simultaneous Equations by Means of Instrumental Variables Review of Economics and Statistics 53 1971 pp 207 224 Burnett N Gender Economics Courses in Liberal Arts Colleges Journal of Economic Education 28 4 1997 pp 369 377 Burnside C and M Eichenbaum SmallSample Properties of GMM Based Wald

    Greene 50240

    book

    June 7 2002

    22 36

    References

    965

    Tests Journal of Business and Economic Statistics 14 3 1996 pp 294 308 Buse A Goodness of Fit in Generalized Least Squares Estimation American Statistician 27 1973 pp 106 108 Buse A The Likelihood Ratio Wald and Lagrange Multiplier Tests An Expository Note American Statistician 36 1982 pp 153 157 Butler J and R Mof tt A Computationally Ef cient Quadrature Procedure for the One Factor Multinomial Probit Model Econometrica 50 1982 pp 761 764 Butler R J McDonald R Nelson and S White Robust and Partially Adaptive Estimation of Regression Models Review of Economics and Statistics 72 1990 pp 321 327 Cameron A and P Trivedi Econometric Models Based on Count Data Comparisons and Applications of Some Estimators and Tests Journal of Applied Econometrics 1 1986 pp 29 54 Cameron A and P Trivedi Regression Based Tests for Overdispersion in the Poisson Model Journal of Econometrics 46 1990 pp 347 364 Cameron C and P Trivedi Regression Analysis of Count Data New York Cambridge University Press 1998 Cameron C and F Windmeijer R Squared Measures for Count Data Regression Models with Applications to Health Care Utilization Working Paper No 93 24 Department of Economics University of California Davis 1993 Campbell J A Lo and A MacKinlay The Econometrics of Financial Markets Princeton Princeton University Press 1997 Campbell J and G Mankiw Consumption Income and Interest Rates Reinterpreting the Time Series Evidence Working Paper 2924 NBER Cambridge Mass 1989 Campbell J and P Perron Pitfalls and Opportunities What Macroeconomists Should Know About Unit Roots National Bureau of Economic Re

    search Macroeconomics Conference Cambridge Mass February 1991 Carlin B and S Chib Bayesian Model Choice via Markov Chain Monte Carlo Journal of the Royal Statistical Society Series B 57 1995 pp 408 417 Casella G and E George Explaining the Gibbs Sampler American Statistician 46 3 1992 pp 167 174 Caudill S An Advantage of the Linear Probability Model Over Probit or Logit Oxford Bulletin of Economics and Statistics 50 1988 pp 425 427 Caves D L Christensen and M Trethaway Flexible Cost Functions for Multiproduct Firms Review of Economics and Statistics 62 1980 pp 477 481 Cecchetti S Comment in Monetary Policy G Mankiw ed Chicago University of Chicago Press 1994 Cecchetti S and R Rich Structural Estimates of the U S Sacri ce Ratio Journal of Business and Economic Statistics 19 4 2001 pp 416 427 Chamberlain G Omitted Variable Bias in Panel Data Estimating the Returns to Schooling Annales de L Insee 30 31 1978 pp 49 82 Chamberlain G Analysis of Covariance with Qualitative Data Review of Economic Studies 47 1980 pp 225 238 Chamberlain G Panel Data In Z Griliches and M Intriligator eds Handbook of Econometrics Amsterdam North Holland 1984 Chamberlain G Heterogeneity Omitted Variable Bias and Duration Dependence in J Heckman and B Singer eds Longitudinal Analysis of Labor Market Data Cambridge University Press Cambridge 1985 Chamberlain G Asymptotic Ef ciency in Estimation with Conditional Moment Restrictions Journal of Econometrics 34 1987 pp 305 334 Chamberlain G and E Leamer Matrix Weighted Averages and Posterior Bounds Journal of the Royal Statistical Society Series B 1976 pp 73 84

    Greene 50240

    book

    June 7 2002

    22 36

    966

    References

    Chambers R Applied Production Analysis A Dual Approach New York Cambridge University Press 1988 Charlier E B Melenberg and A Van Soest A Smoothed Maximum Score Estimator for the Binary Choice Panel Data Model with an Application to Labor Force Participation Statistica Neerlander 49 1995 pp 324 343 Chat eld C The Analysis of Time Series An Introduction 5th ed London Chapman and Hall 1996 Chavez J and K Segerson Stochastic Speci cation and Estimation of Share Equation Systems Journal of Econometrics 35 1987 pp 337 358 Chen T Root N Consistent Estimation of a Panel Data Sample Selection Model Hong Kong University of Science and Technology Manuscript 1998 Chesher A and M Irish Residual Analysis in the Grouped Data and Censored Normal Linear Model Journal of Econometrics 34 1987 pp 33 62 Chesher A T Lancaster and M Irish On Detecting the Failure of Distributional Assumptions Annales de L Insee 59 60 1985 pp 7 44 Cheung C and A Goldberger Proportional Projections in Limited Dependent Variable Models Econometrica 52 1984 pp 531 534 Cheung Y Long Memory in ForeignExchange Rates Journal of Business and Economic Statistics 11 1 1993 pp 93 102 Chiappori R Econometric Models of Insurance Under Asymmetric Information Manuscript Department of Economics University of Chicago 1998 Chib S Bayes Regression for the Tobit Censored Regression Model Journal of Econometrics 51 1992 pp 79 99 Chib S and E Greenberg Markov Chain Monte Carlo Simulation Methods in Econometrics Econometric Theory 12 1996 pp 409 431 Chou R Volatility Persistence and Stock Valuations Some Empirical Evidence

    Using GARCH Journal of Applied Econometrics 3 1988 pp 279 294 Chow G Tests of Equality Between Sets of Coef cients in Two Linear Regressions Econometrica 28 1960 pp 591 605 Chow G Random and Changing Coef cient Models In Z Griliches and M Intriligator eds Handbook of Econometrics Volume 2 North Holland Amsterdam 1984 Christensen L and W Greene Economies of Scale in U S Electric Power Generation Journal of Political Economy 84 1976 pp 655 676 Christensen L D Jorgenson and L Lau Transcendental Logarithmic Utility Functions American Economic Review 65 1975 pp 367 383 Cleveland W Robust Locally Weighted Regression and Smoothing Scatter Plots Journal of the American Statistical Association 74 1979 pp 829 836 Cochrane D and G Orcutt Application of Least Squares Regression to Relationships Containing Autocorrelated Error Terms Journal of the American Statistical Association 44 1949 pp 32 61 Conniffe D Covariance Analysis and Seemingly Unrelated Regression Equations American Statistician 36 1982a pp 169 171 Conniffe D A Note on Seemingly Unrelated Regressions Econometrica 50 1982b pp 229 233 Conniffe D Estimating Regression Equations with Common Explanatory Variables But Unequal Numbers of Observations Journal of Econometrics 27 1985 pp 179 196 Conway D and H Roberts Reverse Regression Fairness and Employment Discrimination Journal of Business and Economic Statistics 1 1 1983 pp 75 85 Cooley T and S LeRoy A theoretical Macroeconomics A Critique Journal of Monetary Economics 16 1985 pp 283 308 Cornwell C and P Schmidt Panel Data with Cross Sectional Variation in Slopes

    as Well as in Intercept Econometrics Workshop Paper No 8404 Michigan State University Department of Economics 1984 Coulson N and R Robins Aggregate Economic Activity and the Variance of Inflation Another Look Economics Letters 17 1985 pp 71 75 Council of Economic Advisors Economic Report of the President Washington DC United States Government Printing Office 1994 Cox D Tests of Separate Families of Hypotheses Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability Vol 1 Berkeley University of California Press 1961 Cox D Further Results on Tests of Separate Families of Hypotheses Journal of the Royal Statistical Society Series B 24 1962 pp 406 424 Cox D Analysis of Binary Data London Methuen 1970 Cox D Regression Models and Life Tables Journal of the Royal Statistical Society Series B 34 1972 pp 187 220 Cox D and D Oakes Analysis of Survival Data New York Chapman and Hall 1985 Cragg J On the Relative Small Sample Properties of Several Structural Equation Estimators Econometrica 35 1967 pp 89 110 Cragg J Some Statistical Models for Limited Dependent Variables with Application to the Demand for Durable Goods Econometrica 39 1971 pp 829 844 Cragg J Estimation and Testing in Time Series Regression Models with Heteroscedastic Disturbances Journal of Econometrics 20 1982 pp 135 157 Cragg J More Efficient Estimation in the Presence of Heteroscedasticity of Unknown Form Econometrica 51 1983 pp 751 763 Cragg J Using Higher Moments to Estimate the Simple Errors in Variables Model Rand Journal of Economics 28 0 1997 pp S71 S91

    Cragg J and R Uhler The Demand for Automobiles Canadian Journal of Economics 3 1970 pp 386 406 Cramér H Mathematical Methods of Statistics Princeton Princeton University Press 1948 Cramer J Predictive Performance of the Binary Logit Model in Unbalanced Samples Journal of the Royal Statistical Society Series D The Statistician 48 1999 pp 85 94 Cumby R J Huizinga and M Obstfeld Two Step Two Stage Least Squares Estimation in Models with Rational Expectations Journal of Econometrics 21 1983 pp 333 355 Dahlberg M and E Johansson An Examination of the Dynamic Behaviour of Local Governments Using GMM Bootstrapping Methods Journal of Applied Econometrics 15 2000 pp 401 416 Dastoor N Some Aspects of Testing Nonnested Hypotheses Journal of Econometrics 21 1983 pp 213 228 Davidson J Econometric Theory Oxford Blackwell 2000 Davidson R and J MacKinnon Several Tests for Model Specification in the Presence of Alternative Hypotheses Econometrica 49 1981 pp 781 793 Davidson R and J MacKinnon Convenient Specification Tests for Logit and Probit Models Journal of Econometrics 25 1984 pp 241 262 Davidson R and J MacKinnon Testing Linear and Loglinear Regressions Against Box Cox Alternatives Canadian Journal of Economics 18 1985 pp 499 517 Davidson R and J MacKinnon Estimation and Inference in Econometrics New York Oxford University Press 1993 Deaton A Demand Analysis In Z Griliches and M Intriligator eds Handbook of Econometrics Vol 1 Amsterdam North Holland 1983 Deaton A and J Muellbauer Economics and Consumer Behavior New York Cambridge University Press 1980

    Deaton A Model Selection Procedures or Does the Consumption Function Exist In Evaluating the Reliability of Macroeconomic Models Chow G and P Corsi eds John Wiley and Sons New York 1982 Debreu G The Coefficient of Resource Utilization Econometrica 19 3 1951 pp 273 292 Dempster A N Laird and D Rubin Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm Journal of the Royal Statistical Society Series B 39 1977 pp 1 38 DesChamps P Full Maximum Likelihood Estimation of Dynamic Demand Models Journal of Econometrics 82 1998 pp 335 359 Dezhbaksh H The Inappropriate Use of Serial Correlation Tests in Dynamic Linear Models Review of Economics and Statistics 72 1990 pp 126 132 Dhrymes P Distributed Lags Problems of Estimation and Formulation San Francisco Holden Day 1971 Dhrymes P Restricted and Unrestricted Reduced Forms Econometrica 41 1973 pp 119 134 Dhrymes P Limited Dependent Variables In Z Griliches and M Intriligator eds Handbook of Econometrics Vol 2 Amsterdam North Holland 1984 Dhrymes P Time Series Unit Roots and Cointegration New York Academic Press 1998 Dhrymes P Specification Tests in Simultaneous Equation Systems Journal of Econometrics 64 1994 pp 45 76 Dickey D W Bell and R Miller Unit Roots in Time Series Models Tests and Implications American Statistician 40 1 1986 pp 12 26 Dickey D and W Fuller Distribution of the Estimators for Autoregressive Time Series with a Unit Root Journal of the American Statistical Association 74 1979 pp 427 431 Dickey D and W Fuller Likelihood Ratio Tests for Autoregressive Time Series with

    a Unit Root Econometrica 49 1981 pp 1057 1072 Dickey D D Jansen and D Thornton A Primer on Cointegration with an Application to Money and Income Federal Reserve Bank of St Louis Review 73 2 1991 pp 58 78 Diebold F The Past Present and Future of Macroeconomic Forecasting Journal of Economic Perspectives 12 2 1998a pp 175 192 Diebold F Elements of Forecasting Cincinnati South Western Publishing 1998b Diebold F and M Nerlove Unit Roots in Economic Time Series A Selective Survey In T Bewley ed Advances in Econometrics Vol 8 New York JAI Press 1990 Dielman T Pooled Cross Sectional and Time Series Data Analysis New York Marcel Dekker 1989 Diewert E Applications of Duality Theory In M Intriligator and D Kendrick Frontiers in Quantitative Economics Amsterdam North Holland 1974 Diggle R P Liang and S Zeger Analysis of Longitudinal Data Oxford University Press Oxford 1994 Ding Z C Granger and R Engle A Long Memory Property of Stock Returns and a New Model Journal of Empirical Finance 1 1993 pp 83 106 Domowitz I and C Hakkio Conditional Variance and the Risk Premium in the Foreign Exchange Market Journal of International Economics 19 1985 pp 47 66 Doan T RATS User s Manual Evanston Ill Estima 1996 Doob J Stochastic Process John Wiley and Sons New York 1953 Duncan G Sample Selectivity as a Proxy Variable Problem On the Use and Misuse of Gaussian Selectivity Corrections Research in Labor Economics Supplement 2 1983 pp 333 345 Duncan G A Semiparametric Censored Regression Estimator Journal of Econometrics 31 1986a pp 5 34

    Duncan G ed Continuous Discrete Econometric Models with Unspecified Error Distribution Journal of Econometrics 32 1 1986b pp 1 187 Durbin J Errors in Variables Review of the International Statistical Institute 22 1954 pp 23 32 Durbin J Testing for Serial Correlation in Least Squares Regression When Some of the Regressors Are Lagged Dependent Variables Econometrica 38 1970 pp 410 421 Durbin J and G Watson Testing for Serial Correlation in Least Squares Regression I Biometrika 37 1950 pp 409 428 Durbin J and G Watson Testing for Serial Correlation in Least Squares Regression II Biometrika 38 1951 pp 159 178 Durbin J and G Watson Testing for Serial Correlation in Least Squares Regression III Biometrika 58 1971 pp 1 42 Dwivedi T and K Srivastava Optimality of Least Squares in the Seemingly Unrelated Regressions Model Journal of Econometrics 7 1978 pp 391 395 Efron B Regression and ANOVA with Zero One Data Measures of Residual Variation Journal of the American Statistical Association 73 1978 pp 113 212 Efron B Bootstrapping Methods Another Look at the Jackknife Annals of Statistics 7 1979 pp 1 26 Efron B and R Tibshirani An Introduction to the Bootstrap New York Chapman and Hall 1993 Eicker F Limit Theorems for Regression with Unequal and Dependent Errors In L LeCam and J Neyman Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability Berkeley University of California Press 1967 pp 59 82 Elliot G T Rothenberg and J Stock Efficient Tests for an Autoregressive Unit Root Econometrica 64 1996 pp 813 836 Enders W Applied Econometric Time Series New York John Wiley and Sons 1995

    Engle R Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflations Econometrica 50 1982 pp 987 1008 Engle R Estimates of the Variance of U S Inflation Based on the ARCH Model Journal of Money Credit and Banking 15 1983 pp 286 301 Engle R Wald Likelihood Ratio and Lagrange Multiplier Tests in Econometrics In Z Griliches and M Intriligator eds Handbook of Econometrics Vol 2 Amsterdam North Holland 1984 Engle R and C Granger Co integration and Error Correction Representation Estimation and Testing Econometrica 35 1987 pp 251 276 Engle R and D Hendry Testing Super Exogeneity and Invariance Journal of Econometrics 56 1993 pp 119 139 Engle R D Hendry and J Richard Exogeneity Econometrica 51 1983 pp 277 304 Engle R D Hendry and D Trumble Small Sample Properties of ARCH Estimators and Tests Canadian Journal of Economics 18 1985 pp 66 93 Engle R and D Kraft Multiperiod Forecast Error Variances of Inflation Estimated from ARCH Models In A Zellner ed Applied Time Series Analysis of Economic Data Washington D C Bureau of the Census 1983 Engle R D Lilien and R Robins Estimating Time Varying Risk Premia in the Term Structure The ARCH M Model Econometrica 55 1987 pp 391 407 Engle R and D McFadden eds Handbook of Econometrics Vol 4 Amsterdam North Holland 1994 Engle R and M Rothschild ARCH Models in Finance Journal of Econometrics 52 1992 pp 1 311 Engle R and B Yoo Forecasting and Testing in Cointegrated Systems Journal of Econometrics 35 1987 pp 143 159 Estes E and B Honore Partially Linear Regression Using one Nearest Neighbor

    Manuscript Department of Economics Princeton University 1995 Evans M N Hastings and B Peacock Statistical Distributions 2nd ed New York John Wiley and Sons 1993 Evans G and N Savin Testing for Unit Roots I Econometrica 49 1981 pp 753 779 Evans G and N Savin Testing for Unit Roots II Econometrica 52 1984 pp 1241 1269 Fair R A Note on Computation of the Tobit Estimator Econometrica 45 1977 pp 1723 1727 Fair R A Theory of Extramarital Affairs Journal of Political Economy 86 1978 pp 45 61 Fair R Specification and Analysis of Macroeconomic Models Cambridge Harvard University Press 1984 Farrell M The Measurement of Productive Efficiency Journal of the Royal Statistical Society Series A General 120 part 3 1957 pp 253 291 Feibig D Seemingly Unrelated Regression in Baltagi B ed A Companion to Theoretical Econometrics Blackwell Oxford 2001 Feibig D R Bartels and D Aigner A Random Coefficient Approach to the Estimation of End Use Load Profiles Journal of Econometrics 50 1991 pp 297 328 Feldstein M The Error of Forecast in Econometric Models When the Forecast Period Exogenous Variables are Stochastic Econometrica 39 1971 pp 55 60 Fernandez A and J Rodriguez Poo Estimation and Testing in Female Labor Participation Models Parametric and Semiparametric Models Econometric Reviews 16 1997 pp 229 248 Fernandez L Nonparametric Maximum Likelihood Estimation of Censored Regression Models Journal of Econometrics 32 1 1986 pp 35 38 Fin T and P Schmidt A Test for the Tobit Specification versus an Alternative Suggested by Cragg Review of Economics and Statistics 66 1984 pp 174 177

    Finney D Probit Analysis Cambridge Cambridge University Press 1971 Fiorentini G G Calzolari and L Panattoni Analytic Derivatives and the Computation of GARCH Estimates Journal of Applied Econometrics 11 1996 pp 399 417 Fisher F Tests of Equality Between Sets of Coefficients in Two Linear Regressions An Expository Note Econometrica 28 1970 pp 361 366 Fisher G and D Nagin Random Versus Fixed Coefficient Quantal Choice Models In C Manski and D McFadden eds Structural Analysis of Discrete Data with Econometric Applications Cambridge MIT Press 1981 Fisher R The Theory of Statistical Estimation Proceedings of the Cambridge Philosophical Society 22 1925 pp 700 725 Fletcher R Practical Methods of Optimization New York John Wiley and Sons 1980 Florens J D Fougere and M Mouchart Duration Models In L Matyas and P Sevestre The Econometrics of Panel Data 2nd ed Norwell Mass Kluwer 1996 Fomby T C Hill and S Johnson Advanced Econometric Methods Needham Mass Springer Verlag 1984 French K W Schwert and R Stambaugh Expected Stock Returns and Volatility Journal of Financial Economics 19 1987 pp 3 30 Friedman M A Theory of the Consumption Function Princeton Princeton University Press 1957 Frisch R Editorial Econometrica 1 1933 pp 1 4 Frisch R and F Waugh Partial Time Regressions as Compared with Individual Trends Econometrica 1 1933 pp 387 401 Fry J T Fry and K McLaren The Stochastic Specification of Demand Share Equations Restricting Budget Shares to the

    Unit Simplex Journal of Econometrics 73 1996 pp 377 386 Fuller W Introduction to Statistical Time Series New York John Wiley and Sons 1976 Fuller W and G Battese Estimation of Linear Models with Crossed Error Structure Journal of Econometrics 2 1974 pp 67 78 Gabrielsen A Consistency and Identifiability Journal of Econometrics 8 1978 pp 261 263 Gali J How Well Does the IS LM Model Fit Postwar U S Data Quarterly Journal of Economics 107 1992 pp 709 738 Gallant A Nonlinear Statistical Models New York John Wiley and Sons 1987 Gallant A and A Holly Statistical Inference in an Implicit Nonlinear Simultaneous Equation in the Context of Maximum Likelihood Estimation Econometrica 48 1980 pp 697 720 Gallant R and H White A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models Oxford Basil Blackwell 1988 Garber S and S Klepper Extending the Classical Normal Errors in Variables Model Econometrica 48 1980 pp 1541 1546 Garber S and D Poirier The Determinants of Aerospace Profit Rates Southern Economic Journal 41 1974 pp 228 238 Gaver K and M Geisel Discriminating Among Alternative Models Bayesian and Non Bayesian Methods In P Zarembka ed Frontiers in Econometrics New York Academic Press 1974 Gelfand A and A Smith Sampling Based Approaches to Calculating Marginal Densities Journal of the American Statistical Association 85 1990 pp 398 409 Gelman A J Conlen H Stern and D Rubin Bayesian Data Analysis Suffolk Chapman and Hall 1995 Gerfin M Parametric and Semi Parametric Estimation of the Binary Response Model Journal of Applied Econometrics 11 1996 pp 321 340

    Geweke J Inference and Causality in Econometric Time Series Models In Z Griliches and M Intriligator eds Handbook of Econometrics Vol 2 Amsterdam North Holland 1984 Geweke J Exact Inference in the Inequality Constrained Normal Linear Regression Model Journal of Applied Econometrics 2 1986 pp 127 142 Geweke J Antithetic Acceleration of Monte Carlo Integration in Bayesian Inference Journal of Econometrics 38 1988 pp 73 90 Geweke J Bayesian Inference in Econometric Models Using Monte Carlo Integration Econometrica 57 1989 pp 1317 1340 Geweke J M Keane and D Runkle Alternative Computational Approaches to Inference in the Multinomial Probit Model Review of Economics and Statistics 76 1994 pp 609 632 Geweke J M Keane and D Runkle Statistical Inference in the Multinomial Multiperiod Probit Model Journal of Econometrics 81 1 1997 pp 125 166 Geweke J and R Meese Estimating Regression Models of Finite but Unknown Order International Economic Review 22 1981 pp 55 70 Geweke J R Meese and W Dent Comparing Alternative Tests of Causality in Temporal Systems Analytic Results and Experimental Evidence Journal of Econometrics 21 1983 pp 161 194 Geweke J and S Porter Hudak The Estimation and Application of Long Memory Time Series Models Journal of Time Series Analysis 4 1983 pp 221 238 Godfrey L Testing Against General Autoregressive and Moving Average Error Models When the Regressors Include Lagged Dependent Variables Econometrica 46 1978 pp 1293 1302 Godfrey L Misspecification Tests in Econometrics Cambridge Cambridge University Press 1988 Godfrey L and H Pesaran Tests of Nonnested Regression Models After

    Estimation by Instrumental Variables or Least Squares Journal of Econometrics 21 1983 pp 133 154 Godfrey L and M Wickens Tests of Misspecification Using Locally Equivalent Alternative Models In G Chow and P Corsi eds Evaluating the Reliability of Econometric Models New York John Wiley and Sons 1982 pp 71 99 Goffe W G Ferrier and J Rodgers Global Optimization of Statistical Functions with Simulated Annealing Journal of Econometrics 60 1 2 1994 pp 65 100 Goldberger A Best Linear Unbiased Prediction in the Generalized Regression Model Journal of the American Statistical Association 57 1962 pp 369 375 Goldberger A Econometric Theory New York John Wiley and Sons 1964 Goldberger A Estimation of a Regression Coefficient Matrix Containing a Block of Zeroes University of Wisconsin SSRI EME Number 7002 1970 Goldberger A Selection Bias in Evaluating Treatment Effects Some Formal Illustrations Discussion Paper 123 72 Institute for Research on Poverty University of Wisconsin Madison 1972 Goldberger A Linear Regression After Selection Journal of Econometrics 15 1981 pp 357 366 Goldberger A Abnormal Selection Bias In S Karlin T Amemiya and L Goodman eds Studies in Econometrics Time Series and Multivariate Statistics New York Academic Press 1983 Goldberger A A Course in Econometrics Harvard University Press Cambridge 1991 Goldfeld S The Demand for Money Revisited Brookings Papers on Economic Activity 3 Washington D C Brookings Institution 1973 Goldfeld S and R Quandt Some Tests for Homoscedasticity Journal of the American Statistical Association 60 1965 pp 539 547 Goldfeld S and R Quandt Nonlinear Simultaneous Equations Estimation and

    Prediction International Economic Review 9 1968 pp 113 136 Goldfeld S and R Quandt Nonlinear Methods in Econometrics Amsterdam North Holland 1971 Goldfeld S and R Quandt GQOPT A Package for Numerical Optimization of Functions Department of Economics Princeton University 1972 Goldfeld S R Quandt and H Trotter Maximization by Quadratic Hill Climbing Econometrica 1966 pp 541 551 Gordin M The Central Limit Theorem for Stationary Processes Soviet Mathematical Dokl 10 1969 pp 1174 1176 Gourieroux C and A Monfort Testing Non Nested Hypotheses In Z Griliches and M Intriligator eds Handbook of Econometrics Vol 4 Amsterdam North Holland 1994 Gourieroux C and A Monfort Testing Encompassing and Simulating Dynamic Econometric Models Econometric Theory 11 1995 pp 195 228 Gourieroux C and A Monfort Simulation Based Methods Econometric Methods Oxford Oxford University Press 1996 Gourieroux C A Monfort E Renault and A Trognon Generalized Residuals Journal of Econometrics 34 1987 pp 5 32 Gourieroux C A Monfort and A Trognon Testing Nested or Nonnested Hypotheses Journal of Econometrics 21 1983 pp 83 115 Gourieroux C A Monfort and A Trognon Pseudo Maximum Likelihood Methods Applications to Poisson Models Econometrica 52 1984 pp 701 720 Granger C Investigating Causal Relations by Econometric Models and Cross Spectral Methods Econometrica 37 1969 pp 424 438 Granger C Some Properties of Time Series Data and their Use in Econometric Model Specification Journal of Econometrics 16 1981 pp 121 130

    Granger C and Z Ding Varieties of Long Memory Models Journal of Econometrics 73 1996 pp 61 78 Granger C and R Joyeux An Introduction to Long Memory Time Series Models and Fractional Differencing Journal of Time Series Analysis 1 1980 pp 15 39 Granger C and P Newbold Spurious Regressions in Econometrics Journal of Econometrics 2 1974 pp 111 120 Granger C and P Newbold Forecasting Economic Time Series 2nd ed New York Academic Press 1996 Granger C and M Pesaran A Decision Theoretic Approach to Forecast Evaluation in W S Chan W Li and H Tong eds Statistics and Finance An Interface Imperial College Press London 2000 Granger C and M Watson Time Series and Spectral Methods in Econometrics In Z Griliches and M Intriligator eds Handbook of Econometrics Vol 2 Amsterdam North Holland 1984 Greenberg E and C Webster Advanced Econometrics A Bridge to the Literature New York John Wiley and Sons 1983 Greene W Maximum Likelihood Estimation of Econometric Frontier Functions Journal of Econometrics 13 1980a pp 27 56 Greene W On the Asymptotic Bias of the Ordinary Least Squares Estimator of the Tobit Model Econometrica 48 1980b pp 505 514 Greene W Sample Selection Bias as a Specification Error Comment Econometrica 49 1981 pp 795 798 Greene W Estimation of Limited Dependent Variable Models by Ordinary Least Squares and the Method of Moments Journal of Econometrics 21 1983 pp 195 212 Greene W A Gamma Distributed Stochastic Frontier Model Journal of Econometrics 46 1990 pp 141 163 Greene W A Statistical Model for Credit Scoring Working Paper No EC 92 29 New York University Department of

    Economics Stern School of Business 1992 Greene W Econometric Analysis 2nd ed Englewood Cliffs N J Prentice Hall 1993 Greene W Accounting for Excess Zeros and Sample Selection in Poisson and Negative Binomial Regression Models Working Paper No EC 94 10 Department of Economics Stern School of Business New York University 1994 Greene W LIMDEP Version 7 0 User s Manual Bellport N Y Econometric Software 1995a pp 234 241 Greene W Count Data Manuscript Department of Economics Stern School of Business New York University 1995b Greene W Sample Selection in the Poisson Regression Model Working Paper No EC 95 6 Department of Economics Stern School of Business New York University 1995c Greene W Models for Count Data Manuscript Department of Economics Stern School of Business NYU 1996a Greene W Marginal Effects in the Bivariate Probit Model Working Paper No 96 11 Department of Economics Stern School of Business New York University 1996b Greene W FIML Estimation of Sample Selection Models for Count Data Working Paper No 97 02 Department of Economics Stern School of Business New York University 1997a Greene W Frontier Production Functions In M Pesaran and P Schmidt Handbook of Applied Econometrics Volume II Microeconomics London Blackwell Publishers 1997b Greene W Gender Economics Courses in Liberal Arts Colleges Further Results Journal of Economic Education 29 4 1998 pp 291 300 Greene W Marginal Effects in the Censored Regression Model Economics Letters 64 1 1999 pp 43 50 Greene W Fixed and Random Effects in Nonlinear Models Working Paper EC01 01 Stern School of Business Department of Economics 2001

    Greene W Convenient Estimators for Binary Choice Models with Panel Data Working Paper EC 02 05 Department of Economics Stern School of Business NYU 2002 Greene W and D Hensher Multinomial Logit and Discrete Choice Models In Greene W LIMDEP Version 7 0 User s Manual Revised Plainview N Y Econometric Software Inc 1997 Greene W and T Seaks The Restricted Least Squares Estimator A Pedagogical Note Review of Economics and Statistics 73 1991 pp 563 567 Greenstadt J On the Relative Efficiencies of Gradient Methods Mathematics of Computation 1967 pp 360 367 Griffiths W C Hill and G Judge Learning and Practicing Econometrics John Wiley and Sons New York 1993 Griliches Z Distributed Lags A Survey Econometrica 35 1967 pp 16 49 Griliches Z Economic Data Issues In Z Griliches and M Intriligator eds Handbook of Econometrics Vol 3 Amsterdam North Holland 1986 Griliches Z and P Rao Small Sample Properties of Several Two Stage Regression Methods in the Context of Autocorrelated Errors Journal of the American Statistical Association 64 1969 pp 253 272 Grogger J and R Carson Models for Truncated Counts Journal of Applied Econometrics 6 1991 pp 225 238 Gronau R Wage Comparisons A Selectivity Bias Journal of Political Economy 82 1974 pp 1119 1149 Grunfeld Y The Determinants of Corporate Investment Unpublished Ph D thesis Department of Economics University of Chicago 1958 Grunfeld Y and Z Griliches Is Aggregation Necessarily Bad Review of Economics and Statistics 42 1960 pp 1 13 Guilkey D Alternative Tests for a First Order Vector Autoregressive Error Specification Journal of Econometrics 2 1974 pp 95 104

    Guilkey D K Lovell and R Sickles A Comparison of the Performance of Three Flexible Functional Forms International Economic Review 24 1983 pp 591 616 Guilkey D and P Schmidt Estimation of Seemingly Unrelated Regressions with Vector Autoregressive Errors Journal of the American Statistical Association 1973 pp 642 647 Gujarati D Basic Econometrics 3rd ed New York McGraw Hill 1995 Gurmu S Tests for Detecting Overdispersion in the Positive Poisson Regression Model Journal of Business and Economic Statistics 9 1991 pp 215 222 Gurmu S P Rilstone and S Stern Semiparametric Estimation of Count Regression Models Journal of Econometrics 88 1 1999 pp 123 150 Gurmu S and P Trivedi Recent Developments in Models of Event Counts A Survey Manuscript Department of Economics Indiana University 1994 Haavelmo T The Statistical Implications of a System of Simultaneous Equations Econometrica 11 1943 pp 1 12 Haitovsky Y Missing Data in Regression Analysis Journal of the Royal Statistical Society Series B 1968 pp 67 82 Hajivassiliou V Smooth Simulation Estimation of Panel Data LDV Models Department of Economics Yale University 1990 Hall A and A Sen Structural Stability Testing in Models Estimated by Generalized Method of Moments Journal of Business and Economic Statistics 17 3 1999 pp 335 348 Hall B TSP Version 4 0 Reference Manual Stanford Calif TSP International 1982 Hall B Software for the Computation of Tobit Model Estimates Journal of Econometrics 24 1984 pp 215 222 Hall R Stochastic Implications of the Life Cycle Permanent Income Hypothesis Theory and Evidence Journal of Political Economy 86 6 1978 pp 971 987 Hamilton J Time Series Analysis Princeton Princeton University Press 1994

    Hansen B Testing for Parameter Instability in Linear Models Journal of Policy Modeling 14 1992 pp 517 533 Hansen B Approximate Asymptotic P Values for Structural Change Tests Journal of Business and Economic Statistics 15 1 1997 pp 60 67 Hansen B Testing for Structural Change in Conditional Models Journal of Econometrics 97 2000 pp 93 115 Hansen L Large Sample Properties of Generalized Method of Moments Estimators Econometrica 50 1982 pp 1029 1054 Hansen L J Heaton and A Yaron Finite Sample Properties of Some Alternative GMM Estimators Journal of Business and Economic Statistics 14 3 1996 pp 262 280 Hansen L and K Singleton Generalized Instrumental Variable Estimation of Nonlinear Rational Expectations Models Econometrica 50 1982 pp 1269 1286 Hansen L and K Singleton Efficient Estimation of Asset Pricing Models with Moving Average Errors Manuscript Department of Economics Carnegie Mellon University 1988 Hardle W Applied Nonparametric Regression New York Cambridge University Press 1990 Hardle W and C Manski ed Nonparametric and Semiparametric Approaches to Discrete Response Analysis Journal of Econometrics 58 1993 pp 1 274 Harvey A Estimating Regression Models with Multiplicative Heteroscedasticity Econometrica 44 1976 pp 461 465 Harvey A Forecasting Structural Time Series Models and the Kalman Filter New York Cambridge University Press 1989 Harvey A The Econometric Analysis of Time Series 2nd ed Cambridge MIT Press 1990 Harvey A and G Phillips A Comparison of the Power of Some Tests for Heteroscedasticity in the General Linear

    Model Journal of Econometrics 2 1974 pp 307 316 Hashimoto N and K Ohtani An Exact Test for Linear Restrictions in Seemingly Unrelated Regressions with the Same Regressors Economics Letters 32 1990 pp 243 246 Hatanaka M An Efficient Estimator for the Dynamic Adjustment Model with Autocorrelated Errors Journal of Econometrics 2 1974 pp 199 220 Hatanaka M Several Efficient Two Step Estimators for the Dynamic Simultaneous Equations Model with Autoregressive Disturbances Journal of Econometrics 4 1976 pp 189 204 Hatanaka M Time Series Based Econometrics New York Oxford University Press 1996 Hausman J An Instrumental Variable Approach to Full Information Estimators for Linear and Certain Nonlinear Models Econometrica 43 1975 pp 727 738 Hausman J Specification Tests in Econometrics Econometrica 46 1978 pp 1251 1271 Hausman J Specification and Estimation of Simultaneous Equations Models In Z Griliches and M Intriligator eds Handbook of Econometrics Amsterdam North Holland 1983 Hausman J B Hall and Z Griliches Economic Models for Count Data with an Application to the Patents R D Relationship Econometrica 52 1984 pp 909 938 Hausman J and A Han Flexible Parametric Estimation of Duration and Competing Risk Models Journal of Applied Econometrics 5 1990 pp 1 28 Hausman J and D McFadden A Specification Test for the Multinomial Logit Model Econometrica 52 1984 pp 1219 1240 Hausman J and P Ruud Specifying and Testing Econometric Models for Rank Ordered Data with an Application to the Demand for Mobile and Portable Telephones Working Paper No 8605

    University of California Berkeley Department of Economics 1986 Hausman J and W Taylor Panel Data and Unobservable Individual Effects Econometrica 49 1981 pp 1377 1398 Hausman J and D Wise Social Experimentation Truncated Distributions and Efficient Estimation Econometrica 45 1977 pp 919 938 Hausman J and D Wise A Conditional Probit Model for Qualitative Choice Discrete Decisions Recognizing Interdependence and Heterogeneous Preferences Econometrica 46 1978 pp 403 426 Hayashi F Econometrics Princeton Princeton University Press 2000 Heckman J The Common Structure of Statistical Models of Truncation Sample Selection and Limited Dependent Variables and a Simple Estimator for Such Models Annals of Economic and Social Measurement 5 1976 pp 475 492 Heckman J Simple Statistical Models for Discrete Panel Data Developed and Applied to the Hypothesis of True State Dependence against the Hypothesis of Spurious State Dependence Annales de l INSEE 30 1978 pp 227 269 Heckman J Sample Selection Bias as a Specification Error Econometrica 47 1979 pp 153 161 Heckman J Statistical Models for Discrete Panel Data In Structural Analysis of Discrete Data with Econometric Applications ed C Manski and D McFadden Cambridge MIT Press 1981a Heckman J Heterogeneity and State Dependence In Studies of Labor Markets by S Rosen ed NBER University of Chicago Press Chicago 1981b Heckman J Heterogeneity and State Dependence In S Rosen ed Studies in Labor Markets University of Chicago Press Chicago 1981c Heckman J Varieties of Selection Bias American Economic Review 80 1990 pp 313 318 Heckman J and T MaCurdy A Life Cycle Model of Female Labor Supply Review

    of Economic Studies 47 1980 pp 247 283 Heckman J and T MaCurdy A Simultaneous Equations Linear Probability Model Canadian Journal of Economics 18 1985 pp 28 37 Heckman J and B Singer Econometric Duration Analysis Journal of Econometrics 24 1984a pp 63 132 Heckman J and B Singer A Method for Minimizing the Impact of Distributional Assumptions in Econometric Models for Duration Data Econometrica 52 1984b pp 271 320 Heckman J and J Snyder Linear Probability Models of the Demand for Attributes with an Empirical Application to Estimating the Preferences of Legislators Rand Journal of Economics 28 0 1997 Heckman J and R Willis Estimation of a Stochastic Model of Reproduction An Econometric Approach In N Terleckyj ed Household Production and Consumption New York National Bureau of Economic Research 1976 Heilbron D Generalized Linear Models for Altered Zero Probabilities and Overdispersion in Count Data Technical Report Department of Epidemiology and Biostatistics University of California San Francisco 1989 Hendry D Econometrics Alchemy or Science Economica 47 1980 pp 387 406 Hendry D Monte Carlo Experimentation in Econometrics In Z Griliches and M Intriligator eds Handbook of Econometrics Vol 2 Amsterdam North Holland 1984 Hendry D Econometrics Alchemy or Science Oxford Blackwell Publishers 1993 Hendry D Dynamic Econometrics Oxford Oxford University Press 1995 Hendry D and Ericsson An Econometric Analysis of UK Money Demand in M Friedman and A A Schwartz eds American Economic Review 81 1991 pp 8 38

    Hendry D and J Doornik PC Give 8 London International Thomson Publishers 1986 Hendry D A Pagan and D Sargan Dynamic Specification In Intriligator M and Griliches Z eds Handbook of Econometrics Vol 2 Amsterdam North Holland 1984 Hensher D Simultaneous Estimation of Hierarchical Logit Mode Choice Models Working Paper No 24 Macquarie University School of Economic and Financial Studies 1986 Hensher D ed Travel Behavior Research The Leading Edge Pergamon Press Amsterdam 2001 Hensher D Louviere J and J Swait Stated Choice Methods Analysis and Applications Cambridge University Press Cambridge 2000 Hildebrand G and T Liu Manufacturing Production Functions in the United States Ithaca N Y Cornell University Press 1957 Hildreth C and C Houck Some Estimators for a Linear Model with Random Coefficients Journal of the American Statistical Association 63 1968 pp 584 595 Hildreth C and J Lu Demand Relations with Autocorrelated Disturbances Technical Bulletin No 276 Michigan State University Agricultural Experiment Station 1960 Hill C and L Adkins Collinearity in B Baltagi ed A Companion to Theoretical Econometrics Oxford Blackwell 2001 Hite S Women and Love New York Alfred A Knopf 1987 Holt M Autocorrelation Specification in Singular Equation Systems A Further Look Economics Letters 58 1998 pp 135 141 Holtz Eakin D Testing for Individual Effects in Autoregressive Models Journal of Econometrics 39 1988 pp 297 307 Holtz Eakin D W Newey and H Rosen Estimating Vector Autoregressions with Panel Data Econometrica 56 6 1988 pp 1371 1395

    Honore B and T Kyriazidou Estimation of a Panel Data Sample Selection Model Econometrica 65 6 1997 pp 1335 1364 Honore B and T Kyriazidou Panel Data Discrete Choice Models with Lagged Dependent Variables Econometrica 68 4 2000 pp 839 874 Horn D A Horn and G Duncan Estimating Heteroscedastic Variances in Linear Models Journal of the American Statistical Association 70 1975 pp 380 385 Horowitz J A Smoothed Maximum Score Estimator for the Binary Response Model Econometrica 60 1992 pp 505 531 Horowitz J Semiparametric Estimation of a Work Trip Mode Choice Model Journal of Econometrics 58 1993 pp 49 70 Horowitz J and G Neumann Specification Testing in Censored Regression Models Journal of Applied Econometrics 4 S 1989 pp S35 S60 Hosking J Fractional Differencing Biometrika 68 1981 pp 165 176 Hsiao C Some Estimation Methods for a Random Coefficient Model Econometrica 43 1975 pp 305 325 Hsiao C Identification In Z Griliches and M Intriligator eds Handbook of Econometrics Vol 1 Amsterdam North Holland 1983 Hsiao C Analysis of Panel Data Cambridge University Press Cambridge 1986 Hsiao C Analysis of Panel Data New York Cambridge University Press 1986 Hsiao C Logit and Probit Models In L Matyas and P Sevestre eds The Econometrics of Panel Data Handbook of Theory and Applications Dordrecht Netherlands Kluwer Nijhoff 1992 Hsiao C K Lahiri L Lee and H Pesaran Analysis of Panels and Limited Dependent Variable Models New York Cambridge University Press 1999 Huber P The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions In Proceedings of the Fifth Berkeley Symposium in Mathematical

    Statistics Vol 1 Berkeley University of California Press 1967 Hurd M Estimation in Truncated Samples When There Is Heteroscedasticity Journal of Econometrics 11 1979 pp 247 258 Hurst H Long Term Storage Capacity of Reservoirs Transactions of the American Society of Civil Engineers 116 1951 pp 519 543 Hyslop D State Dependence Serial Correlation and Heterogeneity in Labor Force Participation of Married Women Econometrica 67 6 1999 pp 1255 1294 Hwang H Estimation of a Linear SUR Model with Unequal Numbers of Observations Review of Economics and Statistics 72 1990 pp 510 515 Im E Unequal Numbers of Observations and Partial Efficiency Gain Economics Letters 46 1994 pp 291 294 Imbens G and D Hyslop Bias from Classical and Other Forms of Measurement Error Journal of Business and Economic Statistics 19 2001 pp 141 149 Imhof J Computing the Distribution of Quadratic Forms in Normal Variables Biometrika 48 1980 pp 419 426 Inkmann J Misspecified Heteroscedasticity in the Panel Probit Model A Small Sample Comparison of GMM and SML Estimators Journal of Econometrics 97 2 2000 pp 227 259 Jain D N Vilcassim and P Chintagunta A Random Coefficients Logit Brand Choice Model Applied to Panel Data Journal of Business and Economic Statistics 12 3 1994 pp 317 328 Jakubson G The Sensitivity of Labor Supply Parameters to Unobserved Individual Effects Fixed and Random Effects Estimates in a Nonlinear Model Using Panel Data Journal of Labor Economics 6 1988 pp 302 329 Jarque C An Application of LDV Models to Household Expenditure Analysis in Mexico Journal of Econometrics 36 1987 pp 31 54

    Jayatissa W Tests of Equality Between Sets of Coefficients in Two Linear Regressions When Disturbance Variances are Unequal Econometrica 45 1977 pp 1291 1292 Jennrich R I The Asymptotic Properties of Nonlinear Least Squares Estimators Annals of Statistics 2 1969 pp 633 643 Jensen M A Monte Carlo Study on Two Methods of Calculating the MLE s Covariance Matrix in a Seemingly Unrelated Nonlinear Regression Econometric Reviews 14 1995 pp 315 330 Jobson J and W Fuller Least Squares Estimation When the Covariance Matrix and Parameter Vector are Functionally Related Journal of the American Statistical Association 75 1980 pp 176 181 Johansen S Estimation and Hypothesis Testing of Cointegrated Vectors in Gaussian VAR Models Econometrica 59 6 1991 pp 1551 1580 Johansen S A Representation of Vector Autoregressive Processes of Order 2 Econometric Theory 8 1992 pp 188 202 Johansen S Statistical Analysis of Cointegration Vectors Journal of Economic Dynamics and Control 12 1988 pp 231 254 Johansen S and K Juselius Maximum Likelihood Estimation and Inference on Cointegration with Applications for the Demand for Money Oxford Bulletin of Economics and Statistics 52 1990 pp 169 210 Johnson N S Kotz and A Kemp Distributions in Statistics Univariate Discrete Distributions 2nd ed New York John Wiley and Sons 1993 Johnson N S Kotz and A Balakrishnan Distributions in Statistics Continuous Univariate Distributions Vol 1 2nd ed New York John Wiley and Sons 1994 Johnson N S Kotz and N Balakrishnan Distributions in Statistics Continuous Univariate Distributions Vol 2 2nd ed New York John Wiley and Sons 1995 Johnson N and S Kotz Distributions in Statistics Continuous Multivariate

    Distributions New York John Wiley and Sons 1974 Johnson N S Kotz and N Balakrishnan Distributions in Statistics Discrete Multivariate Distributions New York John Wiley and Sons 1997 Johnson R and D Wichern Applied Multivariate Statistical Analysis 4th ed Englewood Cliffs N J Prentice Hall 1999 Johnston J Econometric Methods New York McGraw Hill 1984 Johnston J and J DiNardo Econometric Methods 4th ed New York McGraw Hill 1997 Jondrow J K Lovell I Materov and P Schmidt On the Estimation of Technical Inefficiency in the Stochastic Frontier Production Function Model Journal of Econometrics 19 1982 pp 233 238 Jones J and J Landwehr Removing Heterogeneity Bias from Logit Model Estimation Marketing Science 7 1 1988 pp 41 59 Joreskog K A General Method for Estimating a Linear Structural Equation System In A Goldberger and O Duncan Structural Equation Models in the Social Sciences New York Academic Press 1973 Joreskog K and G Gruvaeus A Computer Program for Minimizing a Function of Several Variables Educational Testing Services Research Bulletin No 70 14 1970 Joreskog K and D Sorbom LISREL V User s Guide Chicago National Educational Resources 1981 Jorgenson D Rational Distributed Lag Functions Econometrica 34 1966 pp 135 149 Jorgenson D Econometric Methods for Modeling Producer Behavior In Z Griliches and M Intriligator Handbook of Econometrics Vol 3 Amsterdam North Holland 1983 Judd K Numerical Methods in Economics Cambridge MIT Press 1998 Judge G W Griffiths C Hill and T Lee The Theory and Practice of Econometrics New York John Wiley and Sons 1985

    Judge G C Hill W Griffiths T Lee and H Lutkepohl An Introduction to the Theory and Practice of Econometrics New York John Wiley and Sons 1982 Kalbfleisch J and R Prentice The Statistical Analysis of Failure Time Data New York John Wiley and Sons 1980 Kamlich R and S Polachek Discrimination Fact or Fiction An Examination Using an Alternative Approach Southern Economic Journal October 1982 pp 450 461 Kaplan E and P Meier Nonparametric Estimation from Incomplete Observations Journal of the American Statistical Association 53 1958 pp 457 481 Kay R and S Little Assessing the Fit of the Logistic Model A Case Study of Children with Haemolytic Uraemic Syndrome Applied Statistics 35 1986 pp 16 30 Keane M Simulation Estimators for Panel Data Models with Limited Dependent Variables in G Maddala and C Rao eds Handbook of Statistics Volume 11 Chapter 20 Amsterdam North Holland 1993 Keane M A Computationally Practical Simulation Estimator for Panel Data Econometrica 62 1 1994 pp 95 116 Kelejian H Two Stage Least Squares and Econometric Systems Linear in Parameters but Nonlinear in the Endogenous Variables Journal of the American Statistical Association 66 1971 pp 373 374 Kelly J Linear Cross Equation Constraints and the Identification Problem Econometrica 43 1975 pp 125 140 Kennan J The Duration of Contract Strikes in U S Manufacturing Journal of Econometrics 28 1985 pp 5 28 Kennedy W and J Gentle Statistical Computing New York Marcel Dekker 1980 Keuzenkamp H and J Magnus The Significance of Testing in Econometrics Journal of Econometrics 67 1 1995 pp 1 257 Keynes J The General Theory of Employment Interest and Money New York Harcourt Brace and Jovanovich 1936

    Kiefer N Testing for Independence in Multivariate Probit Models Biometrika 69 1982 pp 161 166 Kiefer N and Salmon M Testing Normality in Econometric Models Economics Letters 11 1983 pp 123 127 Kiefer N ed Econometric Analysis of Duration Data Journal of Econometrics 28 1 1985 pp 1 169 Kiefer N Economic Duration Data and Hazard Functions Journal of Economic Literature 26 1988 pp 646 679 Killian L Small Sample Confidence Intervals for Impulse Response Functions The Review of Economics and Statistics 80 2 1998 pp 218 230 Kim H and J Pollard Cube Root Asymptotics Annals of Statistics March 1990 pp 191 219 Kiviet J On Bias Inconsistency and Efficiency of Some Estimators in Dynamic Panel Data Models Journal of Econometrics 68 1 1995 pp 63 78 Kiviet J G Phillips and B Schipp The Bias of OLS GLS and ZEF Estimators in Dynamic SUR Models Journal of Econometrics 69 1995 pp 241 266 Klein L Economic Fluctuations in the United States 1921 1941 New York John Wiley and Sons 1950 Klein R and R Spady An Efficient Semiparametric Estimator for Discrete Choice Models Econometrica 61 1993 pp 387 421 Klepper S and E Leamer Consistent Sets of Estimates for Regressions with Errors in All Variables Econometrica 52 1983 pp 163 184 Kmenta J On Estimation of the CES Production Function International Economic Review 8 1967 pp 180 189 Kmenta J Elements of Econometrics New York Macmillan 1986 Kmenta J and R Gilbert Small Sample Properties of Alternative Estimators of Seemingly Unrelated Regressions Journal of the American Statistical Association 63 1968 pp 1180 1200

    Knapp L and T Seaks An Analysis of the Probability of Default on Federally Guaranteed Student Loans Review of Economics and Statistics 74 1992 pp 404 411 Knight F The Economic Organization New York Harper and Row 1933 Kobayashi M A Bounds Test of Equality Between Sets of Coefficients in Two Linear Regressions When Disturbance Variances Are Unequal Journal of the American Statistical Association 81 1986 pp 510 514 Koenker R A Note on Studentizing a Test for Heteroscedasticity Journal of Econometrics 17 1981 pp 107 112 Koenker R and G Bassett Regression Quantiles Econometrica 46 1978 pp 107 112 Koenker R and G Bassett Robust Tests for Heteroscedasticity Based on Regression Quantiles Econometrica 50 1982 pp 43 61 Krailo M and M Pike Conditional Multivariate Logistic Analysis of Stratified Case Control Studies Applied Statistics 44 1 1984 pp 95 103 Kreuger A Economic Scene New York Times April 27 2000 P C2 Kreuger A and S Dale Estimating the Payoff to Attending a More Selective College NBER Cambridge Working Paper 7322 1999 Kumbhakar S and A Heshmati Technical Change and Total Factor Productivity Growth in Swedish Manufacturing Industries Econometric Reviews 15 1996 pp 275 298 Kumbhakar S and K Lovell Stochastic Frontier Analysis New York Cambridge University Press 2000 Kyriazidou E Estimation of a Panel Data Sample Selection Model Econometrica 65 1997 pp 1335 1364 Lambert D Zero Inflated Poisson Regression with an Application to Defects in Manufacturing Technometrics 34 1 1992 pp 1 14

    Lancaster T The Incidental Parameters Problem since 1948 Journal of Econometrics 95 2 2000 pp 391 414 Lancaster T The Analysis of Transition Data New York Cambridge University Press 1990 Landers A Survey Chicago Tribune 1984 passim Lawless J Statistical Models and Methods for Lifetime Data New York John Wiley and Sons 1982 Leamer E Specification Searches Ad Hoc Inferences with Nonexperimental Data New York John Wiley and Sons 1978 L Ecuyer P Good Parameters and Implementations for Combined Multiple Recursive Random Number Generators Department of Information Science University of Montreal working paper 1998 LeCam L On Some Asymptotic Properties of Maximum Likelihood Estimators and Related Bayes Estimators University of California Publications in Statistics 1 1953 pp 277 330 Lee L Estimation of Error Components Models with ARMA p q Time Component An Exact GLS Approach Number 78 104 University of Minnesota Center for Economic Research 1978 Lee L Specification Tests for Poisson Regression Models International Economic Review 27 1986 pp 689 706 Lee M Method of Moments and Semiparametric Econometrics for Limited Dependent Variables New York Springer Verlag 1996 Lee M Method of Moments and Semiparametric Econometrics for Limited Dependent Variable Models Heidelberg Springer Verlag 1996 Lee M Limited Dependent Variable Models New York Cambridge University Press 1998 Leff N Dependency Rates and Savings Rates American Economic Review 59 5 1969 pp 886 896

    Lerman R and C Manski On the Use of Simulated Frequencies to Approximate Choice Probabilities In C Manski and D McFadden eds Structural Analysis of Discrete Data with Econometric Applications Cambridge MIT Press 1981 Levi M Errors in the Variables in the Presence of Correctly Measured Variables Econometrica 41 1973 pp 985 986 Lewbel A Semiparametric Qualitative Response Model Estimation with Unknown Heteroscedasticity or Instrumental Variables Journal of Econometrics 97 1 2000 pp 145 177 Lewbel A Semiparametric Estimation of Location and Other Discrete Choice Moments Econometric Theory 14 1997 pp 32 51 Lewbel A and B Honore Semiparametric Binary Choice Panel Data Models Without Strictly Exogenous Regressors Econometrica 2001 Forthcoming Lewis H Comments on Selectivity Biases in Wage Comparisons Journal of Political Economy 82 1974 pp 1149 1155 Li W S Ling and M McAleer A Survey of Recent Theoretical Results for Time Series Models with GARCH Errors Manuscript Institute for Social and Economic Research Osaka University Osaka 2001 Liang K and S Zeger Longitudinal Data Analysis Using Generalized Linear Models Biometrika 73 1986 pp 13 22 Lillard L and R Willis Dynamic Aspects of Earning Mobility Econometrica 46 1978 pp 985 1012 Lintner J Security Prices Risk and Maximal Gains from Diversification Journal of Finance 20 1965 pp 587 615 Litterman R Techniques of Forecasting Using Vector Autoregressions Working Paper No 15 Federal Reserve Bank of Minneapolis 1979 Litterman R Forecasting with Bayesian Vector Autoregressions Five Years of Experience Journal of Business and Economic Statistics 4 1986 pp 25 38

    Liu T Underidentification Structural Estimation and Forecasting Econometrica 28 1960 pp 855 865 Ljung G and G Box On a Measure of Lack of Fit in Time Series Models Biometrika 66 1979 pp 265 270 Lo A Long Term Memory in Stock Market Prices Econometrica 59 1991 pp 1297 1313 Loeve M Probability Theory New York Springer Verlag 1977 Long S Regression Models for Categorical and Limited Dependent Variables Thousand Oaks Calif Sage Publications 1997 Longley J An Appraisal of Least Squares Programs from the Point of the User Journal of the American Statistical Association 62 1967 pp 819 841 Louviere J D Hensher and J Swait Stated Choice Methods Analysis and Applications Cambridge Cambridge University Press 2000 Lucas R Econometric Policy Evaluation A Critique In K Brunner and A Meltzer eds The Phillips Curve and the Labor Market Amsterdam North Holland 1976 Lucas R Money Demand in the United States A Quantitative Review Carnegie Rochester Conference Series on Public Policy 29 1988 pp 137 168 Lutkepohl H Introduction to Multiple Time Series Analysis New York Marcel Dekker 1993 MacDonald G and H White Some Large Sample Tests for Nonnormality in the Linear Regression Model Journal of the American Statistical Association 75 1980 pp 16 27 MacKinnon J and H White Some Heteroscedasticity Consistent Covariance Matrix Estimators with Improved Finite Sample Properties Journal of Econometrics 19 1985 pp 305 325 MacKinnon J H White and R Davidson Tests for Model Specification in the Presence of Alternative Hypotheses Some Further Results Journal of Econometrics 21 1983 pp 53 70

    Maddala G The Use of Variance Components Models in Pooling Cross Section and Time Series Data Econometrica 39 1971 pp 341 358 Maddala G Econometrics New York McGraw Hill 1977a Maddala G Limited Dependent Variable Models Using Panel Data Journal of Human Resources 22 1977b pp 307 338 Maddala G Limited Dependent and Qualitative Variables in Econometrics New York Cambridge University Press 1983 Maddala G Disequilibrium Self Selection and Switching Models In Z Griliches and M Intriligator eds Handbook of Econometrics Vol 3 Amsterdam North Holland 1984 Maddala G Limited Dependent Variable Models Using Panel Data Journal of Human Resources 22 1987 pp 307 338 Maddala G Introduction to Econometrics 2nd Ed Macmillan New York 1992 Maddala G The Econometrics of Panel Data Vols I and II Brookfield Vt E E Elgar 1993 Maddala G and A Flores Lagunes Qualitative Response Models in B Baltagi ed A Companion to Theoretical Econometrics Oxford Blackwell 2001 Maddala G and I Kim Unit Roots Cointegration and Structural Change Cambridge Cambridge University Press 1998 Maddala G and T Mount A Comparative Study of Alternative Estimators for Variance Components Models Journal of the American Statistical Association 68 1973 pp 324 328 Maddala G and F Nelson Specification Errors in Limited Dependent Variable Models Working Paper 96 National Bureau of Economic Research Cambridge Mass 1975 Magnac T State Dependence and Heterogeneity in Youth Unemployment Histories Working Paper INRA and CREST Paris 1997 Magnus J and H Neudecker Matrix Differential Calculus with Applications in

    Statistics and Econometrics New York John Wiley and Sons 1988 Malinvaud E Statistical Methods of Econometrics Amsterdam North Holland 1970 Mandy D and C Martins Filho Seemingly Unrelated Regressions Under Additive Heteroscedasticity Theory and Share Equation Applications Journal of Econometrics 58 1993 pp 315 346 Mann H and A Wald On the Statistical Treatment of Linear Stochastic Difference Equations Econometrica 11 1943 pp 173 220 Manski C The Maximum Score Estimator of the Stochastic Utility Model of Choice Journal of Econometrics 3 1975 pp 205 228 Manski C Semiparametric Analysis of Discrete Response Asymptotic Properties of the Maximum Score Estimator Journal of Econometrics 27 1985 pp 313 333 Manski C Operational Characteristics of the Maximum Score Estimator Journal of Econometrics 32 1986 pp 85 100 Manski C Semiparametric Analysis of the Random Effects Linear Model from Binary Response Data Econometrica 55 1987 pp 357 362 Manski C Anatomy of the Selection Problem Journal of Human Resources 24 1989 pp 343 360 Manski C Nonparametric Bounds on Treatment Effects American Economic Review 80 1990 pp 319 323 Manski C Analog Estimation Methods in Econometrics London Chapman and Hall 1992 Manski C Identification Problems in the Social Sciences Cambridge Harvard University Press 1995 Manski C and S Lerman The Estimation of Choice Probabilities from Choice Based Samples Econometrica 45 1977 pp 1977 1988 Manski C and S Thompson MSCORE A Program for Maximum Score Estimation of Linear Quantile Regressions from Binary Response Data Mimeo University

    of Wisconsin Madison Department of Economics 1986 Marcus A and W Greene The Determinants of Rating Assignment and Performance Working Paper CRC528 Center for Naval Analyses 1985 Mariano R Analytical Small Sample Distribution Theory in Econometrics The Simultaneous Equations Case International Economic Review 23 1982 pp 503 534 Mariano R Simultaneous Equation Model Estimators Statistical Properties in B Baltagi ed A Companion to Theoretical Econometrics Oxford Blackwell 2001 Markowitz H Portfolio Selection Efficient Diversification of Investments New York John Wiley and Sons 1959 Marsaglia G and T Bray A Convenient Method of Generating Normal Variables SIAM Review 6 1964 pp 260 264 Martins M Parametric and Semiparametric Estimation of Sample Selection Models An Empirical Application to the Female Labour Force in Portugal Journal of Applied Econometrics 16 1 2001 pp 23 40 Matyas Laszlo Generalized Method of Moments Estimation Cambridge Cambridge University Press 1999 Matyas L and P Sevestre eds The Econometrics of Panel Data Handbook of Theory and Applications 2nd ed Dordrecht Kluwer Nijhoff 1996 Matzkin R Nonparametric Identification and Estimation of Polytomous Choice Models Journal of Econometrics 58 1993 pp 137 168 Mazodier P and A Trognon Heteroscedasticity and Stratification in Error Components Models Annales de l Insee 30 1978 pp 451 482 McAleer M The Significance of Testing Empirical Non Nested Models Journal of Econometrics 67 1995 pp 149 171 McAleer M G Fisher and P Volker Separate Misspecified Regressions and the U S Long Run Demand for Money

    Function Review of Economics and Statistics 64 1982 pp 572 583 McCallum B Relative Asymptotic Bias from Errors of Omission and Measurement Econometrica 40 1972 pp 757 758 McCallum B A Note Concerning Covariance Expressions Econometrica 42 1973 pp 581 583 McCullagh P and J Nelder Generalized Linear Models New York Chapman and Hall 1983 McCullough B Consistent Forecast Intervals When the Forecast Period Exogenous Variables Are Stochastic Journal of Forecasting 15 1996 pp 293 304 McCullough B Econometric Software Reliability E Views LIMDEP SHAZAM and TSP Journal of Applied Econometrics 14 2 1999 pp 191 202 McCullough B and C Renfro Benchmarks and Software Standards A Case Study of GARCH Procedures Journal of Economic and Social Measurement 25 2 1999 pp 27 37 McCullough B and H Vinod The Numerical Reliability of Econometric Software Journal of Economic Literature 1999 forthcoming McDonald J and R Moffitt The Uses of Tobit Analysis Review of Economics and Statistics 62 1980 pp 318 321 McElroy M Goodness of Fit for Seemingly Unrelated Regressions Glahn s R2 x and y Hooper s r 2 Journal of Econometrics 6 1977 pp 381 387 McFadden D Conditional Logit Analysis of Qualitative Choice Behavior In P Zarembka ed Frontiers in Econometrics New York Academic Press 1973 McFadden D The Measurement of Urban Travel Demand Journal of Public Economics 3 1974 pp 303 328 McFadden D Econometric Analysis of Qualitative Response Models In Z Griliches and M Intriligator eds Handbook of Econometrics Vol 2 Amsterdam North Holland 1984

    McFadden D Regression Based Specification Tests for the Multinomial Logit Model Journal of Econometrics 34 1987 pp 63 82 McFadden D A Method of Simulated Moments for Estimation of Discrete Response Models Without Numerical Integration Econometrica 57 1989 pp 995 1026 McFadden D and K Train Mixed Multinomial Logit Models for Discrete Response Journal of Applied Econometrics 15 2000 pp 447 470 McFadden D and P Ruud Estimation by Simulation Review of Economics and Statistics 76 1994 pp 591 608 McKenzie C Microfit 4 0 Journal of Applied Econometrics 13 1998 pp 77 90 McLaren K Parsimonious Autocorrelation Corrections for Singular Demand Systems Economics Letters 53 1996 pp 115 121 Melenberg B and A van Soest Parametric and Semi Parametric Modelling of Vacation Expenditures Journal of Applied Econometrics 11 1 1996 pp 59 76 Merton R On Estimating the Expected Return on the Market Journal of Financial Economics 8 1980 pp 323 361 Messer K and H White A Note on Computing the Heteroscedasticity Consistent Covariance Matrix Using Instrumental Variable Techniques Oxford Bulletin of Economics and Statistics 46 1984 pp 181 184 Meyer B Semiparametric Estimation of Hazard Models Northwestern University Department of Economics 1988 Mills T Time Series Techniques for Economists New York Cambridge University Press 1990 Mills T The Econometric Modelling of Financial Time Series New York Cambridge University Press 1993 Mittelhammer R G Judge and D Miller Econometric Foundations Cambridge Cambridge University Press 2000

    Mizon G A Note to Autocorrelation Correctors Don t Journal of Econometrics 69 1 1995 pp 267 288 Mizon G and J Richard The Encompassing Principle and its Application to Testing Nonnested Models Econometrica 54 1986 pp 657 678 Moshino G and D Moro Autocorrelation Specification in Singular Equation Systems Economics Letters 46 1994 pp 303 309 Mroz T The Sensitivity of an Empirical Model of Married Women s Hours of Work to Economic and Statistical Assumptions Econometrica 55 1987 pp 765 799 Mullahy J Specification and Testing of Some Modified Count Data Models Journal of Econometrics 33 1986 pp 341 365 Mullahy J Weighted Least Squares Estimation of the Linear Probability Model Revisited Economics Letters 32 1990 pp 35 41 Mundlak Y On the Pooling of Time Series and Cross Sectional Data Econometrica 56 1978 pp 69 86 Murphy K and R Topel Estimation and Inference in Two Step Econometric Models Journal of Business and Economic Statistics 3 1985 pp 370 379 Nagin D and K Land Age Criminal Careers and Population Heterogeneity Specification and Estimation of a Nonparametric Mixed Poisson Model Criminology 31 3 1993 pp 327 362 Nakamura A and M Nakamura On the Relationships Among Several Specification Error Tests Presented by Durbin Wu and Hausman Econometrica 49 1981 pp 1583 1588 Nakamura A and M Nakamura Part Time and Full Time Work Behavior of Married Women A Model with a Doubly Truncated Dependent Variable Canadian Journal of Economics 1983 pp 229 257 Nakosteen R and M Zimmer Migration and Income The Question of Self

    Selection Southern Economic Journal 46 1980 pp 840 851 Nelder J and R Mead A Simplex Method for Function Minimization Computer Journal 7 1965 pp 308 313 Nelson C and H Kang Pitfalls in the Use of Time as an Explanatory Variable in Regression Journal of Business and Economic Statistics 2 1984 pp 73 82 Nelson C and C Plosser Trends and Random Walks in Macroeconomic Time Series Some Evidence and Implications Journal of Monetary Economics 10 1982 pp 139 162 Nelson F A Test for Misspecification in the Censored Normal Model Econometrica 49 1981 pp 1317 1329 Nerlove M Essays in Panel Data Econometrics Cambridge University Press Cambridge 2003 Nerlove M Returns to Scale in Electricity Supply In C Christ ed Measurement in Economics Studies in Mathematical Economics and Econometrics in Memory of Yehuda Grunfeld Stanford Calif Stanford University Press 1963 Nerlove M Further Evidence on the Estimation of Dynamic Relations From a Time Series of Cross Sections Econometrica 39 1971a pp 359 382 Nerlove M A Note on Error Components Models Econometrica 39 1971b pp 383 396 Nerlove M Lags in Economic Behavior Econometrica 40 1972 pp 221 251 Nerlove M and S Press Univariate and Multivariate Log Linear and Logistic Models RAND R1306 EDA NIH Santa Monica 1973 Nerlove M and K Wallis Use of the Durbin Watson Statistic in Inappropriate Situations Econometrica 34 1966 pp 235 238 Newbold P Significance Levels of the Box Pierce Portmanteau Statistic in Finite Samples Biometrika 64 1977 pp 67 71

    Newbold P Testing Causality Using Efficiently Parameterized Vector ARMA Models Applied Mathematics and Computation 20 1986 pp 184 199 Newey W A Method of Moments Interpretation of Sequential Estimators Economics Letters 14 1984 pp 201 206 Newey W Maximum Likelihood Specification Testing and Conditional Moment Tests Econometrica 53 1985a pp 1047 1070 Newey W Generalized Method of Moments Specification Testing Journal of Econometrics 29 1985b pp 229 256 Newey W Specification Tests for Distributional Assumptions in the Tobit Model Journal of Econometrics 34 1986 pp 125 146 Newey W The Asymptotic Variance of Semiparametric Estimators Econometrica 62 1994 pp 1349 1382 Newey W and D McFadden Large Sample Estimation and Hypothesis Testing In Engle R and D McFadden eds Handbook of Econometrics Vol IV Chapter 36 1994 Newey W J Powell and J Walker Semiparametric Estimation of Selection Models American Economic Review 80 1990 pp 324 328 Newey W Two Step Series Estimation of Sample Selection Models Department of Economics MIT Manuscript 1991 Newey W and K West A Simple Positive Semi Definite Heteroscedasticity and Autocorrelation Consistent Covariance Matrix Econometrica 55 1987a pp 703 708 Newey W and K West Hypothesis Testing with Efficient Method of Moments Estimation International Economic Review 28 1987b pp 777 787 New York Post America s New Big Wheels of Fortune May 22 1987 p 3 Neyman J and E Scott Consistent Estimates Based on Partially Consistent Observations Econometrica 16 1948 pp 1 32

    Nickell S Biases in Dynamic Models with Fixed Effects Econometrica 49 1981 pp 1417 1426 Oaxaca R Male Female Wage Differentials in Urban Labor Markets International Economic Review 14 1973 pp 693 708 Oberhofer W and J Kmenta A General Procedure for Obtaining Maximum Likelihood Estimates in Generalized Regression Models Econometrica 42 1974 pp 579 590 Ohtani K and M Kobayashi A Bounds Test for Equality Between Sets of Coefficients in 2 Linear Regression Models Under Heteroscedasticity Econometric Theory 2 1986 pp 220 231 Ohtani K and T Toyoda Estimation of Regression Coefficients After a Preliminary Test for Homoscedasticity Journal of Econometrics 12 1980 pp 151 159 Ohtani K and T Toyoda Small Sample Properties of Tests of Equality Between Sets of Coefficients in Two Linear Regressions Under Heteroscedasticity International Economic Review 26 1985 pp 37 44 Olsen R A Note on the Uniqueness of the Maximum Likelihood Estimator in the Tobit Model Econometrica 46 1978 pp 1211 1215 Orcutt G S Caldwell and R Wertheimer Policy Exploration Through Microanalytic Simulation Washington D C Urban Institute 1976 Orme C Double and Triple Length Regressions for the Information Matrix Test and Other Conditional Moment Tests Mimeo University of York U K Department of Economics 1990 Orme C Nonnested Tests for Discrete Choice Models Working Paper Department of Economics University of York 1994 Osterwald Lenum M A Note on Quantiles of the Asymptotic Distribution of the Maximum Likelihood Cointegration Rank Test Statistics Oxford Bulletin of Economics and Statistics 54 1992 pp 461 472

    Pagan A and A Ullah The Econometric Analysis of Models with Risk Terms Journal of Applied Econometrics 3 1988 pp 87 105 Pagan A and A Ullah Nonparametric Econometrics Cambridge Cambridge University Press 1999 Pagan A and F Vella Diagnostic Tests for Models Based on Individual Data A Survey Journal of Applied Econometrics 4 Supplement 1989 pp S29 S59 Pagan A and M Wickens A Survey of Some Recent Econometric Methods Economic Journal 99 1989 pp 962 1025 Pagano M and M Hartley On Fitting Distributed Lag Models Subject to Polynomial Restrictions Journal of Econometrics 16 1981 pp 171 198 Pakes A and D Pollard Simulation and the Asymptotics of Optimization Estimators Econometrica 57 1989 pp 1027 1058 Park R R Sickles and L Simar Semiparametric Efficient Estimation of Panel Data Models with AR 1 Errors Department of Economics Rice University manuscript 2000 Parks R Efficient Estimation of a System of Regression Equations When Disturbances Are Both Serially and Contemporaneously Correlated Journal of the American Statistical Association 62 1967 pp 500 509 Patterson K An Introduction to Applied Econometrics New York St Martin s Press 2000 Pesaran H On the General Problem of Model Selection Review of Economic Studies 41 1974 pp 153 171 Pesaran M The Limits to Rational Expectations Blackwell Oxford 1987 Pesaran H and A Deaton Testing NonNested Nonlinear Regression Models Econometrica 46 1978 pp 677 694 Pesaran M and A Hall Tests of NonNested Linear Regression Models Subject to Linear Restrictions Economics Letters 27 1988 pp 341 348

    Pesaran M and B Pesaran A Simulation Approach to the Problem of Computing Cox s Statistic for Testing Nonnested Models Journal of Econometrics 57 1993 pp 377 392 Pesaran M and P Schmidt Handbook of Applied Econometrics Volume II Microeconomics London Blackwell Publishers 1997 Pesaran H and M Weeks Nonnested Hypothesis Testing An Overview in Baltagi B ed A Companion to Theoretical Econometrics Blackwell Oxford 2001 Petersen T Fitting Parametric Survival Models with Time Dependent Covariates Journal of the Royal Statistical Society Series C Applied Statistics 35 1986 pp 281 288 Petersen D and D Waldman The Treatment of Heteroscedasticity in the Limited Dependent Variable Model Mimeo University of North Carolina Chapel Hill November 1981 Phillips A Stabilization Policies and the Time Form of Lagged Responses Economic Journal 67 1957 pp 265 277 Phillips P Exact Small Sample Theory in the Simultaneous Equations Model In Z Griliches and M Intriligator eds Handbook of Econometrics Vol 1 Amsterdam North Holland 1983 Phillips P Understanding Spurious Regressions Journal of Econometrics 33 1986 pp 311 340 Phillips P Time Series Regressions with a Unit Root Econometrica 55 1987 pp 277 301 Phillips P and S Ouliaris Asymptotic Properties of Residual Based Tests for Cointegration Econometrica 58 1990 pp 165 193 Phillips P and P Perron Testing for a Unit Root in Time Series Regression Biometrika 75 1988 pp 335 346 Poirier D The Econometrics of Structural Change Amsterdam North Holland 1974

    Poirier D The Use of the Box Cox Transformation in Limited Dependent Variable Models Journal of the American Statistical Association 73 1978b pp 284 287 Poirier D Partial Observability in Bivariate Probit Models Journal of Econometrics 12 1980 pp 209 217 Poirier D ed Bayesian Empirical Studies in Economics and Finance Journal of Econometrics 49 1991 pp 1 304 Poirier D Intermediate Statistics and Econometrics Cambridge MIT Press 1995 pp 1 217 Poirier D and A Melino A Note on the Interpretation of Regression Coefficients Within a Class of Truncated Distributions Econometrica 46 1978 pp 1207 1209 Powell J Least Absolute Deviations Estimation For Censored and Truncated Regression Models Technical report 356 Stanford University IMSSS 1981 Powell J Least Absolute Deviations Estimation for the Censored Regression Model Journal of Econometrics 25 1984 pp 303 325 Powell M An Efficient Method for Finding the Minimum of a Function of Several Variables Without Calculating Derivatives Computer Journal 1964 pp 165 172 Prais S and H Houthakker The Analysis of Family Budgets New York Cambridge University Press 1955 Prais S and C Winsten Trend Estimation and Serial Correlation Cowles Commission Discussion Paper No 383 Chicago 1954 Prentice R and L Gloeckler Regression Analysis of Grouped Survival Data with Application to Breast Cancer Data Biometrics 34 1978 pp 57 67 Press W B Flannery S Teukolsky and W Vetterling Numerical Recipes The Art of Scientific Computing Cambridge Cambridge University Press 1986 Quandt R Econometric Disequilibrium Models Econometric Reviews 1 1982 pp 1 63

    Quandt R Computational Problems and Methods In Z Griliches and M Intriligator eds Handbook of Econometrics Vol 1 Amsterdam North Holland 1983 Quandt R The Econometrics of Disequilibrium New York Basil Blackwell 1988 Quandt R and J Ramsey Estimating Mixtures of Normal Distributions and Switching Regressions Journal of the American Statistical Association 73 December 1978 pp 730 738 Quester A and W Greene Divorce Risk and Wives Labor Supply Behavior Social Science Quarterly 63 1982 pp 16 27 Raftery A and S Lewis How Many Iterations in the Gibbs Sampler In J Bernardo et al eds Proceedings of the Fourth Valencia International Conference on Bayesian Statistics New York Oxford University Press 1992 pp 763 774 Raj B and B Baltagi eds Panel Data Analysis Heidelberg Physica Verlag 1992 Rao C Information and Accuracy Attainable in Estimation of Statistical Parameters Bulletin of the Calcutta Mathematical Society 37 1945 pp 81 91 Rao C Linear Statistical Inference and Its Applications New York John Wiley and Sons 1973 Rasch G Probabilistic Models for Some Intelligence and Attainment Tests Denmark Paedogiska Copenhagen 1960 Reiersøl O Identifiability of a Linear Relation Between Variables Which Are Subject to Error Econometrica 18 1950 pp 375 389 Revankar N Some Finite Sample Results in the Context of Two Seemingly Unrelated Regression Equations Journal of the American Statistical Association 69 1974 pp 187 190 Revankar N Use of Restricted Residuals in SUR Systems Some Finite Sample Results Journal of the American Statistical Association 71 1976 pp 183 188 Revelt D and K Train Incentives for Appliance Efficiency Random Parameters Logit Models of Households Choices

    Manuscript Department of Economics University of California Berkeley 1996 Ridder G and T Wansbeek Dynamic Models for Panel Data In R van der Ploeg ed Advanced Lectures in Quantitative Economics New York Academic Press 1990 pp 557 582 Rivers D and Q Vuong Limited Information Estimators and Exogeneity Tests for Simultaneous Probit Models Journal of Econometrics 39 1988 pp 347 366 Roberts G Convergence Diagnostics of the Gibbs Sampler In J Bernardo et al eds Proceedings of the Fourth Valencia International Conference on Bayesian Statistics New York Oxford University Press 1992 pp 775 782 Robinson C and N Tomes Self Selection and Interprovincial Migration in Canada Canadian Journal of Economics 15 1982 pp 474 502 Robinson P Semiparametric Econometrics A Survey Journal of Applied Econometrics 3 1988 pp 35 51 Rogers W Calculation of Quantile Regression Standard Errors Stata Technical Bulletin No 13 Stata Corporation College Station TX 1993 Rosenblatt D Remarks on Some Nonparametric Estimates of a Density Function Annals of Mathematical Statistics 27 1956 pp 832 841 Rosett R and F Nelson Estimation of the Two Limit Probit Regression Model Econometrica 43 1975 pp 141 146 Rubin H Consistency of Maximum Likelihood Estimators in the Explosive Case In T Koopmans ed Statistical Inference in Dynamic Economic Models New York John Wiley and Sons 1950 Ruud P A Score Test of Consistency Manuscript Department of Economics University of California Berkeley 1982 Ruud P Tests of Specification in Econometrics Econometric Reviews 3 1984 pp 211 242 Ruud P Consistent Estimation of Limited Dependent Variable Models Despite Misspecification of the Distribution Journal of Econometrics 32 1986 pp 157 187

    Ruud P An Introduction to Classical Econometric Theory Oxford Oxford University Press 2000 Salem D and T Mount A Convenient Descriptive Model of the Income Distribution Econometrica 42 6 1974 pp 1115 1128 Savin E and K White The Durbin Watson Test for Serial Correlation with Extreme Sample Sizes or Many Regressors Econometrica 45 8 1977 pp 1989 1996 Sawtooth Software The CBC HB Module for Hierarchical Bayes Estimation http www sawtoothsoftware com Techabs htm 1999 Schimek M ed Smoothing and Regression Approaches Computation and Applications New York John Wiley and Sons 2000 Schmidt P Econometrics New York Marcel Dekker 1976 Schmidt P Estimation of Seemingly Unrelated Regressions with Unequal Numbers of Observations Journal of Econometrics 5 1977 pp 365 377 Schmidt P and R Sickles Some Further Evidence on the Use of the Chow Test Under Heteroscedasticity Econometrica 45 1977 pp 1293 1298 Schmidt P and R Sickles Production Frontiers and Panel Data Journal of Business and Economic Statistics 2 1984 pp 367 374 Schmidt P and R Strauss The Prediction of Occupation Using Multinomial Logit Models International Economic Review 16 1975a pp 471 486 Schmidt P and R Strauss Estimation of Models with Jointly Dependent Qualitative Variables A Simultaneous Logit Approach Econometrica 43 1975b pp 745 755 Schwert W Tests for Unit Roots A Monte Carlo Investigation Journal of Business and Economic Statistics 7 1989 pp 147 159

    Seaks T and K Layson Box Cox Estimation with Standard Econometric Problems Review of Economics and Statistics 65 1983 pp 160 164 Sepanski J On a Random Coefficients Probit Model Communications in Statistics Theory and Methods 29 2000 pp 2493 2505 Shapiro M and M Watson Sources of Business Cycle Fluctuations In O Blanchard and S Fischer eds NBER Macroeconomics Annual MIT Press Cambridge 1988 pp 111 148 Sharpe W Capital Asset Prices A Theory of Market Equilibrium Under Conditions of Risk Journal of Finance 19 1964 pp 425 442 Shaw D On Site Samples Regression Problems of Nonnegative Integers Truncation and Endogenous Stratification Journal of Econometrics 37 1988 pp 211 223 Shephard R The Theory of Cost and Production Princeton Princeton University Press 1970 Shumway R Applied Statistical Time Series Englewood Cliffs N J Prentice Hall 1988 Sickles R D Good and R Johnson Allocative Distortions and the Regulatory Transition of the Airline Industry Journal of Econometrics 33 1986 pp 143 163 Sickles R B Park and L Simar Semiparametric Efficient Estimation of Panel Models with AR 1 Errors Manuscript Department of Economics Rice University 2000 Silk J Systems Estimation A Comparison of SAS Shazam and TSP Journal of Applied Econometrics 11 1996 pp 437 450 Silva J A Score Test for Non Nested Hypotheses with Applications to Discrete Response Models Journal of Applied Econometrics 16 5 2001 pp 577 598 Silver J and M Ali Testing Slutsky Symmetry In Systems of Linear Demand Equations Journal of Econometrics 41 1989 pp 251 266

    Sims C Money Income and Causality American Economic Review 62 1972 pp 540 552 Sims C Exogeneity and Causal Ordering in Macroeconomic Models In New Methods in Business Cycle Research Proceedings from a Conference Federal Reserve Bank of Minneapolis 1977 pp 23 43 Sims C Macroeconomics and Reality Econometrica 48 1 1980 pp 1 48 Smith V Selection and Recreation Demand American Journal of Agricultural Economics 70 1988 pp 29 36 Solow R Technical Change and the Aggregate Production Function Review of Economics and Statistics 39 1957 pp 312 320 Sowell F Optimal Tests of Parameter Variation in the Generalized Method of Moments Framework Econometrica 64 1996 pp 1085 1108 Spector L and M Mazzeo Probit Analysis and Economic Education Journal of Economic Education 11 1980 pp 37 44 Spencer D and K Berk A Limited Information Specification Test Econometrica 49 1981 pp 1079 1085 Srivistava V and T Dwivedi Estimation of Seemingly Unrelated Regression Equations A Brief Survey Journal of Econometrics 10 1979 pp 15 32 Srivistava V and D Giles Seemingly Unrelated Regression Models Estimation and Inference New York Marcel Dekker 1987 Staiger D J Stock and M Watson How Precise are Estimates of the Natural Rate of Unemployment NBER Working Paper Number 5477 Cambridge 1996 Staiger D and J Stock Instrumental Variables Regression with Weak Instruments Econometrica 65 1997 pp 557 586 Stata Stata User s Guide College Station Tex Stata Press 2001 Stern S Two Dynamic Discrete Choice Estimation Problems and Simulation Method Solutions Review of Economics and Statistics 76 1994 pp 695 702

    Stock J Unit Roots Structural Breaks and Trends In R Engle and D McFadden eds Handbook of Econometrics Vol 4 Amsterdam North Holland 1994 Stock J and M Watson Testing for Common Trends Journal of the American Statistical Association 83 1988 pp 1097 1107 Stock J and M Watson Forecasting Output and Inflation The Role of Asset Prices NBER Working Paper 8180 Cambridge Mass 2001 Stoker T Consistent Estimation of Scaled Coefficients Econometrica 54 1986 pp 1461 1482 Stone R The Measurement of Consumers Expenditure and Behaviour in the United Kingdom 1920 1938 Cambridge Cambridge University Press 1954a Stone R Linear Expenditure Systems and Demand Analysis An Application to the Pattern of British Demand Economic Journal 64 1954b pp 511 527 Strang G Linear Algebra and Its Applications New York Academic Press 1988 Strickland A and L Weiss Advertising Concentration and Price Cost Margins Journal of Political Economy 84 1976 pp 1109 1121 Stuart A and S Ord Kendall s Advanced Theory of Statistics New York Oxford University Press 1989 Suits D Dummy Variables Mechanics vs Interpretation Review of Economics and Statistics 66 1984 pp 177 180 Susin S Hazard Hazards The Inconsistency of the Kaplan Meier Empirical Hazard and Some Alternatives Manuscript U S Census Bureau 2001 Swamy P Efficient Inference in a Random Coefficient Regression Model Econometrica 38 1970 pp 311 323 Swamy P Statistical Inference in Random Coefficient Regression Models New York Springer Verlag 1971 Swamy P Linear Models with Random Coefficients In P Zarembka ed Frontiers in Econometrics New York Academic Press 1974

    Swamy P and G Tavlas Random Coefficients Models Theory and Applications Journal of Economic Surveys 9 1995 pp 165 182 Swamy P and G Tavlas Random Coefficient Models In B Baltagi ed A Companion to Theoretical Econometrics Oxford Blackwell 2001 Tanner M Tools for Statistical Inference 2nd ed New York Springer Verlag 1993 Taqqu M Weak Convergence to Fractional Brownian Motion and the Rosenblatt Process Zeitschrift fur Wahrscheinlichkeitstheorie und Verwandte Gebiete 31 1975 pp 287 302 Tauchen H A Witte and H Griesinger Criminal Deterrence Revisiting the Issue with a Birth Cohort Review of Economics and Statistics 3 1994 pp 399 412 Taylor L Estimation by Minimizing the Sum of Absolute Errors In P Zarembka ed Frontiers in Econometrics New York Academic Press 1974 Taylor W Small Sample Properties of a Class of Two Stage Aitken Estimators Econometrica 45 1977 pp 497 508 Telser L Iterative Estimation of a Set of Linear Regression Equations Journal of the American Statistical Association 59 1964 pp 845 862 Terza J Ordinal Probit A Generalization Communications in Statistics 14 1985a pp 1 12 Terza J A Tobit Type Estimator for the Censored Poisson Regression Model Economics Letters 18 1985b pp 361 365 Terza J Estimating Count Data Models with Endogenous Switching and Sample Selection Department of Economics Penn State University Working Paper IPRE95 14 1995 Terza J Estimating Count Data Models with Endogenous Switching Sample Selection and Endogenous Treatment Effects Journal of Econometrics 84 1 1998 pp 129 154 Terza J and D Kenkel The Effect of Physician Advice on Alcohol Consumption Count Regression with an Endogenous

    Treatment Effect Journal of Applied Econometrics 16 2 2001 pp 165 184 Theil H Economic Forecasts and Policy Amsterdam North Holland 1961 Theil H Principles of Econometrics New York John Wiley and Sons 1971 Theil H Linear Algebra and Matrix Methods in Econometrics In Z Griliches and M Intriligator eds Handbook of Econometrics Vol 1 New York North Holland 1983 Theil H and A Goldberger On Pure and Mixed Estimation in Economics International Economic Review 2 1961 pp 65 78 Thursby J Misspecification Heteroscedasticity and the Chow and Goldfeld Quandt Tests Review of Economics and Statistics 64 1982 pp 314 321 Tobin J Estimation of Relationships for Limited Dependent Variables Econometrica 26 1958 pp 24 36 Toyoda T and K Ohtani Testing Equality Between Sets of Coefficients After a Preliminary Test for Equality of Disturbance Variances in Two Linear Regressions Journal of Econometrics 31 1986 pp 67 80 Train K Halton Sequences for Mixed Logit Manuscript Department of Economics University of California Berkeley 1999 Train K A Comparison of Hierarchical Bayes and Maximum Simulated Likelihood for Mixed Logit Manuscript Department of Economics University of California Berkeley 2001 Train K Discrete Choice Methods with Simulation Cambridge Cambridge University Press 2002 Trivedi P and A Pagan Polynomial Distributed Lags A Unified Treatment Economic Studies Quarterly 30 1979 pp 37 49 Tsay R Analysis of Financial Time Series New York John Wiley and Sons 2002 Tunali I A General Structure for Models of Double Selection and an Application to a Joint Migration Earnings Process with

    Remigration Research in Labor Economics 8 1986 pp 235 282 United States Department of Commerce Statistical Abstract of the United States Washington DC U S Government Printing Office 1979 United States Department of Commerce Bureau of Economic Analysis National Income and Product Accounts Survey of Current Business Business Statistics 1984 Washington DC U S Government Printing Office 1984 Veall M Bootstrapping the Probability Distribution of Peak Electricity Demand International Economic Review 28 1987 pp 203 212 Veall M Bootstrapping the Process of Model Selection An Econometric Example Journal of Applied Econometrics 7 1992 pp 93 99 Veall M and K Zimmermann PseudoR2 in the Ordinal Probit Model Journal of Mathematical Sociology 16 1992 pp 333 342 Vinod H Bootstrap Jackknife Resampling and Simulation Applications in Econometrics in Maddala G C Rao and H Vinod eds Handbook of Statistics Econometrics Vol II Chapter 11 Amsterdam North Holland 1993 Vinod H and B Raj Economic Issues in Bell System Divestiture A Bootstrap Application Applied Statistics Journal of the Royal Statistical Society Series C 37 2 1994 pp 251 261 Vuong Q Likelihood Ratio Tests for Model Selection and Non nested Hypotheses Econometrica 57 1989 pp 307 334 Vytlacil E A Aakvik and J Heckman Treatment Effects for Discrete Outcomes When Responses to Treatments Vary Among Observationally Identical Persons An Application to Norwegian Vocational Rehabilitation Programs Journal of Econometrics 2002 forthcoming Waldman D A Note on the Algebraic Equivalence of White s Test and a Variant of the Godfrey Breusch Pagan Test for

    Heteroscedasticity Economics Letters 13 1983 pp 197 200 Wallace T and A Hussain The Use of Error Components in Combining Cross Section with Time Series Data Econometrica 37 1969 pp 55 72 Wang P I Cockburn and M Puterman Analysis of Panel Data A Mixed Poisson Regression Model Approach Journal of Business and Economic Statistics 16 1 1998 pp 27 41 Watson M Vector Autoregressions and Cointegration In R Engle and D McFadden eds Handbook of Econometrics Vol 4 Amsterdam North Holland 1994 Wedel M and W DeSarbo J Bult and V Ramaswamy A Latent Class Poisson Regression Model for Heterogeneous Count Data Journal of Applied Econometrics 8 1993 pp 397 411 Weeks M Testing the Binomial and Multinomial Choice Models Using Cox s Nonnested Test Journal of the American Statistical Association Papers and Proceedings 1996 pp 312 328 Weiss A Asymptotic Theory for ARCH Models Stability Estimation and Testing Discussion Paper 82 36 Department of Economics University of California San Diego 1982 Weiss A Simultaneity and Sample Selection in Poisson Regression Models Manuscript Department of Economics University of Southern California 1995 West K On Optimal Instrumental Variables Estimation of Stationary Time Series Models International Economic Review 42 4 2001 pp 1043 1050 White H A Heteroscedasticity Consistent Covariance Matrix Estimator and a Direct Test for Heteroscedasticity Econometrica 48 1980b pp 817 838 White H Maximum Likelihood Estimation of Misspecified Models Econometrica 53 1982a pp 1 16 White H Instrumental Variables Regression with Independent Observations Econometrica 50 2 1982b pp 483 500

    White H ed Non Nested Models Journal of Econometrics 21 1 1983 pp 1 160 White H Asymptotic Theory for Econometricians Revised New York Academic Press 2001 Wickens M A Note on the Use of Proxy Variables Econometrica 40 1972 pp 759 760 Willis R and S Rosen Education and SelfSelection Journal of Political Economy 87 1979 pp S7 S36 Windmeijer F Goodness of Fit Measures in Binary Choice Models Econometric Reviews 14 1995 pp 101 116 Winkelmann R Econometric Analysis of Count Data 2nd ed Heidelberg Germany Springer Verlag 1997 Winkelmann R Econometric Analysis of Count Data Heidelberg Germany Springer Verlag 2000 Witte A Estimating an Economic Model of Crime with Individual Data Quarterly Journal of Economics 94 1980 pp 57 84 Wong W On the Consistency of Cross Validation in Kernel Nonparametric Regression Annals of Statistics 11 1983 pp 1136 1141 Woolridge J Specification Testing and Quasi Maximum Likelihood Estimation Journal of Econometrics 48 1 2 1991 pp 29 57 Woolridge J Selection Corrections for Panel Data Models Under Conditional Mean Assumptions Journal of Econometrics 68 1995 pp 115 132 Woolridge J Quasi Likelihood Methods for Count Data In M Pesaran and P Schmidt eds Handbook of Applied Econometrics Vol II Microeconomics London Blackwell Publishers 1997 Woolridge J Econometric Analysis of Cross Section and Panel Data Cambridge MIT Press 1999 Woolridge J Introductory Econometrics A Modern Approach New York Southwestern Publishers 2000

    Working E What Do Statistical Demand Curves Show Quarterly Journal of Economics 41 1926 pp 212 235 Wu D Alternative Tests of Independence Between Stochastic Regressors and Disturbances Econometrica 41 1973 pp 733 750 Wynand P and B van Praag The Demand for Deductibles in Private Health Insurance A Probit Model with Sample Selection Journal of Econometrics 17 1981 pp 229 252 Yatchew A Nonparametric Regression Techniques in Econometrics Journal of Econometric Literature 36 1998 pp 669 721 Yatchew A An Elementary Estimator of the Partial Linear Model Economics Letters 57 1997 pp 135 143 Yatchew A Scale Economies in Electricity Distribution Journal of Applied Econometrics 15 2 2000 pp 187 210 Yatchew A and Z Griliches Specification Error in Probit Models Review of Economics and Statistics 66 1984 pp 134 139 Zarembka P Transformations of Variables in Econometrics In P Zarembka ed Frontiers in Econometrics Boston Academic Press 1974 Zavoina R and W McElvey A Statistical Model for the Analysis of Ordinal Level Dependent Variables Journal of Mathematical Sociology Summer 1975 pp 103 120 Zellner A An Efficient Method of Estimating Seemingly Unrelated Regressions and Tests of Aggregation Bias Journal of the American Statistical Association 57 1962 pp 500 509

    Zellner A Estimators for Seemingly Unrelated Regression Equations Some Exact Finite Sample Results Journal of the American Statistical Association 58 1963 pp 977 992 Zellner A Introduction to Bayesian Inference in Econometrics New York John Wiley and Sons 1971 Zellner A Statistical Analysis of Econometric Models Journal of the American Statistical Association 74 1979 pp 628 643 Zellner A Bayesian Econometrics Econometrica 53 1985 pp 253 269 Zellner A and D Huang Further Properties of Efficient Estimators for Seemingly Unrelated Regression Equations International Economic Review 3 1962 pp 300 313 Zellner A J Kmenta and J Dreze Specification and Estimation of Cobb Douglas Production Functions Econometrica 34 1966 pp 784 795 Zellner A and C K Min Gibbs Sampler Convergence Criteria Journal of the American Statistical Association 90 1995 pp 921 927 Zellner A and N Revankar Generalized Production Functions Review of Economic Studies 37 1970 pp 241 250 Zellner A and A Siow Posterior Odds Ratios for Selected Regression Hypotheses with Discussion In J Bernardo M DeGroot D Lindley and A Smith eds Bayesian Statistics Valencia Spain University Press 1980 Zellner A and H Theil Three Stage Least Squares Simultaneous Estimation of Simultaneous Equations Econometrica 30 1962 pp 63 68

    Greene 50240

    gree50240 Au Ind

    July 16 2002

    21 24

    AUTHOR INDEX

    A
    Aakvik A 697n30 Abowd J 714n54 Abramovitz M 533 687n23 919n2 921 927 928 929 Adkins L 58n4 Affleck Graves J 356 Afifi T 60n7 Ahn S 308 310 313 314 552 554 Aigner D 88 102n5 361 501 502 948 Aitchison J 854n4 Aitken A C 207 208 Akaike H 565 Akin J 709n48 Albert J 922n6 Aldrich J 666n2 667n3 Ali M 222n9 362n24 Allenby G 727 Allison P 747 Almon S 566 Altonji J 704n45 Alvarez R 286 320 333 Amemiya T 36n5 162n1 167 194n3 195 197 228n15 260 298n14 310 389n12 403n18 448n21 461 462n27 463 465 476 480n3 482 518n25 519 538n8 565n3 663 666n2 672 672n9 676 683n16 684n17 685n18 700 701 702 756n1 761n9 767 771n16 929 939n30 941n32 Anderson E 698 Anderson G 361n21 Anderson R 368n35 Anderson T 195 197 260 301n21 308n27 402n16 413 Andrews D 133n12 139 140 140n17 141 142 142n19 143 448n21 602 660 Aneuryn Evans G 156 Angrist J 4 425 787n28 789n32 Arabmazar A 768 Arellano M 308 309 310 314 315 554 690n27 697 697n30 698n31 708 Arrow K 366 Arsanjani F 773 790n34 Ashenfelter O 4 10 88 Attfield C 349n17 Avery R 690n26 694

    B
    Bai J 142 142n19 143 Baillie R 280 647n23 648 Balakrishnan N 759n5 849n1 Balestra P 298n15 301n21 308n27 Baltagi B 284n2 291n9 301n21 316 318 319 343n6 426n1 698n31 Baltagi G 341n3 Barnow B 788n31 Bartels R 343n6 361 Barten A 364n26 Bassett G 224 225 448 465 Battese G 298n14 Bazzara M 934n19 Beach C 273 274 Beck N 286 323 330 333 334 690n27 709 709n48 Beggs S 736n63 Bekker P 389n12 Bell W 608n1 Belsley D 58 61 943 Ben Akiva M 683n16 684 Ben Porath Y 284 Bera A 225 771 772n17 Berk K 414 Bernard J 579n8 Berndt E 12 103n6 360n19 366n29 366n30 367n33 368 369 369n36 481n4 492n10 498 501 672 712 940 942 950 Berry S 512 728 Bertschek I 441 442 515 694 700n35 951 Berzeg K 299n17 Beyer A 658 659 660 Bhargava A 300n20 308n27 314n32 Bhat C 514 727 728 Bickel P 447 Billingsley P 74n8 261n6 Binkley J 343n9 Birkes D 448n21 Bissey M E 703n43

    Black F 352 354 355 356 357 Blanchard O 598 599n14 Blundell R 312n30 361n21 673 679n13 768n13 Bockstael N 773 790n34 Bollerslev T 238 240 241 242 244 245 245n32 246n34 249 949 Bond S 308 309 310 312n30 Boot J 329n39 348 Börsch Supan A 714n55 932n18 Boskin M 719 Bover O 308 309 310 314 554 697n30 Box G 173n3 257n5 269 272n14 274 608 608n1 619 620 621 622 921 Boyes W 673 685 713 Brannas K 749 790n34 Bray T 921 Breslaw J 932n17 Breusch T 223 224n11 231 249 269 270 298 318n34 327 350 506 769 Brock W 669n5 Brown B 135 362n23 Brown C 768 Brown J 854n4 Brundy J 79 Bult J 441 Burnett N 715 716 Burnside C 550n10 Buse A 209n13 484n5 Butler J 690 692 693 Butler R 448n22

    C
    Cain G 788n31 Caldwell S 766 Calzolari G 239n24 244n29 Cameron A 743 744 745 746 Cameron C 740n66 741n68 742 748 Campbell J 5 351n18 356 549 608n1 Cardell S 736n63 Carlin B 922n6 Carson R 774 Casella G 922 923n7

    179 182 190 196n4 220 243 261 263n7 265 267n10 343n7 382n6 401 403 461 462 465 474 480 496 501 534n3 621n12 632n14 635n15 655 656 678 679n13 681 682 Deaton A 154n5 156 362n23 366n30 Debreu G 501 Dempster A 774 Dent W 592n10 DeSarbo W 441 Deschamps P 361n21 deWitt G 329n39 348 Dezhbaksh H 270n13 Dhrymes P 74n8 406n20 408 414n27 573n7 663 756n1 Dickey D 572 573 602 608n1 637 638a 646 Diebold F 159 160 558n1 565n3 587n9 595 608n1 Dielman T 284n2 Diewert E 366n29 366n31 Diggle R 690n26 DiNardo J 449 Ding Z 647 Doan T 600 Dodge Y 448n21 Doksum K 447 Domowitz I 238 Doob J 261n6 Doornik J 585 Dreze J 125 Duncan G 199n6 220n5 768n13 771n15 789 Durbin J 80 135 270 270n12 273 275 414n27 Durlauf S 669n5 Dwivedi T 340n2 342 Evans J 135 Evans M 849n1

    Caudill S 666n2 Caves D 174 Cecchetti S 4 588 596 596n13 599n14 600 602 639 640 Chamberlain G 300n20 301n22 436n8 534n3 602 697n29 697n30 698 699 708 709 799 Chambers R 362n23 Charlier E 702n41 Chatfield C 624 628 Chavez J 362n23 Chen T 697n30 Chenery H 366 Chesher A 671n8 771 772n17 776 793 Cheung C 702 761 768n12 Cheung Y 647 649 Chiappori R 708 Chib S 426 446 922n6 Chintagunta P 728 Chou R 240 Chow G 130n8 318n34 Christensen Associates 949 Christensen L 12 91 92 103n6 126 127 162 174 362n24 364 366n29 366n30 366n31 367n33 368n34 377 451 452 524 948 950 Cleveland W 458 Cochrane D 273 274 275 318 360n20 Cockburn I 441 Conlen J 446 Conniffe D 341n3 343n8 Conway D 145 146 Cooley T 379n2 Cornwell C 287n5 Coulson N 238 Council of Economic Advisors 946 947 Cox D 153 156 173n3 686n22 794n37 799 Cragg J 86n13 199n8 216n2 238 401 411n25 413n26 683n16 749n74 770 776 Cramér H 476 479 Cramer J 683n16 684 685 719 Cumby R 401 409 534n3

    F
    Fair R 113n10 133n12 761 774 775 780 952 Farber H 714n54 Farrell M 501 Feiberg D 361 Feibig D 340n2 343n6 347n13 361 Feldstein M 113 Fernandez A 669n5 673 704 Fernandez L 771n15 Ferrier G 935n25 Fin T 770 Finney D 686n22 Fiorentini G 239n24 244n29 Fisher F 132 Fisher G 155n7 680n14 Fisher R 428 525 Flannery B 705n47 833n11 919n2 920 928 929 934n23 935n24 935n25 Fletcher R 934n19 935 938n29 Florens J 790n35 Flores Lagunes A 663 Fomby T 174n6 228n17 301n21 322n36 666n2 912 Fougere D 790n35 French K 240 Friedman M 8 Frisch R 1 27 Fry J 362n23 Fry T 362n23 Fuller W 228n16 272n14 298n14 572 573 602 624 637 638a 638b 640a 646

    G
    Gabrielsen A 389n12 Gali J 596n13 602 Gallant A 382n6 Gallant R 382n6 Garber S 86n13 121n2 Garrett G 286 320 333 Garvin S 341n3 Gaver K 155 Geisel M 155 Gelfand A 922n6 Gelman A 446 Gentle J 919n2 George E 922 923n7 Gerfin M 704 705 Geweke J 565n3 592n10 608n1 647n23 648 690n26 714n55 930n14 932n17 932n18 Ghysels E 238 245 Giaccotto C 222n9

    E
    Efron B 683n16 684 685 719 924n9 925 Eichenbaum M 550n10 Eicker F 199n6 220n5 Elashoff R 60n7 Elliot G 645 Enders W 558n1 608n1 617 621 Engle R 216n2 238 238n21 239 240 242 249 381n5 382 558n1 591 647 650n25 654 655 656 657 658 659 679n13 Epstein D 690n27 709 709n48 Ericsson T 659 Estes E 451n24 Evans G 636n20

    D
    Dahlberg M 314 551 553 588 604 605 951 Dale S 4 Dastoor N 154n5 Davidson J 426n1 540 Davidson R 82 154 155 156 162n1 164 167 170 178

    Gilbert R 924n8 Giles D 340n2 344n12 Ginter J 727 Gloeckler L 697n29 Godfrey L 158 269 270 484 496 772n17 Goffe W 935n25 Goldberger A 280n16 343n8 359 360 387n10 411n24 419n30 436n9 701 702 761 768n12 771n15 788n30 788n31 789 837 898 Goldfeld S 162n1 180n12 223 249 403 933 938 938n28 942n33 Good D 118 Gordin M 264n8 265 Gourieroux C 152n3 154n5 246n34 512n24 514 518n25 521 701n38 772n17 932n17 Granger C 153n4 381n5 558n1 579 592 608n1 621 624 632 633 635 647 647n23 650n25 654 655 656 657 Greenberg E 426 446 907n4 912 922n6 Greene W 71n6 91 92 100n4 118 118n1 126 127 186 287n5 319 364 366n31 368n34 377 429 441 442 451 452 501n18 514 524 667n3 673 690 690n27 693 696 698n31 701 702 702n40 703n44 712n52 713 715 716 727 729 739 740n66 741n68 745n71 749 750 751 761 761n8 764n10 766 768n12 774 776 784n23 785n25 790 790n34 797n39 799 855n6 917n8 920 929 948 949 950 951 952 Greenstadt J 938 Griesinger H 693 Griffin W 316 Griffiths W 58 67n2 150 167 177n9 209n13 239n24 280 290 295n12 297n13 301n21 344n12 347n13 360n20 395n14 413n26 430n2 434n6 436n8 462n27 565n3 608n1 Griliches Z 59 60n7 86n13 211 274 329n39 573n7 679 743 744 747 797n41 924n8 Grogger J 774 Gronau R 782n21 Grunfeld Y 329n39 340 Gruvaeus G 943n35 945 Guilkey D 360n19 367n33 709n48 Gurmu S 58 741n68 743 743n70 744


    H
    Haavelmo T 378 Haitovsky Y 60n7 Hajivassiliou V 714n55 932n18 Hakkio C 238 Hall A 140 142 158 159n10 949 Hall B 481n4 498 501 672 712 743 744 747 767n11 797n41 919n1 940 942 942n33 943n35 Hall R 481n4 498 501 525 526 548 549 672 712 940 942 Hamilton J 263 265 280 558n1 576 578 586 592n11 594n12 595 608n1 621n12 624 637n21 646 650n25 654 Han A 800 Hansen B 142n18 Hansen L 80n9 134 142n18 526 534n3 537 537n7 538 538n8 542 545 549 690n26 694 Härdle W 448n21 453n26 458 704n46 Hartley M 565n4 Harvey A 223 226n13 232 233 236 242 243 249 558n1 573n7 576 608n1 632n14 680 Hashimoto N 344n10 Hastings N 849n1 Hatanaka M 277 624 Hausman J 80 81 82 213 301 301n22 303 304 305 308 310 387n10 395n14 408 414 481n4 498 501 672 680n14 699 712 724 725 736n63 743 744 747 761n8 797n41 800 940 942 Hayashi F 140 200n9 264n8 265 426n1 465 534n3 540 914n5 Heaton J 537n7 542 Heckman J 4 440n15 666n2 672n10 697 697n30 708 709 761n7 780n18 782n21 784 785 789 790n35 855n7 Heilbron D 750 Hendry D 151 238 278 381n5 382 558n1 583 585 591 608n1 658 659 Hensher D 719 723 726 729 729n62 951 Heshmati A 361

    Hildebrand G 102n5 948 Hildreth C 273 318n34 Hill C 58 58n4 67n2 150 167 174n6 177n9 209n13 228n17 239n24 280 290 295n12 297n13 301n21 322n36 344n12 347n13 360n20 395n14 413n26 430n2 434n6 436n8 462n27 565n3 608n1 666n2 912 Hite S 59n5 Hoffman D 673 685 713 Holly A 382n6 Holt M 360n19 Holtz Eakin D 313n31 314 554 602 Honore B 451n24 704n45 708 709 789 Horn A 199n6 220n5 Horn D 199n6 220n5 Horowitz J 470 673 680n14 702n41 703n42 704 705 772n17 Hosking J 647n23 Hotz J 690n26 694 Houck C 318n34 Houthakker H 215n1 Hsiao C 284n2 299n16 301n21 308n27 318n34 385n8 690n27 697 699n34 Huang D 344n11 Huber P 448 518n25 519 Huizinga J 401 409 534n3 Hurd M 768 Hurst H 647n23 Hussain A 298n14 Hwang H 341n3 Hyslop D 84n10 709 710 714n55 715

    I
    Im E 341n3 Imbens G 84n10 Imhof J 271 Inkmann J 694 Irish M 671n8 771 772n17 776 793

    J
    Jackman S 690n27 709 709n48 Jain D 728 Jakubson G 709 Jansen D 608n1 Jarque C 225 762 771 772n17 Jayatissa W 134n13 Jenkins G 257n5 272n14 608 608n1 619 620 621 Jennrich R I 162n1


    Kim I 645 Kim M 949 Kiviet J 307n26 361n21 Klein L 381 950 Klein R 704 705 706 Klepper S 86n13 Kmenta J 125 129 212 229 236 248 299 322n36 327 347 349n16 367n33 917n9 924n8 Knapp L 680n14 Knight F 501 Kobayashi M 134 134n13 Koenker R 224 225 448 449n23 450 465 Kotz S 672n10 744 750 759n5 781n20 849n1 854n3 Kraft D 242 249 Krailo M 698n31 Kroner K 240 Krueger A 4 10 88 Kuh E 58 61 Kumbhakar S 361 501n18 Kyriadizou T 708 709 789 Kyriazidou E 702n41 Lerman R 714 932n16 Lerman S 673 683n16 684 LeRoy S 379n2 Levi M 86n13 Levinsohn J 512 728 Lewbel A 452 453 704n45 708 Lewis H 782n21 Lewis S 923n7 Li W 238n21 Liang K 690n26 Liang P 690n26 Lilien D 238 240 Lillard L 317n33 Ling S 238n21 Lintner J 352 353 356 Litterman R 586 595 Little S 683n16 684 Liu T 102n5 413 414 948 Ljung G 269 622 Lo A 5 351n18 356 647 Loeve M 910 Long S 5 756n1 Longley J 58 948 Louviere J 726 Lovell K 102n5 367n33 501 501n18 502 504 505 524 948 Low S 673 685 713 Lu J 273 Lucas R 587 658 659 Lumsdaine R 142n19 Lutkepohl H 558n1 589 590

    Jensen M 347n14 Jobson J 228n16 Johansen S 655 656 657 659 Johanssen P 749 Johansson E 314 551 553 588 604 605 951 Johnson N 672n10 744 750 759n5 781n20 849n1 854n3 Johnson R 58 118 324 Johnson S 174n6 228n17 301n21 322n36 666n2 912 Johnston J 402n16 449 917n8 Jondrow J 504 505 524 Jones J 708 Joreskog K 349n15 919n1 943n35 945 Jorgenson D 79 162 362n24 366n29 366n30 367n32 367n33 573 Journal of Applied Econometrics 426n1 553 951 Journal of Business and Economic Statistics 146 426n1 787n28 Journal of Econometrics 426n1 Joyeux R 647n23 Judd K 919n2 Judge G 58 67n2 150 167 168 177n9 209n13 239n24 280 290 295n12 297n13 301n21 344n12 347n13 360n20 395n14 413n26 426 430n2 434n6 436n8 462n27 518n25 540 565n3 608n1 666n2 878n1 937n27 941 Juselius K 657

    L
    L Ecuyer P 920 Lahiri K 284n2 Laird N 774 Laisney F 673 Lambert D 750 751 774 Lancaster T 671n8 697 756n2 772n17 790n35 794 794n37 797 Land K 441 Landers A 59n5 Landwehr J 708 Lange P 286 320 333 Lau L 162 362n24 366n30 367n33 Lawless J 799 Layson K 173n5 498n16 Leamer E 86n13 436n8 878n1 LeCam L 480n3 Lechner M 441 442 515 673 694 700n35 951 Lee L 283n1 284n2 743 771 772n17 Lee M 284n2 702n41 756n1 771n15 771n16 Lee T 58 67n2 150 167 177n9 209n13 239n24 280 290 295n12 297n13 301n21 344n12 347n13 360n20 395n14 413n26 430n2 434n6 436n8 565n3 608n1 Leff Nathaniel 40

    M
    MacDonald G 448n22 MacKinlay A 5 351n18 356 MacKinnon J 82 154 155 156 162n1 164 167 170 178 179 182 190 196n4 199n6 220 220n5 243 261 263n7 265 267n10 273 274 343n7 382n6 401 403 461 462 465 474 480 496 501 534n3 621n12 678 679n13 681 682 MaCurdy T 310 666n2 697 709 Maddala G 60n6 284n2 297n13 298n14 299n18 430n2 573n7 636n19 645 663 667n3 672n9 683n16 695 715 716 727 756n1 761n9 768 781n20 786 787n27 788n31 849n1 907n3 917n6 Magnac T 708 Magnus J 3n2 837n14 Malinvaud E 162n1 538n8 Mandy D 361

    K
    Kalbfleisch J 771 790n35 794 795 797 799 Kamlich R 146 Kang H 635n17 Kaplan E 797 798 800 Katz J 286 323 330 333 334 Kay R 683n16 684 Keane M 690n26 709 714n55 932n17 932n18 Kelejian H 403 404 Kelly J 394n13 Kenkel D 790 Kennan J 794 800 952 Kennedy W 919n2 Kerman S 341n3 Keuzenkamp H 3n2 Keynes J 1 8 Kiefer N 225 712n51 756n2 790n35 794 797 799 800 Killian L 595 600 Kim H 703n42


    Mankiw G 549 Mann H 74n8 260 636 Manski C 536 673 702n41 703n42 703n44 704 704n46 707 708 709 714 780n18 789 932n16 Marcus A 739 Mariano R 403 406 Markowitz H 352 353 Marsaglia G 921 Martins M 789 Martins Fillho C 361 Materov I 504 505 524 Matyas L 284n2 534n3 Matzkin R 704 704n45 Mazodier P 316 Mazzeo M 675 703 951 McAleer M 152n3 155n7 238n21 McCallum B 69n5 88 McConnell K 773 790n34 McCullagh P 147 743 745n71 746 747 952 McCullough B 113 170 238n21 239n24 245n32 246n34 576 579 McDonald B 356 McDonald J 448n22 766 McElroy M 345 McFadden D 319 461 465 482 512 534n3 540 550n10 558n1 663 683 683n16 684n17 714 719 724 724n60 727 932 932n17 McKelvey W 683n16 684 685 736 McKenzie C 684 McLaren K 360n19 362n23 Mead R 935n24 Meese R 565n3 592n10 Meier P 797 798 800 Melenberg B 448n22 702n41 708 762 768n13 770 771 Melino A 497 Merton R 240 Messer K 220n6 Meyer B 800 Miller D 168 177n9 426 430n2 518n25 540 Miller R 608n1 Mills T 238n21 558n1 608n1 621 Min C K 923n7 Minhas B 366 Mittelhammer R 168 177n9 426 430n2 518n25 540 Mizon G 153 154n5 253 278 360 361n21 582 Moffitt R 690 692 693 766 768 Monfort A 152n3 154n5 246n34 512n24 514 518n25 521 701n38 772n17 932n17 Moody's Industrial Manual 950 Moro D 360n19 Moschino G 360n19 Mouchart M 790n35 Mount T 299n18 855n5 Mroz T 51 786 947 Muellbauer J 362n23 366n30 Mullahy J 666n2 749 750 774 779 Muller M 921 Mundlak Y 294n10 Murphy K 184 509 510 785n26 Ord S 152n2 473 474 477 530n1 849n1 Orme C 243 683 Osterwald Lenum M 657 Ouliaris S 656


    P
    Pagan A 223 224n11 231 240 249 298 318n34 327 350 453n26 456 460 465 505n20 506 507 534n3 565n4 608n1 769 771n15 772 772n17 776 880 Pagano M 565n4 Pakes A 512 714n55 728 Panattoni L 239n24 244n29 Panel Study of Income Dynamics 947 Park R 283n1 Parks R 360 Patterson K 426n1 558n1 608n1 621 Peacock B 849n1 Perron P 142n19 608n1 635n15 644 Pesaran B 682 Pesaran H 152n3 155 156 156n9 158 284n2 Pesaran M 153n4 158 159 159n10 501n18 682 949 Petersen D 768 Petersen T 792n36 Phillips A 251 Phillips G 223 361n21 Phillips P 401 634 635 635n15 644 656 Pierce D 269 274 622 Pike M 698n31 Ploberger W 140 141 142 143 602 660 Plosser C 636n18 Poirier D 121n2 426 430n2 497 714n54 849n1 878n1 930n14 Polachek S 146 Pollard D 714n55 Pollard J 703n42 Porter Hudak S 647n23 648 Powell J 448n22 771 771n16 780n18 Powell M 943n35 Prais S 215n1 273 274 275 318 325 326 360n20 Prentice R 697n29 771 790n35 794 795 797 799 Press S 721n57 Press W 705n47 833n11 919n2 920 928 929 934n23 935n24 935n25

    N
    Nagin D 441 680n14 Nakamura A 414n27 764n10 Nakamura M 414n27 764n10 Nakosteen R 669 670 786 787 National Income and Product Accounts 951 National Institute of Standards and Technology NIST 833n12 Nelder J 147 743 745n71 746 747 935n24 952 Nelson C 343n9 635n17 636n18 Nelson F 666n2 667n3 764n10 768 771 772n17 Nelson R 448n22 Nerlove M 124 125n3 126 127 270n13 295n11 298n15 299n16 301n21 308n27 364 366n30 451 524 573n7 608n1 721n57 950 Neudecker H 837n14 Neumann G 772n17 New York Post 759n6 Newbold P 558n1 579 608n1 621 624 632 633 635 Newey W 140 200 267 280 313n31 314 461 465 482 534n3 540 546 549 550 550n10 554 602 645 697n30 771n16 772n17 780n18 789 Neyman J 697 Nickell S 300n19 307n26

    O
    O Halloran C 690n27 Oates D 794n37 Oaxaca R 53 Oberhofer W 212 229 236 248 299 327 347 349n16 Obstfeld M 401 409 534n3 Ohtani K 133 134 222n8 344n10 Olsen R 480 767 Orcutt G 273 274 275 318 360n20 766


    Psychology Today 774 Puterman M 441

    S
    Salem D 855n5 Salmon M 225 Sargan D 608n1 Sargan J 300n20 308n27 314n32 Savin E 360n19 492n10 Savin N 636n20 958 Sawtooth Software 447n20 Schimek M 458 Schipp B 361n21 Schmidt P 102n5 118 133 287n5 308 310 313 314 341n3 360n19 399n15 403n17 407n21 501 501n18 502 504 505 524 552 554 719 720 768 770 948 Schwert W 240 644 646 Scott E 697 Seaks T 100n4 118n1 173n5 498n16 680n14 Segerson K 362n23 Sen A 140 142 Sepanski J 709n48 Sevestre P 284n2 Shapiro M 596n13 602 Sharpe W 352 353 356 Shaw D 757n4 773 774 Shephard R 366 Shetty C 934n19 Shumway R 624 Sickles R 118 133 283n1 367n33 709n48 Silk J 409n22 Silva J 154n6 682 683 Silver J 362n24 Simar L 283n1 Sims C 381n5 586 588 592 592n10 Singer B 440n15 790n35 Singleton K 526 534n3 549 Siow A 438 439 Smith A 922n6 Smith V 790n34 Snyder J 666n2 Social Security Administration 787 Solow R 144 145 284n4 366 949 Sorbom D 919n1 Sowell F 141 Spady R 704 705 706 Spector L 675 703 951 Spencer D 414 Srivastava K 342 Srivastava V 340n2 344n12 Staiger D 80n9 251 Stambaugh R 240 Stata 449n23 690n26

    Q
    Quah D 598 599n14 Quandt R 162n1 223 249 403 529 537n6 788n31 933 934n19 935 938 938n28 942n33 943n35 Quester A 761 766

    R
    Raftery A 923n7 Raj B 284n2 Ramaswamy V 441 Rao C 479 697n29 909 912 928 940n31 Rao P 211 274 924n8 Rasch G 698 Reiersøl O 387n10 Renault E 701n38 772n17 Renfro C 238n21 239n24 245n32 246n34 Revankar N 190 343n8 364n27 449 498 499 505 949 Revelt D 728 Review of Economics and Statistics 714n55 728 932n17 Rich R 4 588 596 599n14 600 639 640 Richard J 153 154n5 253 381n5 382 591 658 659 Ridder G 307n26 Rilstone P 58 Rivers D 772n17 Roberts G 923n7 Roberts H 145 146 Robins R 238 240 Robinson C 669n5 Robinson P 771n16 Rodgers J 935n25 Rodriguez Poo J 669n5 673 704 Rogers W 448 Rosen H 313n31 314 554 602 Rosen S 669n5 788n29 Rosenblatt D 455 Rosett R 764n10 Rothenberg T 645 Rothschild M 238n21 Rubin D 446 774 Rubin H 402n16 413 636 Runkle D 690n26 932n17 Ruud P 165 263 426n1 428 674 701n38 702 736n63 772 772n17 932 932n17

    Statistical Abstract of the United States 760 Stegun I 533 687n23 919n2 921 927 928 929 Stern H 446 Stern S 58 932n17 Stock J 80n9 142n19 251 587n9 592n10 608n1 645 653 654 655 656 Stoker T 702 Stone R 362n23 Strand I 773 790n34 Strang G 826n7 Strauss R 719 720 Strickland A 404 Stuart A 152n2 473 474 477 530n1 849n1 Suits D 118n1 289n8 Susin S 799 Swait J 726 Swamy P 318n34 319 319n35

    T
    Tanner M 922n6 Taqqu M 647n23 Tauchen H 693 Tavlas G 318n34 319n35 Taylor L 448n21 Taylor W 82 211 219n4 301n22 303 304 305 308 310 Telser L 342n5 Terza J 719 745n71 749 774 790 790n34 Teukolsky S 705n47 833n11 919n2 920 928 929 934n23 935n24 935n25 Theil H 86n13 113n11 273 344n12 401 403n17 406 436n9 826n6 833n13 889 917n7 Thompson S 702n41 703n42 Thornton D 608n1 Thursby J 134n13 223n10 368n35 Tibshirani R 924n9 925 Tobin J 761 764 Tomes N 669n5 Topel R 184 509 510 785n26 Toyoda T 133 222n8 Train K 319 445 447 512 514 728 Trethaway M 174 Trivedi P 565n4 740n66 741n68 743 743n70 744 745 746 748 Trognon A 154n5 246n34 316 518n25 521 701n38 772n17


    Trotter H 938 Trumble D 238 Tsay R 558n1 Tunali I 669n5 686 Walker J 771n16 780n18 Walker M 362n23 Wallace T 298n14 Wallis K 270n13 Wang P 441 Wansbeek T 307n26 389n12 Watson G 270n12 275 Watson M 251 587n9 592n10 596n13 602 608n1 650n25 Waugh F 27 Webster C 907n4 912 Wedel M 441 Weeks M 152n3 155 156n9 683 Weiss A 245n34 790n34 Weiss L 404 Welsch R 58 61 Wertheimer R 766 West K 140 200 267 280 309n29 546 549 550 645 White H 67n3 80n9 106 152n3 178 190 199n6 199n7 200 220n5 220n6 245 246n34 261 263 264n8 265 267 382n6 448n22 518n25 540 546 673 674 897n1 910 914n5 White K 958 White S 448n22 Wichern D 58 324 Wickens M 88 534n3 Willis R 317n33 669n5 788n29 855n7 Windmeijer F 683n16 741n68 742


    U
    U S Department of Commerce 951 Uhler R 683n16 Ullah A 240 453n26 456 460 465 880

    V
    van Praag B 714n53 van Soest A 448n22 702n41 708 762 768n13 770 771 Veall M 579n8 683n16 684 685 Vella F 505n20 507 534n3 771n15 772 772n17 776 Vetterling W 705n47 833n11 919n2 920 928 929 934n23 935n24 935n25 Vilcassim N 728 Vinod H 170 Volker P 155n7 Vuong Q 751 772n17 779 Vytlacil E 697n30

    Winkelmann R 740n66 744 745n71 774 790n34 Winsten C 273 274 275 318 325 326 360n20 Wise D 680n14 725 761n8 Witte A 693 761 Wong W 707 Wood D 368 369 369n36 950 Wooldridge J 14 112n9 245 245n32 246n34 382n6 403n18 691n28 740n66 Working E 378 386n9 Wu D 80 82 83 414n27 Wynand P 714n53

    Y
    Yaron A 537n7 542 Yatchew A 450 451 452 679 Yoo B 656

    Z
    Zarembka P 173n3 Zavoina R 683n16 684 685 736 Zeger S 690n26 Zellner A 125 190 341n4 342 344n11 381n5 406 430n2 431n3 432n4 434n6 436n11 438 439 444 449 498 499 505 878n1 923n7 949 Zimmer M 669 670 786 787 Zimmermann K 683n16 684 685

    W
    Wald A 74n8 260 636 Waldman D 224 768

    Greene 50240

    gree50240 Sub Ind

    July 16 2002

    21 31

    SUBJECT INDEX

    A
    accelerated failure time models 796 acceptance region 892 ACF See autocorrelation function ACF addition 804 805 809 823 ADF GLS procedure 645 adjusted R squared 34 36 40 159 See also coefficient of determination R2 adjustment equation 568 admissible 389 395 age correlation with education income 9 labor force participation model 681 probit model example 669 regression approach 665 as socioeconomic variable 54 aggregation bias 341 Ahn Schmidt estimator 313 Aitken's Theorem 207 208 Akaike information criterion 36 159 160 565 589 644 algorithms BFGS algorithm 170 BHHH algorithm 242 795 EM algorithm 774 iterative algorithm 935 943 Metropolis Hastings algorithm 445 446 Oberhofer Kmenta algorithm 299 variable metric algorithm 939 Almon lag 566 almost sure convergence 900 901 905 alternative estimators 180 189 alternative hypothesis 95 98 681 analog estimation 536 analysis of behavior 7 analysis of covariance 118 120 analysis of variance 33 38 867 analytic function 926 APC average propensity to consume 2 AR 1 model alternatives to 609 autocorrelation and 581 583 disturbance processes 257 259 Ergodic Theorem 260 gross correlation and 617 Grunfeld investment model 332 linear least squares 622 panel data model 317 restrictions and 584 585 serial correlation 273 274 spectral density function 626 627 stability condition 573 testing unit roots 636 637 AR 2 model ACF and PACF example 622 623 restrictions and 584 585 serial correlation 272 274 stationarity requirement 613 testing common factor restrictions 586 ARCH in mean model 240 242 ARCH model 216 238 242 244 ARDL See autoregressive distributed lag ARDL model Arellano Bond and Bover estimator 308 ARFIMA model 647 649 ARIMA model 632 ARMA model See autoregressive moving average ARMA model associative law 806 assumption 1 163 asymptotic covariance matrix BHHH algorithm 795 BHHH estimator 507n22 512 741 773 797 CES production function 129 defined 915 estimating 71 77 78 GLS estimator and 212 321 342 GMM estimator 140 204 409 410
Hausman test 302 instrumental variable estimator 79 Lagrange multiplier test 489 492 MLE 347 498 500 672 673 688 MPC example 110 MSL estimation 514 OLS estimation 216 217 323 parameter estimator and 169 Phillips curve example 569 production function example 499 QMLE 673 674 robust estimation of 198 201 stochastic frontier model 504 theorem 184 Wald statistic 550 weighting matrix and 312 asymptotic distribution delta method 70 of empirical moments 542 of GMM estimator 543 in GR model 196 197 with independent observations 68 large sample distribution theory 914 918 least squares estimator and 105 nonlinear instrumental variables estimator 183 theorem 77 asymptotic efficiency defined 70 71 916 MLE property 473 479 480 as statistical property 460 asymptotic expectations 917 918 asymptotic moments 918 asymptotic negligibility 264 asymptotic normality consistency 72 definition 916 GMM estimator 543 of least squares 195 260 least squares estimator 69 M estimators theorem 464 MLE property 473 478 479 nonlinear least squares estimator 167 169 regression estimation 621 Slutsky Theorem 184 as statistical property 460 stochastic regressors 74 asymptotic properties assumptions 461 463 defined 885 of estimators 464 465 of instrumental variables estimator 196 197



    least squares 65 74 194 196 265 267 MCSE 688 method of moments estimator 531 533 MLE 476 482 493 689 of parameter estimators 65 of regression models 72 asymptotic uncorrelatedness 264 asymptotic variance 479 482 attenuation 83 90 761 attributes 720 augmented Dickey Fuller test 643 646 658 autocorrelation See also nonautocorrelation ARDL models 571 581 582 defined 15 of disturbances 545 Durbin Watson test and 126 estimation of models with 274 276 forecasting and 279 280 generalized regression model and 191 GLS estimator and 209 GNP deflator 634 Grunfeld investment model 333 labor force participation study 710 long memory models 647 models with 195 negative autocorrelation 251 253 647 panel data 317 318 324 326 stationarity assumption and 258 259 of stationary stochastic process 614 616 SUR model 360 362 testing for 268 271 time series data and 192 255 256 autocorrelation coefficient 256 autocorrelation function ACF AR 2 model 622 624 of AR process 618 correlogram as counterpart 621 gross correlation 617 moving average process and 616 spectral density function 626 stationary stochastic processes 614 autocorrelation matrix 256 autocovariance 256 264 626 630 autocovariance at lag k 612 625 autocovariance function 614 625 autocovariance matrix 256 autoregression See AR 1 model univariate autoregression vector autoregression VAR autoregressive fractionally integrated moving average ARFIMA model 647 649 autoregressive conditional heteroscedasticity 238 246 autoregressive distributed lag ARDL model 571 579 660 autoregressive form 257 563 611 autoregressive integrated moving average ARIMA model 632 autoregressive moving average ARMA model ADF GLS procedure 645 ARCH model and 240 autocorrelation of stationary stochastic process 614 616 frequency domain 624 631 nonstationary processes 631 632 parameters for univariate time series 621 624 partial autocorrelation 617 619 stationarity and invertibility 611 614 stationary stochastic processes 609 611 univariate
time series 619 621 Yule Walker equations for 616 average propensity to consume APC 2


    B
    bandwidth 454 456 458 Bartlett window 628 645 basis for a vector space 813 basis vectors 811 Bayes factor 153 438 439 Bayesian estimation 290 318 426 429 439 461 512 Bayesian information criterion 36 152 153 160 589 644 Bayes theorem 429 430 437 443 behavioral equations 380 Behrens Fisher problem 133n11 Ben Akiva measure 719 Bernoulli distribution 855 Bernstein von Mises Theorem 447 best linear unbiased BLU 193 best linear unbiased estimator BLUE 890 beta consistency of least squares estimator 66 67

    Gauss Markov theorem 48 least squares and 84 beta distribution 855 beta function 928 beta kernel 455 beta parameter 65 between groups estimator 289 290 BFGS algorithm 170 BFGS method 939 BHHH algorithm 242 795 BHHH estimator asymptotic covariance matrix 741 773 797 condition moment tests 507 example 482 GMM estimation 550 hypothesis tests and 673 678 Lagrange multiplier test 490 492 500 769 likelihood test 491 MLE and 481 production function example 499 pseudo MLE 246 520 two step MLE 511 512 bias aggregation bias 341 biased test 894 fixed effects models 697 least squares estimator and 679 measurement error and 85 model and 160 omission of relevant variable and 148 149 pretest estimator and 150 in sampling 673 simultaneous equations bias 379n3 396 testing aggregation bias 341 truncated regression model 761 biased estimator 150 binary choice models bivariate probit models 710 719 data 951 dynamic binary choice models 708 710 estimation and inference 670 689 estimator comparisons 705 goodness of fit measure 683 686 hypothesis testing 676 678 individual effects example 700 latent regression 668 670 log likelihood function 688 marginal effects 674 676 712 713 maximum score estimator 702 704 MSL estimation 514 517


    canonical correlation 657 659 capital asset pricing model CAPM 240 339 351 357 Cauchy Schwartz inequality 92 904 causality 381 382 590 593 CDF See cumulative distribution function CDF censored data estimation 766 768 normal distribution 762 764 specification 768 773 tobit model 764 766 censored regression tobit model 764 766 censored variables 762 763 censoring applications 774 780 dependent variables 761 heteroscedasticity and 768 model for counts 773 774 central limit theorem See also Lindberg Feller central limit theorem Lindberg Levy central limit theorem Lyapounov central limit theorem chi square and 108 consistent estimation 527 convergence to normality 262 265 dependent observations and 260 depicted 911 GMM estimation 542 Gordin's central limit theorem 265 large sample distribution theory 908 913 limiting normal distribution 532 martingale difference 263 273 463 542 moment condition tests 506 507 random variables 859 central moments 529 848 CES constant elasticity of substitution 129 162 ceteris paribus analysis 28 characteristic equation 574 614 825 characteristic roots 825 827 830 831 characteristics 720 characteristic vectors 825 827 830 Chebychev's inequality 848 898 903 Chebychev theorem 463 900 901 912 chi squared distribution degrees of freedom and restrictions 110 678 distribution theory 851 853 Lagrange multiplier statistic 177 489 Lagrange multiplier test 224 noncentral chi squared distribution 487n8 normal disturbances and 104 105 statistical tables 955 Wald criterion 96 302 chi squared statistic degrees of freedom 155 Hausman test and 699 testing restrictions 172 choice based sampling 673 730 choice models 719 735 Cholesky decomposition 922n4 Cholesky factor 445 446 832 932 Chow test 130n8 132 133 135 136 139 681 Civil Aeronautics Board 118 classical regression Bayesian analysis 430 434 estimator case 204 gross correlation and 617 heteroscedasticity and 314 323 homoscedastic disturbance and 215 marginal effects 560 nonstochastic
regressors and 590 591 normal linear model 872 873 ordinary least squares and 341 panel data 289 posterior odds for 438 439 weighted least squares 240 closed form solution 934 CMLE conditional maximum likelihood estimator 699 Cobb Douglas model LAD estimation 449 450 log linear model 12 Nerlove's study and 125 126 nonnormal disturbances 502 production function example 102 104 498 499 systems of equations 363 365 translog cost function 366 367 Cochrane Orcutt estimator 273 275 318 360n20 coefficient of determination R2 See also adjusted R squared analysis of variance 34 867 classical regression 686 comparing models 37 38 constant term and 36 37 depicted 33 hypothesis testing 678 Lagrange multiplier as 680 940

    binary choice models continued multivariate probit models 710 719 proportions data 686 689 random utility models 670 regression 665 668 sample selection 713 714 semiparametric analysis 700 702 semiparametric estimation 452 453 704 706 specification tests 679 683 binary variables categories 117 118 groupings 118 120 marginal effect 740 probability model example 676 in regression 116 117 spline regression 121 122 binomial distribution 856 bivariate distribution 781 782 863 868 bivariate probit models 710 719 bivariate random variables 862 864 bivariate regression 22 23 453 block diagonal matrix 823 824 Boot de Witt data 348 bootstrapping computation 920 924 925 discrete choice models 702 703 inference 113 lagged variables 579 595 600 box and whisker plot 879 882 Box Cox regression model 498 500 501 Box Cox transformation 171 173 175 179 Box Jenkins methods 619 621 649 Box Ljung statistic 274 276 623 624 Box Pierce statistic 271 274 276 622 624 Box Pierce test 269 270 Breusch Godfrey test 269 270 Breusch Pagan LM test 223 225 769 Broyden Fletcher Goldfarb Shanno BFGS method 939 Broyden's method 939 Bureau of Economic Analysis BEA 282 Butler and Moffitt method 692 694 700 715

    C
    calculus matrix algebra 837 845 CAN estimators 460 473 917 CAN functions 480n3


    multiple regression 35 nonlinear setting and 943 significance of regression 54 55 theorem 34 coefficients changes in subsets 132 133 as elasticities 123 individual regression 27 linear restrictions on 122 significant effects 739n65 testing hypotheses about 50 52 676 cofactor 817 840 cointegrating rank 652 659 cointegrating vectors 650 652 653 659 cointegration common trends 653 654 German money demand 657 660 long memory models 647 testing for 655 657 VAR representations 654 655 cointegration relationship 658 collinearity 154 470 column rank 814 815 column space 814 815 column vector 803 816n3 common factor 583 586 common factor model 278 279 common trend 653 654 compactness 461 completeness condition 384 complete systems 379 comprehensive model 153 computation optimization and 919 946 concavity 461 462 840 concentrated log likelihood 349 495 498 940 941 945 946 conditional density 427 530n1 conditional distribution 864 867 conditional likelihood function 482 483 698 699 conditional logit model 720 723 724 729 735 conditional maximum likelihood estimator CMLE 699 conditional mean 14 15 457 676 713 864 conditional moments relationships 865 867 tests 505 508 743 771 conditioning bivariate distribution 864 867 condition number 56 58 829 confidence interval confidence interval test 491 impulse response function 595 latent class model example 442 for linear combination of coefficients 53 54 nonlinear consumption functions 172 normal distribution and 55 normal mean 891 892 for parameters 52 53 Phillips curve example 570 prediction interval and 111 tests based on 895 896 confidence level 891 conformable for addition 804 conformable for multiplication 805 congruential generators 921 conjugate prior 435 consistency asymptotic efficiency of estimators and 71 asymptotic normality 72 criterion function and 462 GMM estimator and 204 least squares estimator 66 67 679 as LS property 518 maximum likelihood estimator and 690 mean of functions 900 M estimators theorem 464
MLE property 473 477 478 nonlinear least squares estimator 167 169 nonlinear restrictions and 109 of OLS in generalized regression model 194 regression estimation 621 of sample mean 899 of s squared 69 as statistical property 460 stochastic regressors 74 superconsistent 572 consistent estimator See also White estimator distribution theory 899 estimation frameworks 463 GMM 526 533 least squares 70 simultaneous equations models 397 398 405 consistent test 894 895 constant elasticity 11 12 constant elasticity of substitution CES 129 162 constant returns hypothesis 103 104 126 constants coefficient of determination and 36 37 dummy variables as 482 gasoline consumption study 132


    observations with 272 prediction and 111 random effects model and 694 regression and 15 28 40 constant variance See homoscedasticity constrained optimization 842 843 constraints 941 942 consumption application data set 946 947 cointegration in 650 652 656 economic analysis and 624 as economic variable 631 error correction and 580 macroeconomic model 380 as macroeconomic variable 649 permanent income model of consumption 8 525 548 rational lag model example 575 relationship to income 3 8 9 consumption function binary variables and 117 118 Cox test for 157 example 33 Hausman test for 83 J test for 155 Keynes 1 2 8 9 587 least squares and 75 macroeconomic models 380 381 nonlinear 171 173 nonlinear instrumental variables estimator 183 contagion property 859 continuous variables 845 846 857 858 contrasts 291 convergence assessing 943 in distribution 906 908 of empirical moments 541 forms of 900 903 of functions 903 904 in lower powers 902 in mean 902 905 of moments 203 260 262 905 to normality 262 265 in probability 897 903 in quadratic mean 897 to random variables 904 905 convex 840 correlation 56 712 861 862 879 See also autocorrelation serial correlation correlation matrix 879 correlogram 621 cosine kernel 455 cosine law 819


    production functions 284 time effects and 283 cumulated effect 561 cumulated multipliers 416 cumulative distribution function CDF computing integrals 926 927 discrete choice models 692 693 710 hazard function 792 limiting distribution 906 probability theory 846 857 sampling distributions 882 truncated normal distribution 757 CUSUM test 135 139 cyclical variation 624 definite matrices matrix algebra 834 837 degree of inconsistency 87 degree of truncation 759 degrees of freedom adjusted R2 35 chi squared distribution 678 chi squared statistic and 155 distributions with 853 854 least squares and 566 maximum likelihood estimators and 493n12 number of restrictions and 110 partial correlation coefficient 35 restrictions and 489 593 sample periodogram and 627 testing hypotheses 106 delta method ARDL model and 573 asymptotic distribution of function 70 CES production function 129 impulse response function 595 large sample distribution theory 913 914 marginal effects and 674 Phillips curve example 569 standard error and 128 172 173 175 776 stochastic frontier model 504 theorem 527 two step estimation 188 demand elasticities of demand 12 52 53 for gasoline 570 571 macroeconomic model example 380 for money 657 660 demand equations common structures 339 340 deterministic relationships 7 example 378 inverse demand equations 7 8 multivariate regression model 362 stability of 658 testing model instability 660 demand system 364 DeMoivre's theorem 625 density hazard function 792 negative binomial model 745 parametric estimation 427 probability density function 468 properties of 474 475 of truncated random variable 757 dependent observations 73 74 260 541

    cost function airline production example 286 data sets 948 950 example of flexible 174 functional form 126 groupwise heteroscedasticity 236 237 nonparametric example 458 459 translog cost function 366 369 count data censoring and truncation 773 774 defined 663 discrete choice models 740 752 covariance of disturbances 15 16 estimation and inference 879 identification through restrictions 394 395 joint distribution 861 862 theorem 865 covariance matrix estimation 940 estimation and inference 879 inference and prediction 100 least squares estimator 48 217 219 for ordinary least squares 219 221 probability theory 869 covariance stationarity 612 covariance stationary 254 612 covariance structures 286 314 320 334 covariates 7 Cox statistic 156 157 Cox test 155 159 682 683 See also sum of squared residuals CPI Dickey Fuller test 638 inflation studies 600 602 investment equation 21 restricted investment equation 98 Cramer Rao lower bound 429 473 479 480 493 889 890 Cramer Wold device 908 criterion function asymptotic property 461 462 for estimation 704 GMM estimations 537 critical region 892 cross sectional data covariance structures for 320 334 estimation 878 heteroscedasticity 215 238 limitations of 284

    D
    data behavior of 463 data problems 56 61 default data 952 deseasonalizing 118 duration data 790 792 economic analysis and 624 education data 953 frequency data 723 individual data 686 linear transformations exercise 39 and methodology 4 5 ordered data 736 740 well behaved data 478 483 data generating process DGP assumptions 17 65 66 computation and optimization 920 923 estimation and inference 880 generalized method of moments 533 linear regression model and 10 18 nonlinear model 163 164 nonstationary processes 635 probability density function 468 random variables 845 regression model and 72 data series 131 132 262 data sets 283 946 953 Davidon Fletcher Powell DFP method 938 939 942 945 Deaton statistic 156 decomposition singular value decomposition 833 symmetric matrix 835 of variance 48 625 628 866

    Subject Index dependent variables See also independent variables binary choice model 665 censoring 761 762 computing considerations 37 de ned 837 discrete choice models 663 664 jointly dependent variables 379 lagged response 571 least squares estimator and 42 linear regression models 7 9 maximum likelihood estimator 686 measurement error 84 Phillips curve example 569 prediction and 111n8 price and quantity as 8n1 regression model and 33 transformation of 174 uncommon usage 673 variations as deviations from mean 31 depreciation 84 derivatives computing 933 of empirical moments 541 maximum likelihood estimation 840 probit and logit models 675 Slutsky theorem 668 deseasonalizing data 118 determinant 816 817 823 830 840 deterministic relationship 2 7 8 detrending 635 636 deviance 742 deviations correlation of 15 of costs 125 from means 824 from production function 502 variation in dependent variable 31 DGP See data generating process DGP diagonal matrix 803 816 823 827 Dickey Fuller test 602 637 646 655 656 difference operators 562 564 differencing integrated processes 631 632 long memory models 647 manipulating series via 649 and white noise 635 differentiation Taylor series 837 840 digamma function 928 dimensions 803 discrepancy vector 96 101 108 discrete 845 discrete choice models See also binary choice models count data models 740 752 features 663 664 logit models 719 735 ordered data 736 740 discrete Fourier transform 630 discrete population 922 discrete random variables 845 855 856 discriminant analysis 685n19 disposable income 575 distributed lag autoregressive distributed lag models 571 579 form 563 573 marginal propensity to consume 109 models with lagged variables 565 571 distribution See also gamma distribution contagion property 859 degrees of freedom 853 854 of function of random variable 856 858 heterogeneity in 72 idempotent quadratic forms 874 for logit model 667 parameters of 527 531 of standardized normal vector 876 distribution theory central 
limit theory and 262 large sample 896 919 probability and 845 877 distributive law 806 disturbance ARDL models 582 assumption 73 asymmetrically distributed 71 autoregression and 609 bootstrapped 579 GMM estimation 545 heteroscedasticity and 191 222 independent variables and 10 14 as innovations 610 maximum likelihood estimation 679 nonnormal disturbances 501 505 normal distribution and 17 518 population regression and 19 serial correlation 256 259 stable relationships and 8 as stationary 611 SUR model linkages with 342 and testing 104 108 variances and covariances 15 16

    as white noise 643 zero mean of 163 disturbance variance 133 dominant root 417 dot product 805 double length regression 243 downhill simplex 935 duality theory of 125 dummy variables computing marginal effects 668 as constants 482 criterion function and 462n27 in earnings equation 116 117 elasticities and 123 fixed effects models and 695 LSDV and 289 probability model example 676 in production of airline services study 118 120 specification issues 768 treatment effect 788 dummy variable trap 118 duration dependence 794 duration model 756 790 798 Durbin's test 271 Durbin Watson statistic 275 276 582 583 958 Durbin Watson test 126n6 270 271 dynamic equation 573 576 dynamic models binary choice models 708 710 lagged effects in 560 562 methodological issues 579 586 properties of 415 420 simultaneous equation models 380 dynamic multipliers 415 417 420 dynamic panel data model 75 307 314 551 555 dynamic regression 75 dynamic regression models 558 564

    E
    earnings equation 55 116 117 See also income econometric model 1 4 125 379 380 482 483 544 547 econometrics See also capital asset pricing model CAPM Almon lag 566 computation in 925 933 data analysis and 284 defined 1 GMM estimation in 447 448 growth in field 4 5 identification in 621 model concept 160 QR models 689 strong stationarity 612n5

    Subject Index of substitution 12 translog cost function 368 travel mode choice example 733 U S manufacturing example 369 EM algorithm 774 empirical moment equation 202 534 541 encompassing principle 153 endogeneity 379 381 382 endogenous variables distinction 381 dynamic models 416 nonlinear model example 404 VARs and 587 Epanechnikov kernel 455 equality null hypothesis 289 of row and column rank 815 equations See also systems of equations adjustment equation 568 characteristic equation 574 complete systems of 379 stability of dynamic equation 573 576 equilibrium adjustment to 418 420 dynamic models and 560 and substitution 594 equilibrium condition 378 380 390 equilibrium error 579 652 654 equilibrium multiplier 416 417 562 equilibrium relationship 579 655 ergodic 74 262 621 ergodicity 261 262 559 621 Ergodic theorem 260 262 273 541 621 Erlang distribution 854 error correction 579 581 650 error correction model 654 655 659 660 error function 926 estimable parameters 469 estimation See also methods of estimation parametric estimation of ARDL model 572 573 based on orthogonality conditions 534 536 in binary choice models 670 689 censored data 766 768 change points and historic events 142n19 cointegration relationships 658 ergodicity and 261 exercise 39 40 in nite sample 885 888 GLS 207 211 370 371 hypothesis testing and 95 inference and 877 896 with informative prior density 435 437 instrumental variable approach 308 investment equation 21 24 least squares 42 48 49 266 267 linear regression models and 93 minimum distance estimators 205 of models with autocorrelation 274 276 nonparametric estimation 453 459 parameters for univariate time series 621 624 parameters of distributions 527 531 properties 460 465 qualitative choices and 664 in selection model 784 787 semiparametric estimation 447 453 serial correlation 273 277 standard error of 49 128 of SUR models 350 351 with unknown parameters 227 232 VAR 588 589 597 600 estimation criterion 427 estimation methods 
See methods of estimation estimators alternative estimators 180 189 asymptotic covariance matrix 69 of asymptotic covariance matrix 198 asymptotic ef ciency and 71 asymptotic properties of 464 465 least squares 41 42 minimum distance estimator 205 206 538 539 statistical properties of 460 statistics as 882 885 truncation and windowing 627 628 within and between groups 289 290 Euler equation 526 527 Euler s theorem 284n4 E Views computer program 244n31 377 exactly identi ed 129 536 548 exactly identi ed model 411 exchange rates data 949 economic analysis and 624 long memory models 647 as macroeconomic variable 649

    economics ceteris paribus analysis 28 data analysis and observation frequency 624 econometrics and 1 economic variables 619n11 631 economies of scale 125 126 284 499 education descriptive statistics example 880 as human capital variable 54 labor force participation model 681 lack of measurement for 84 observable indicators and 87 partial correlation coefficient 28 29 regression approach 665 relationship to income 9 10 study with labor 87 90 threshold effects 120 treatment effects 788 efficiency asymptotic efficiency 70 71 460 461 of FGLS estimator 210 273 of GLS estimator 217 in production of airline services study 118 120 as statistical property 41 efficient 71 79 886 efficient estimator covariance of 301 generalized regression model 210 least squares as 572 maximum likelihood estimation 470 472 526 serial correlation 271 273 simultaneous equation models 414 efficient scale 92 efficient score test 489 efficient two step estimator 244 efficient unbiased estimator 886 888 890 eigenvalue 659 827 elasticity coefficients as 123 cointegrating vector example 653 constant elasticity 11 12 of demand 12 52 53 estimates 7 lagged variables and 570 money demand example 658 MPC and 109 partial adjustment model and 568 of probabilities 723

    Subject Index purchasing power parity theory 650 exclusion restriction 102 348 349 394 404 exogeneity of GDP 659 of interest variable 658 linear regression model assumptions 10 42 long run models 659 nonlinear model assumption 164 regression model 72 vector autoregression 590 592 exogenous variables assumption 542 cointegration 652 in context of models 591 distinction 381 duration models 796 797 forecasting and 576 identi cation and 708 labor force participation example 709 macroeconomic model example 380 simultaneous equation models 379 speci cation tests 414 expansion by cofactors 817 expectation 558 567 865 904 expectations augmented Phillips curve 251 expenditure system 362 explained variable 7 explanatory variable 7 exponential distribution depicted 910 distribution theory 855 likelihood functions 888 889 limited models 771 794 exponential family 529 530 ex post forecast 113 extended product 807 extramarital affairs data 952 extremum estimator 461 463 520 Cobb Douglas model 103 xed effects 292 gasoline consumption study 136 137 571 hypothesis testing 177 least squares 83 95 99 220 linear model and 175 maximum likelihood estimation 496 normal disturbances and 104 105 robust estimation and 200 signi cance tests for restrictions 175 177 SUR model example 350 testing common factor restrictions 586 testing hypotheses 106 114 testing joint signi cance 82 Wald test and 593 F test 130 592 632 factoring matrix 832 833 fast Fourier transform FFT 631 FDI variables 700 feasible GLS FGLS AR 1 model 273 AR 2 model 274 autocorrelation and 253 317 582 binary choice models 665 generalized regression model 209 211 groupwise heteroscedasticity 236 Grunfeld investment model 331 instrumental variables estimator and 277 multiplicative heteroscedasticity 234 panel data 322 random effects model 296 299 304 restrictions 689 SUR model 344 347 two step estimation 227 228 231 FIML See full information maximum likelihood FIML nal form 416 nite lags 560 565 566 nite sample properties 
estimation in 885 888 least squares 55 56 65 Ljung Box statistic 622 of ordinary least squares 193 194 unbiased estimation 41 rst generation RCM 319 rst order autoregression See AR 1 model rst order condition 840 t measures 159 209

    F
    F distribution distribution theory 851 853 example 908 noncentral F distribution 852 probability theory 875 statistical tables 956 957 F ratio 172 289 F statistic adjusting 685 ARDL model 574 583 Chow predictive test and 132

    fitting criterion 19 fixed effects model binary choice model extensions 690 cost equations 292 lagged dependent variables and 307 panel data 285 287 293 694 700 robust estimation 314 316 flexible functional form 12 366 369 forecast error 111 576 forecasting See also prediction accuracy of 113 adjusted R2 159 ARDL model and 576 579 ARMA models 610 autocorrelation and 279 280 distinction 111n8 as growth industry 5 Klein's Model I 587 macroeconomics and 587n9 model performance 608 prediction and 113 regression analysis and 33 VAR approach 595 foreign exchange markets 238 649 forms of convergence 900 903 Fourier transform 705n47 fractional integration 632n14 647 648 frequency domain 624 631 Frisch Waugh theorem 27 38 39 full column rank 815 full information 396 398 full information maximum likelihood FIML discrete choice models 716 727 joint estimation and 405 Klein's Model I 412 labor supply example 786 method of estimation 407 409 travel mode choice example 732 two step maximum likelihood estimation 508 full rank assumption 542 least squares 21 linear regression model 10 13 14 42 regression model and 72 VAR model and 655 full rank matrices 815 816 full rank quadratic form 875 876 fully recursive model 394 395 397 411 functional form 116 122 124 126 163

    Subject Index random effects model 295 296 316 SUR model and 341 343 weighted least squares 224 generalized method of moments GMM 426 447 559 590 generalized method of moments GMM estimator asymptotic ef ciency 460 CAPM model and 356 consistent estimation 526 533 convenience of 139 demand for money example 143 discrete choice models 690n26 dynamic panel data model 307 314 551 555 as extremum estimator 461 features 533 547 GLS and 371 heteroscedastic regression model 221 identi cation 463 identi cation condition for 203 important results 201 207 joint estimation and 405 Klein s Model I 412 least squares 43 165 LR statistic 593 method of estimation 400 401 409 410 as modeling framework 465 municipal expenditures example 604 nonlinear least squares 169 nonlinear systems 372 373 optimal 140 ordinary least squares 588 probit model with random effects 694 pseudo MLE 246 518 random effects model 308 309 restrictions in 142n18 semiparametric estimation 447 448 serial correlation 268 testing hypotheses 548 551 generalized regression GR model asymptotic distribution 196 covariance matrix and 321 nonspherical disturbances 191 214 ordinary least squares 521 R2 and 209 time series cross sectional data 320 generalized residual 671n8 793 generalized sum of squares 209 211 229 general to simple method 151 152 564 583 589 geometric lag model 566 571 GHK simulator 710 932 933 Gibbs sampler 445 446 922 923 globally concave 840 globally convex 840 global maximum 840 GLS See generalized least squares GLS GMM See generalized method of moments GMM GMM estimator See generalized method of moments GMM estimator GNP GNP de ator 634 647 investment equation 21 23 long memory model example 648 spectral analysis of growth rate 628 631 Godfrey LM test 223 225 golden section method 934n23 Goldfeld Quandt test 223 224 goodness of t 31 38 209 345 goodness of t measure 683 686 741 743 Gordin s central limit theorem 265 GPH test 649 GQOPT computer program 942 GRADE model 703 gradient 171 838 937 
gradient methods 935 939 943 Granger causality lagged variables 587 589 592 593 604 605 simultaneous equations models 382 time series models 658 659 Granger noncausality 591 Granger representation theorem 654n26 Grenander conditions 67 68 194 566n5 grid search 934 GR model See generalized regression GR model grouped data 686 688 689 group means 288 315 group means estimator 290 groupwise heteroscedasticity consistent estimation 317 estimation 327 Grunfeld investment model 333 heteroscedasticity 223 232 235 237 296 panel data 323 325 speci cation issues 768

    G
    gamma distribution computation and optimization 944 945 distribution theory 855 example 530 531 GMM estimation 538 540 limited models 794 MLE 490n9 negative binomial model 745 gamma function 490n9 927 928 gamma regression model 71 129 GARCH model 16 240 245 Gauss computer program 942 Gauss Hermite quadrature 692 929 Gaussian quadrature 928 Gauss Laguerre quadrature 929 Gauss Markov theorem counterparts to 70 71 generalized regression model 208 least squares estimator and 45 47 56 265 prediction and 111 stochastic frontier model 502 503 theorems 47 48 Gauss Newton method 169 942 Gauss s method 448 GD 84P 83 GDP augmented Dickey Fuller test 646 cointegration and 650 652 cointegration in 656 Dickey Fuller test 639 as economic variable 631 exogeneity of 659 long memory models 647 nonstationary series example 632 GEE estimator 690n26 generalized inverse 82 83 833 834 generalized least squares GLS See also feasible GLS FGLS asymptotic covariance matrix 212 321 342 asymptotic normality of 260 autocorrelation 253 ef ciency of 217 ef cient estimation 207 211 271 groupwise heteroscedasticity 235 heteroscedastic regression model 216 217 227 log likelihood function 688 nonlinear systems and 370 371 panel data 321

    Subject Index Grunfeld Boot and de Witt investment model 339 Grunfeld s investment data 329 333 340 Gumbel distribution 720 labor force participation model 682 linear regression model 679 models with 195 multiplicative heteroscedasticity 232 235 239n24 nested logit models 726 ordinary least squares estimation 216 221 random effects model 316 317 robust estimation 198 speci cation issues 768 769 speci cation tests 680 682 structural break and 133 SUR model 360 362 testing 222 225 508 travel mode choice example 733 heteroscedastic logit model 727 heteroscedastic regression applications 232 237 discrete choice models 688 GMM estimator 221 heteroscedasticity 215 216 nonlinear weighted least squares 687 two step estimation of 231 HEV model 733 Hierarchical Bayes estimation 444 447 hierarchical regression 319 histogram 454 880 885 911 homogeneity 699 homogeneity restriction 347 351 700 homogeneous equation system 820 homoscedastic extreme value HEV distribution 727 homoscedasticity CAPM model 356 conditional moment tests 506 de ned 15 groupwise heteroscedasticity 236 237 linear regression 10 42 867 nonlinear model assumption 163 null hypothesis of 224 probability theory 865 regression models 72 testing for 681 682 time series data and 192 travel mode choice example 733 L Hopital s rule 174 500 HPD highest posterior density interval 435 437 hurdle model 749 752 hyperplane 33 38 813 hypothesis testing approaches to 95 104 BHHH estimator 673

    H
    Hansen s test 134 Harvey s model of heteroscedasticity 328 hat matrix 60 Hausman and Taylor estimator 303 304 308 311 Hausman test chi squared statistic 699 instrumental variable and 80 83 least squares 90 nonnormality 771 panel data 301 random effects model 301 303 travel mode choice example 731 hazard function 759 793 794 799 859 hazard rate 792 799 Heckit estimator 784 Hermite polynomials 926 Hermite quadrature 694 745n71 Hessian 838 945 heterogeneity in binary choice model 700 conditioning and 699 in distributions 72 duration models 797 798 hazard function and 799 latent heterogeneity 440n15 modeling 318 negative binomial regression model 744 745 panel data and 283 286 Poisson model 748 random effects model 690 heteroscedastic 2SLS H2SLS 401 heteroscedastic 3SLS H3SLS 411 heteroscedastic extreme value HEV model 733 heteroscedasticity See also groupwise heteroscedasticity classical regression 314 323 conditional moment tests 506 disturbances and 545 estimator case 204 generalized regression model and 191 GLS estimator and 209 GMM estimator 206 Grunfeld investment model 331 household expenditures and 15 Klein s Model I 412 413

    binary choice models 676 678 Cox statistic 156 estimation 465 892 896 GMM estimation 548 551 least squares estimator 48 50 52 linear regression models and 93 maximum likelihood estimation 484 492 nonlinear consumption functions 172 nonlinear regression models 175 180 nonlinear restrictions 108 110 nonnested hypothesis 153 parametric estimation 437 439 t and F tests 632 Wald statistic 327 551 590 741 Wald test 676

    I
    I(1) series cointegration 650 macroeconomic flows 632 structural variables and 636 testing unit roots 637 idempotent 24 25 79 idempotent matrix 808 809 idempotent quadratic forms 836 873 875 identically distributed iid 263 468 477 532 identical regressors 343 344 identification assumption 541 covariance restrictions 394 395 defined 85 469 exogenous variables 708 intrinsic linearity 127 130 M estimator and 463 of model parameters 163 moment equations assumptions 203 of parameters 468 470 problem of identification 385 389 rank and order conditions 389 394 stationary stochastic processes 621 structural VAR model 597 600 identification condition 13 129 542 identification problem See problem of identification identity matrix 803 ignorable case 59 IIA See independence from irrelevant alternatives IIA iid See identically distributed iid impact multiplier 416 561

    Subject Index individual effect 285 700 inequality Cauchy Schwartz inequality 92 904 Chebychev s inequality 848 898 903 Jensen s inequality theorem 477 849 902 904 likelihood inequality 477 Markov s inequality 898 903 inference in binary choice models 670 689 estimation and 877 896 parametric estimation and 427 447 vector autoregression 600 in nite lag model 560 565 571 in nite lags 567 in ation studies ARCH model and 238 CPI and 600 602 in ation data 949 structural VAR and 596 information matrix 479 489 498 890 information matrix equality 474 476 informative prior 432 435 437 439 initial conditions 255 416 708 innovation 254 264 594 610 instability demand for money 142 143 testing 659 660 of VAR model 605 instrumental variables empirical moment equation 541 estimation by 397 398 FGLS and 277 GMM estimator 540 545 548 Hausman s speci cation test and 80 83 method of moments estimation and 201 possibilities with 202 twins study 88 two stage least squares and 74 80 instrumental variables estimator GMM estimator 310 313 least squares 192 197 method of estimation 397 398 nonlinear model 181 183 random effects model 303 306 simultaneous equations model 379 speci cation tests 414 insuf cient observations 131 132 integrals 926 928 929 integrated hazard function 793 integrated of order one 631 integrated processes 631 632 integrated series See I 1 series intelligence lack of measurement for 84 observable indicators and 87 interaction term 123 124 interdependent systems 379 interest rates ARCH model and 238 cointegrating vector example 652 653 exogeneity of variable 658 investment equation and 21 measurement dif culties for 8 testable implications and 93 94 as variable 84 interval estimate 435 885 890 892 intrinsic linearity 127 130 invariance 473 480 invariance of maximum likelihood estimators 359 inverse 836 inverse function 844 921 inverse Gaussian distribution 528 794 inverse matrices 820 822 831 inverse Mills ratio 759 789 inverses 823 824 inverted gamma 
distribution 431 436 invertibility 611 614 invertible polynomial 564 investment equation analysis of variance for 33 34 estimating 21 24 prediction for 111 113 restricted example 98 99 semilog equation in 123 investment model application data set 947 Grunfeld s investment data 329 333 investment data 950 macroeconomic model example 380 testable implications for 93 irrelevant variables 150 151 iterated expectations 865 iteration 171 936 944 iterative algorithm 935 943

    importance function 931 impulse response 587 594 impulse response function 420 593 595 incidental parameters problem 690 697 incidental truncation 780 782 inclusion of super uous variables 148 inclusive value 726 income application data set 946 947 data set 953 descriptive statistics example 880 disposable income 575 earning equation 51 52 income elasticity 53 as independent variable 8n1 kernel density estimator 881 882 Keynes consumption function and 1 2 as macroeconomics variable 649 partial correlation coef cient 28 29 permanent income 8 84 525 548 relationship to consumption 8 relationship to education 9 10 unit root 650 voting behavior and 665 inconsistency degree of inconsistency 87 in Gauss Markov theorem 76 least squares 75 inde nite matrix 835 independence 875 877 independence from irrelevant alternatives IIA 724 726 731 733 735 independent observations assumption 66 asymptotic distribution with 68 de ned 878 regression model and 72 independent variables See also dependent variables de ned 837 and disturbance 10 income as 8n1 lagged response 571 lagged variables and 307 linear regression models 7 8 42 marginal effects for 668 measurement error and 84 regression model and 72 theorem 51 index function model 668 670 695 indicator 87 indirect least squares 396 individual data 686

    J
    J test 154 155 178 jackknife 220 924 925 Jacobian defined 844 distribution of function of random variables 863

    Subject Index limited model 795 MLE 493 497 499 500 probability theory 845 Jensen s inequality theorem 477 849 902 904 joint density 860 864 922 joint distribution 860 864 jointly dependent variables 379 joint posterior distribution 433 autoregressive distributed lag models 571 579 distributed lag models 565 571 Durbin Watson test and 270 dynamic models 558 564 579 586 estimation with 277 forecasting and 111n8 GMM estimator 310 545 panel data and 307 random effects model 308 testing with 270 vector autoregression 586 605 lag length 564 565 589 644 646 lag operator AR 1 process 610 ARIMA model 632 models 562 564 571 594 polynomials in 596 613 Lagrange multiplier LM statistic as alternative 350 autocorrelation and 271 582 CAPM model 353 355 computing 679 940 convenience of 141 count data models 741 GARCH effects 244 GMM 550 551 Grunfeld investment model 332 heteroscedasticity 769 hypothesis testing 177 178 496 678 labor force participation model 682 likelihood function and 494 495 likelihood ratio test and 324 LM test example 500 MLE 501 model based tests 230 negative binomial model 744 Poisson regression model 746 747 problem with approach 139 simultaneous equations models 413 414 SUR model example 351 testing correlation 712 testing for homoscedasticity 681 theorems 489 tobit model 775 Lagrange multiplier LM test application example 492 autocorrelation and 269 271 basis of 177 Box Pierce test and 270 Breusch Pagan LM test 223 225 769 discrete choice models 678 679

    K
    k class 401 403 411n23 Kaplan Meier estimator 797 798 kernel density estimator depicted 911 discrete choice model 704 705 dynamic models 709 estimation 456 465 income 881 882 latent class model example 443 MSL estimation 516 517 nonparametric regression function 706 708 semiparametric estimation 452 453 as substitute for histogram 881 kernel function 706 Keynes consumption function 1 2 8 9 587 Khinchine theorem 69 76 463 527 900 Klein's Model I adjustment to equilibrium 419 comparison of methods 411 413 forecasting and 587 identification 390 stability 417 statistical tables 950 testing overidentifying restrictions 415 knots 120 122 Kolmogorov's theorem 901 902 Kronecker product 342 824 825 Kruskal's theorem 214 343n7 kurtosis 772 848 879

    L
    labor studies application data set 947 with education 87 90 participation example 664 709 710 supply example 782 lack of invariance 110 LAD See least absolute deviations LAD lagged variables autoregression and 610

    functional form 143 GMM 548 551 Grunfeld investment model 330 Hausman test 303 heteroscedasticity 223 224 328 769 hypothesis testing 327 465 484 489 490 inference 100 101 108 115 lack of invariance and 110 linear regression model 495 for log linearity 500 501 maximum likelihood estimation and 298 model based tests 230 Nerlove s study 125 nonnormality 771 omitted variables 680 overdispersion 743 random effects model 298 speci cation tests 682 stationary stochastic processes 609n2 structural changes and 602 time series models 622n13 Wald test and 486 lag weights 562 567 573 lag window 628 latent class model 426 439 443 516 517 latent regression 668 670 latent roots 827 latent vectors 827 laws associative law 806 cosine law 819 distributive law 806 of large numbers 900 902 strong law of large numbers 901 902 weak law of large numbers 900 leading term approximation 917 least absolute deviations LAD bootstrapping 925 limited models 771 as M estimator 465 semiparametric estimation 448 450 least squares See also weighted least squares WLS asymptotic normality of 260 as ef cient estimator 572 estimator 41 42 as extremum estimator 461 FGLS and 211 nite sample properties of 55 56 F statistic 95 99 220 GMM estimation 165 548 goodness of t 31 38

    Subject Index Leibnitz theorem 475 L Hopital s rule 174 500 Liapounov See Lyapounov central limit theorem likelihood equation 472 476 670 likelihood function 431 468 470 472 494 888 889 likelihood inequality 477 likelihood ratio LR statistic as alternative 349 Cox test and 155 GMM 550 593 Grunfeld investment model 332 hypothesis testing 327 678 misspeci cation and 770 MSL estimation 514 SUR model example 350 VAR testing and 593 likelihood ratio LR test application example 490 492 discrete choice models 678 GMM 548 551 heteroscedasticity 230 237 hypothesis testing and 327 484 486 LIML and 413 linear regression model 494 LM statistic and 324 MLE and 329 Poisson distribution 745 Poisson regression model 746 testing hypotheses 465 univariate model and 686 LIMDEP computer program 244n31 377 limited information maximum likelihood LIML discrete choice models 726 example 523 Klein s Model I 412 methods of estimation 396 404 speci cation tests 413 414 travel mode choice example 732 two step maximum likelihood estimation 509 limiting distribution central limit theorem 532 convergence in distribution 906 908 F statistic 106 for function 913 914 probability theory 853 random variables 107 LIML See limited information maximum likelihood LIML Lindberg condition 910 Lindberg Feller central limit theorem distribution theory 909 913 estimation frameworks 463 generalized method of moments 532 542 generalized regression model 195 203 least squares 67 68 serial correlation 262 263 Lindberg Levy central limit theorem distribution theory 909 910 912 estimation 889 generalized method of moments 527 generalized regression model 203 least squares 67n3 MLE 478 506 serial correlation 262 263 linear approximation 837 838 linear association 37 867 linear combination 806 811 linear dependence 811 linear equations 819 822 linear forms 876 877 linear function 838 869 870 873 linear independence 812 linearity 10 13 42 72 linearly deterministic component 619 linearly indeterministic component 
620 linear regression model assumptions 10 18 42 Box Cox transformation and 173 175 characteristics 7 10 classical normal 872 873 coef cient of determination for 37 comparing 152 Cox test and 156 discrete choice models 663 estimated money demand equations 180 example 428 429 F statistic and 175 functions for 93 Gauss Markov theorem 47 48 incidental parameters problem and 697 linear restrictions 94 maximum likelihood estimation 492 496 omitted variables 679 probability theory 866 867 testing for heteroscedasticity 508 theorem 185

    least squares continued groupwise heteroscedasticity 237 inconsistent models for 75 inef ciency of 217 instrumental variables estimation 192 197 lag models and 559 566 least squares regression 19 25 matrix algebra 817 as maximum likelihood estimator 518 as modeling framework 465 optimal linear predictor 43 44 partial correlation coef cients 28 31 partialing out 27 28 partitioned regression 26 27 population orthogonality conditions 42 43 problem 934 testing model instability 660 least squares dummy variable LSDV model autocorrelation and 317 FGLS and 297 xed effects 287 Hausman test 302 panel data 289 return to schooling example 306 time effects and 291 least squares estimator See also nonlinear least squares estimator asymptotic distribution and 105 asymptotic properties 65 74 265 267 bias 679 CAPM model and 356 covariance matrix 217 219 distinctions 80 estimating variance of 48 49 266 267 re order conditions for 165 gamma model 129 Gauss Markov theorem 45 48 GMM estimator and 540 identi cation and 463 large sample properties 167 169 maximum likelihood 156 588 nonstochastic regressors and 45 46 production function 449 450 as robust estimator 590 serial correlation 265 267 tobit model 766 775 truncated regression model 761 twins study 88 unbiased estimation 44 45 151 least variance ratio 402


    linear restrictions on coefficients 122 likelihood function 494 linear regression model and 94 structures and 390 line search 935 936 945 LISREL computer program 919n1 Ljung Box statistic 622 Ljung's refinement 269 270 LM statistic See Lagrange multiplier LM statistic LM test See Lagrange multiplier LM test local maxima 840 local optima 840 location 878 logistic distribution 667 668 855 logit kernel 455 456 logit model derivatives 675 distribution for 667 fixed effects models 698 heteroscedasticity 680 LS estimator 701 for multiple choices 719 735 name origin 687 nonlinear regression models 186 normal distribution 691 as probability model 675 state dependence 708 two step MLE 511 weighted least squares 688 log likelihood function See also concentrated log likelihood AR 1 model 273 binary choice models 688 Butler and Moffitt method 694 fixed effects model 695 groupwise heteroscedasticity 236 maximum likelihood estimation and 326 347 multiplicative heteroscedasticity 233 multivariate regression model 358 probability models and 675 testing hypothesis 681 two step estimation 231 232 loglinear model Cobb Douglas model 12 coefficient of determination for 37 count data 740 Cox test and 156 depicted 11 12 estimated money demand equations 180 gasoline consumption study 132 Lagrange multiplier test 500 501 Nerlove's study 125 126 regression model and 122 123 testing linear specification 179 lognormal distribution 771 794 854 931 lognormal variables 854 longitudinal data sets 283 284 320 Longley data 58 61 948 long run multiplier 561 562 564 loss function 434 loss of fit 95 101 104 lower triangle 803 lower triangular matrix 832 922n4 932 LR statistic See likelihood ratio LR statistic LR test See likelihood ratio LR test LSDV model See least squares dummy variable LSDV model LSQ procedure 919 1 Lucas critique 587 Lyapounov central limit theorem 195 262 263 463 483 542 912


    M
    M estimator 461 463 465 521 M variables 654 655 MA process See moving average MA process macroeconometrics data set 948 distinctions 5 forecasting performance 608 macroeconomics consumption function and 381 deterministic relationships 7 example 380 forecasting and 587n9 unit roots and data 636 vector autoregressions and 586 587 macroeconomic variables rational lag model and 575 time series models 649 VAR model 596 main diagonal 803 marginal distributions 860 871 872 marginal effects binary choice models 665 674 676 705 712 713 binary variable 740 in censored regression model 765 censoring and truncation 774 computing 668

    functional form 124 labor force participation model 682 lagged variables 560 probability model example 676 recursive model 716 tobit model example 766 truncated regression model 760 marginal moments 865 867 marginal probability density 860 marginal propensity to consume MPC Bayesian estimation 437 consumption function 172 173 distributed lag model 109 110 Keynes consumption function 2 Markov Chain Monte Carlo MCMC method 426 444 447 513 920 923 Markov s inequality 898 903 Markov s theorem 69 902 martingale difference sequence 263 273 463 542 martingale sequence 262 Matlab computer program 631 matrices comparing 836 837 condition number of 829 determinant of 816 817 830 diagonalization of 827 factoring 832 833 generalized inverse of 833 834 powers of 830 832 rank of 814 816 827 829 874 spectral decomposition 827 832 trace of 829 830 matrix algebra algebraic manipulation of matrices 803 809 calculus 837 845 Cox test and 156 158 geometry of matrices 809 819 idempotent matrix 808 809 linear equations 819 822 matrix addition 804 805 matrix multiplication 805 807 matrix product rule 904 partitioned matrices 822 825 quadratic forms 834 837 roots and vectors 825 834 sums of values 807 808 systems of linear equations 819 822 terminology 803 two way effects 291n9 usefulness of 23 matrix inverse rule 904 matrix power 830 832 matrix weighted average 290


    nonlinear systems 371 372 overdispersion 743 panel data 326 329 Poisson model 742 predictions and 686 principle of 470 probit models 711 712 properties of MLE 472 483 QMLE 246 673 674 random effects and 299 serial correlation 274 stochastic frontier model 429 structural breaks and 141 SUR model 347 351 357 360 test procedures 484 492 theorem 164 tobit model 775 777 truncated regression model 761 two step estimation 184 508 512 maximum score estimator MSCORE 685 702 706 703n42 maximum simulated likelihood MSL 512 517 693 MCSE minimum chi squared estimator 687 689 mean asymptotic distribution of 915 deviations from 824 estimation 878 of functions 900 of lognormal distribution 931 Monte Carlo study 923 924 one sided test 896 of random variables 847 testing hypothesis 893 895 truncated mean 759 mean absolute error 113 mean lag 562 564 mean square convergence 67 69 897 898 mean squared deviation matrix 702 mean squared error MSE 43 44 150 887 mean value theorem 543 measure of central tendency 847 of closeness 19 condition number 56 57 of linear association 867 of model fit 644 measurement errors in 8 lacking for variables 84 standard deviation units 123 measurement error 75 83 90 median 847 878 916 923 924 median lag 562 MELO minimum expected loss estimator 434 435 method of kernels 706 method of moment generating functions 529 method of moments 429 447 526 533 536 540 943 method of moments estimators 528 531 533 535 See also generalized method of moments GMM estimator method of scoring 672 723 939 941 methods of estimation GMM estimation 400 401 409 410 instrumental variables 75 397 398 limited information 396 404 ordinary least squares 396 397 simultaneous equations models 396 system methods of estimation 404 411 two stage least squares 398 400 Metropolis Hastings algorithm 445 446 MGF moment generating function 859 microeconometrics 5 125 microeconomics 602 605 minimal sufficient statistic 697 minimum distance estimator 205 206 538 539 minimum expected loss MELO estimator 434 minimum mean squared error 43 44 minimum variance linear unbiased estimator MVLUE 44 890 minimum variance unbiasedness 887 minor 817 missing observations 59 60 misspecification 250 251 770 mixed estimation 436n9 mixed logit model See random parameters logit RPL model MLE See maximum likelihood estimator MLE MNL See multinomial logit MNL model MNP multinomial probit model 727 728 models autoregressive distributed lag models 571 579 distributed lag models 565 571

    maximum likelihood estimator MLE See also full information maximum likelihood FIML pseudo MLE applications of 492 508 approximating 768 AR 1 model 273 aspects of 939 941 asymptotic covariance matrix 672 673 688 asymptotic properties 476 482 689 autocorrelation and 275 bias in 697 binary choice models 670 710 712 CAPM model and 357 cautions 239n24 CMLE 699 consistency and 690 dependent variables and 686 determinants and derivatives 840 discrete choice models 663 disturbances and 679 duration models 794 795 efficient estimation 211 470 472 526 estimating probabilities 714 estimation 426 428 example 128 as extremum estimator 461 fixed effects models 697 gamma distribution 530 gamma model 129 GARCH model 242 245 GMM estimation 540 548 549 grouped data 689 groupwise heteroscedasticity 236 237 Grunfeld investment model 331 heteroscedasticity 228 229 identification 463 invariance of 359 Lagrange multiplier statistic 940 Lagrange multiplier test 298 least squares estimator 65 71 156 588 likelihood function 468 470 linear regression model 492 496 log likelihood function 688 maximum simulated likelihood 512 517 MCSE and 688 as modeling framework 465 multiplicative heteroscedasticity 235 nonlinear regression models 496 501


    dynamic models 558 564 579 586 exogenous variables and 591 general to simple strategy 151 152 for panel data 283 286 and prediction 7 selection of 148 161 simple to general approach 151 specification analysis and 148 152 structural form 130 134 382 tests 229 232 tests of stability 134 143 659 660 univariate time series 619 621 vector autoregression 586 605 moment generating function MGF 859 moments asymptotic moments 918 central moments 529 848 conditional moments 505 508 743 771 865 867 convergence of 203 260 262 905 empirical moments 202 541 542 marginal moments 865 867 probability theory 866 868 869 and random variables 614 uncentered moment 527 money demand cointegrating vector 652 653 cointegration 657 660 data 951 example 180 250 instability of 142 143 as macroeconomic variable 649 Monte Carlo integration 715 Monte Carlo methods Bayesian estimator and 430 data sets 920 functional form 141 testing unit roots 637 Monte Carlo studies AR 1 model 274 computation 920 features 923 924 GNP deflator 634 heteroscedasticity 246n35 681 least squares estimator 59 60 replicating data 921 systems estimators 413 White estimator 220 Moore Penrose generalized inverse 83 833 834 most powerful test 893 moving average 240 610 614 See also vector moving average VMA moving average form 258 563 598 611 moving average MA process 257 318 616 618 MSCORE See maximum score estimator MSCORE MSE See mean squared error MSE MSL See maximum simulated likelihood MSL multicollinearity absence of 163 542 data problems 56 59 dummy variables and 118 nonlinear regression models 173 multinomial logit MNL model 720 723 728 732 734 multinomial probit MNP model 727 728 multinormal integrals 690n26 multiple correlation 36 multiple linear regression model 7 10 multiple regression 21 23 35 88 multiplication 805 807 809 823 multiplicative heteroscedasticity 232 235 239n24 multipliers 415 417 420 561 562 multivariate Lindberg Feller central limit theorem 913 multivariate Lindberg Levy central limit theorem 912 multivariate normal distribution 871 877 multivariate normal population 922 multivariate normal probabilities 931 933 multivariate probit models 710 719 multivariate regression model 340 358 362 multivariate standard normal 871 multivariate t distribution 434 436 MVLUE minimum variance linear unbiased estimator 44 890


    N
    naive predictor 685 686 National Institute of Standards and Technology NIST 833n12 National Longitudinal Survey of Labor Market Experience NLS 283 nearest neighbor 457 negative autocorrelation 251 253 647 negative binomial model 744 745 747 774

    negative definite matrix 834 835 negative duration dependence 794 nested logit models 725 727 nested models 93 95 netting out 27 28 Newey West covariance matrix 267 628 Newey West estimator functional form 142 generalized method of moments 544 546 generalized regression model 200 201 206 panel data 316 regression equations 373 serial correlation 280 Newton's method computation 937 939 944 945 discrete choice models 672 696 723 741 limited models 767 797 New York Stock Exchange 240 Neyman Pearson methodology 153 892 NLS National Longitudinal Survey of Labor Market Experience 305 nonautocorrelation assumptions 324 325 CAPM model 356 defined 15 error correction and 581 nonlinear model assumption 163 regression models 10 42 72 noncentral chi squared distribution 487n8 noncentral F distribution 852 853 nonconstructive test 223 nonhomogeneous equation system 820 822 noninformative prior 431 noninvariance of Wald test 110 nonlinear instrumental variable estimator 183 545 nonlinearity 122 130 nonlinear least squares asymptotic properties of 196 geometric lag model and 568 as modeling framework 465 nonlinear regression models 496 two step estimation 183 186 nonlinear least squares estimator computing 169 170 consistency of 168 production function example 499 properties 193 196 solving explicitly 934n21


    normal distribution censored data 762 764 conditional normal distributions 871 872 confidence intervals and 55 depicted 50 disturbances and 17 164 518 features 849 850 information matrix 479 likelihood function 472 888 889 limiting for function 913 914 linear regression model assumptions 10 42 logit model 691 mixtures of 529 MSL and 693 multinomial models and 727 728 nonlinear restrictions and 109 normit 687 probit model 666 sample moments and 203 sampling and 531 spherical disturbance and 16n3 theorem 876 truncated normal distribution 757 VAR testing and 593 normal equations 21 24 normal gamma prior 436 normality assumptions 17 50 55 110 Breusch Pagan LM test 224 Butler and Moffitt method 693 central limit theorem 262 265 least squares 65 linear regression model 17 selection model 789 t distribution value and 106 VAR testing 590 Wald statistic and 110 356 normalization 163 383 389 390 470 669 825 normit 687 null hypothesis ARCH model 244 average log density 155 Box Pierce test 269 CAPM model 354 Cox test 158 CUSUM test 136 of equality 289 F statistic 106 groupwise heteroscedasticity 236 Grunfeld investment model 331 332 Hausman test 301 771 of homogeneity 699 of homoscedasticity 224 of interest 80 81 Lagrange multiplier test 299 model constancy 141 no natural 153 nonconstructive test 223 normal distribution 105 significance test for restrictions 176 structural break and 133 test equivalence 484 testing 323 testing restrictions 95 98 Type I and II errors 892 Wald statistic and 133 590 null matrix 804

    nonlinear models alternative estimators for 180 189 applications 171 175 Cox test and 156 error correction and 580 general forms 162 171 hypothesis testing and parametric restrictions 175 180 industry structure example 404 maximum likelihood estimation 496 501 Poisson model as 740 nonlinear restriction 104 108 110 130 nonlinear systems GLS estimation 370 371 GMM estimation 369 373 maximum likelihood estimation 371 372 simultaneous equation models 382n6 two stage least squares and 403 404 nonlinear weighted least squares 687 nonnegative definite matrix 832 834 836 nonnested models choosing between 152 159 specification tests 682 683 testable implications and 94 test statistic for 751 nonnormality large sample tests and 104 108 specification issues 771 773 nonparametric estimation econometrics literature 708 estimation frameworks 425 453 459 extremum estimators and 461 nonparametric regression 457 459 nonpositive definite matrix 834 835 nonsample information 388 394 nonsingular matrix 821 nonspherical disturbances 191 214 314 318 nonstationarity 632 647 650 nonstationary process 631 649 nonstochastic regressors ambiguity and 590 591 data generation process 16 finite sample properties theorem 193 least squares estimator and 45 46 nonstructural models 379n2

    O
    Oaxaca's decomposition 53 54 Oberhofer Kmenta algorithm 299 Oberhofer Kmenta conditions 347 349 observationally equivalent theories 385 386 observations See also independent observations with constant terms 272 dependent observations 73 74 260 541 deterministic theories and 3 dummy variables and 117 economic analysis and 624 exercise 39 Goldfeld Quandt test 223 identically distributed 468 insufficient observations 131 132 missing observations 59 60 panel data set 72 73 Poisson regression model 745 time series process and 254 weighting of 128 OLS See ordinary least squares OLS Olsen's reparameterization 767 omission of relevant variables 148 149 151 omitted variable formula 148 149 omitted variables 673 679 680 one period ahead forecast 576 one step ahead prediction error 135 one step estimation 939 940 one to one function 844 OPG outer product of gradients estimator 481 optimality 663 optimal linear predictor 43 44 optimization computation and 919 946 constrained optimization 842 843


    criterion function and 461 matrix algebra 840 order 803 837 918 919 order condition 203 389 394 404 542 ordered choice models 719 ordinary least squares OLS absolute value and 768 Aitken estimator and 207 classical regression 341 common factor restrictions and 585 FGLS and 322 finite sample properties 193 194 GARCH model and 244 generalized regression model 521 GLS estimator 342 343 GMM estimator 221 588 groupwise heteroscedasticity 235 236 Grunfeld investment model 331 heteroscedasticity 216 221 Klein's Model I 411 413 maximum likelihood estimation 211 method of estimation 396 397 as method of moments estimator 535 multiplicative heteroscedasticity 234 random effects model and 316 rational lag model example 575 standard error 323 superconsistency 656 testing unit roots 637 truncated regression model 761 White estimator 220 orthogonality conditions disturbances and 167 estimation based on 534 536 GMM estimator 182 314 409 540 545 method of moments estimation and 201 overidentification by 548 sum of squares and 164 165 orthogonal random variables 57 59 orthogonal regression 23 orthogonal vectors 818 827 orthonormal quadratic form 873 outcomes 845 outer product matrix 828 outer product of gradients OPG estimator 481 outliers 60 output 650 656 951 overdifferencing 647 overdispersion 743 744 746 751 779 overidentification 414 415 overidentified 130 548 overidentifying restrictions 175 548


    P
    p value 649 PACF partial autocorrelation function 618 622 624 panel data See also dynamic panel data model characteristics of 192 covariance structures 320 334 data set 878 estimated models 701 fixed effects model 287 293 694 700 instrumental variables estimator 303 306 microeconometrics and 5 models for 283 286 nonspherical disturbances and robust covariance matrix 314 318 for Poisson model 747 749 random coefficients models 318 319 random effects model 293 303 689 694 samples in 72 73 state dependence and 708 Panel Study of Income Dynamics PSID 283 305 709 parameters See also random parameters comparison of estimators 128 confidence intervals for 52 53 criterion function and 461 of distributions 527 531 538 540 empirical moment equation and 202 estimable parameters 469 estimation with unknown 227 232 exactly identified 129 function of one parameter 943 944 hypothesis testing and 93 identifiability of 462 463 identification of 468 470 Lucas critique 587 overidentified 130 parameter space 427 point estimation 885 890 precision parameter 480 probability limits 526 527

    restrictions 175 180 system of demand equations 362 two step estimation and 183 186 univariate time series 621 624 parameter space 94 427 461 483 parameter vectors condition moment tests 507 functional form 130 131 Gauss Newton method 169 GMM estimation 540 identifiability 163 identification 463 latent class model example 442 LM test statistic 678 MSL estimation 513 parameter space and 461 structural break tests 133 testing model instability 659 two step MLE 508 Wald criterion and 139 parametric estimation Bayesian estimation 429 439 classical likelihood based estimation 428 429 estimation frameworks 425 hierarchical Bayes estimation 444 447 hypothesis testing 437 439 interval estimation 435 latent class model 439 443 point estimation 434 435 parametric models 192 451 708 792 798 partial adjustment model 568 partial autocorrelation 617 619 partial autocorrelation function PACF 618 622 624 partial correlation 36 partial correlation coefficients 28 31 partial differences 272 partialing out 27 28 partial likelihood 799 partially linear model 450 partially linear regression 450 452 partial regression coefficients 28 31 partitioned inverse 100 695 824 partitioned matrices 822 825 partitioned regression 26 27 118 300 Parzen kernel 455 Parzen window 628 PC GIVE computer program 409n22 pdf probability density function 468 857 882 906 PDL polynomial distributed lag 566 permanent income 8 84 525 548


    precision parameter 480 predetermined variables 380 382 393 prediction linear regression models and 93 maximum likelihood estimator 686 models and 7 nonlinear models 186 with probit model 686 regression and 111 114 prediction criterion 36 160 prediction interval 111 576 prediction variance 111 112 predictive test 137 139 premultiplication 805 pretest estimator 149 150 152 price variable cointegrating vector example 652 653 deflator data 951 as dependent variable 8n1 economic analysis and 624 as economic variable 631 as macroeconomic variable 649 principal components 58 principal minor 817n4 prior beliefs 430 435n7 prior distribution 431 432 prior odds ratio 438 prior probabilities 438 probability convergence in 897 905 distribution theory 845 877 elasticities of 723 size of test and 893 probability density function pdf 468 846 906 930 probability distribution nonlinear model 163 164 ordered data 738 population regression 19 probability theory 845 846 representations of 858 859 specific 849 856 probability limit 69 218 219 526 536 904 probability models 664 666 667 675 676 probit 666 687 probit model bivariate probit models 710 719 derivatives 675 heteroscedasticity 680 LS estimator 701 MSL estimation 515 516 prediction with 686 probability model 675 676 with random effects 694 structural equations for 669 670 weighted least squares 688 problem of identification 378 380 385 395 production function constant elasticity of substitution 129 deterministic relationships 7 deviations from 502 example 102 104 generalized example 498 499 LAD estimation 449 450 problems analyzing 284 stochastic frontier model 505 production models 12 339 product limit estimator 798 product rule 904 projection 24 25 819 projection matrix 24 60 properties See also asymptotic properties of dynamic models 415 420 of estimators 460 465 of GLS estimator 208 of GMM estimator 540 544 of MLE 472 483 statistical properties of estimators 460 proportional hazard 799 proportions data 686 689 proxy variables 87 88 pseudodifferences 272 pseudoinverse 833 pseudo MLE 245 246 356 518 521 pseudo random number generators 920 921 pseudoregressors 167 169 182 500 PSI See Personalized System of Instruction PSI PSID Panel Study of Income Dynamics 283 305 709 Public Use Sample 721 purchasing power parity theory 650 652

    persistence 635 708 709 729 Personalized System of Instruction PSI 675 PE test 178 180 Phillips curve 251 253 568 570 Phillips Perron statistic 646 Phillips Perron test 644 645 piecewise continuous 122 pivotal quantity 891 point estimate 434 435 595 885 890 Poisson distribution distribution theory 856 maximum likelihood model and 470 471 MLE 485 486 Poisson regression model and 740 two step MLE 511 variance bound for 889 890 Poisson model application 745 747 censoring and truncation 774 censoring application 774 780 count data 740 latent class model example 441 MLE 521 742 negative binomial model and 744 nonlinear regression models 187 189 overdispersion 743 for panel data 747 749 zero altered Poisson model 749 752 polynomial distributed lag PDL 566 polynomials ARIMA model 632 Hermite polynomials 926 inversion in lag operator 613 invertible polynomial 564 in lag operator 563 571 596 Taylor series as 837 pooled regression 285 289 328 population moment equation 201 population quantity 19 population regression 19 20 42 population regression equation 7 positive definite matrix 831 834 835 837 positive semidefinite matrix 834 posterior density 430 posterior odds ratio 438 439 postmultiplication 805 Prais Winsten estimator 273 276 318 325 326 360 precision as statistical property 41 precision matrices 439

    Q
    Q test 269 271 QMLE See quasi maximum likelihood estimator QMLE QR models See qualitative response QR models quadratic approximation 837 quadratic forms full rank quadratic form 875 876 independence of 876 877


    matrix algebra 834 837 839 orthonormal quadratic form 873 quadratic hill climbing method 938 quadrature 692 694 928 929 qualification indices 145 qualitative choices 664 qualitative response QR models discrete choice models 663 econometrics 689 NIST and 833n12 selection in 790 quantile regression 448 quasi differences 272 quasi maximum likelihood estimator QMLE 246 673 674 quasi Newton methods 938 939 data generating mechanism 427 distribution of function 856 858 distribution theory 845 846 expectations 847 849 hazard function 859 of interest 155 limiting distribution of 107 moments and 614 probability density function 468 probability model example 676 random effects model 690 random vector and 868 theorem 22 1 757 variance of 847 random vector 868 random walk common trends and 653 Dickey Fuller tests 637 643 nonstationary processes 632 636 random walk with drift model 572 serial correlation 263 random walk with drift ADF GLS procedure 645 deterministic trends 635 Dickey Fuller tests 639 643 model 572 money demand example 657 658 nonstationary processes 631 634 testing 636 rank of a matrix 814 816 827 829 874 of a product 828 of a symmetric matrix 828 832 rank condition generalized regression model 203 GMM estimation 542 identification 389 394 nonlinear model example 404 ranking 664 rank two correction 939 rank two update 939 rate of inflation Dickey Fuller test 638 investment equation and 21 Phillips curve 251 testable implications and 93 94 rational lag model 573 575 ratio rule 904 RATS computer program 244n31 631 recursion computations and 698n31 recursive model 383 715 716 recursive residual 135 137 recursive systems 411


    R
    R2 coefficient of determination 439 See also sum of squared residuals generalized regression model and 209 Grunfeld investment model 330 hypothesis testing 678 Poisson model 741 Theil U statistic and 113 random coefficients 318 319 random effects model heteroscedasticity 316 317 instrumental variables estimator 303 306 panel data and 285 293 303 689 694 persistence and 729 random number generators 920 random parameters logit models and 728 729 MSL estimation 516 517 panel data and 285 286 random parameters logit RPL model 728 729 734 random parameters model 700 random sample consistent estimation 527 531 descriptive statistics 880 method of moments 526 529 multivariate probit models and 714 regression estimation 621 samples and 878 random utility model 670 719 random variables bivariate random variables 862 864 censored example 763 764 convergence to 904 905

    reduced form 379 380 384 415 416 598 reduced form disturbances 384 reduced form equation 87 regressand 7 regression binary choice models 665 668 binary variables in 116 117 changes in R2 34 conditional mean 15 864 with constant term 28 constant term and 37 diagnostics for 60 61 dummy variables in 117 duration model 792 798 gamma model estimates 129 individual coefficients 27 linearity and 11 linear regression model and 14 15 median regression 448 multiple correlation 36 and prediction 111 114 probability models and 667 residual variance in 866 in selection model 782 784 testing significance of 54 55 variables added to 30 without constants exercise 40 regression analysis forecasting and 33 Frisch Waugh theorem and 38 projection matrix and 24 test statistics 51 regression models assumptions 75 76 asymptotic properties of 72 constant terms and 15 function form 116 interaction term and 123 124 loglinear model 122 123 nonspherical disturbances 191 214 prediction and 111n8 semiparametric analysis 50 truncation 760 761 regression variance 867 regressors cointegration 652 data generation 16 17 determining appropriateness of 152 identical regressors 343 344 LSDV estimators and 298 nonlinearity and 124 nonstochastic regressors 16 45 46 193 590 591 population regression equation 7


    robust estimation of asymptotic covariance matrix 198 201 fixed effects model 314 316 generalized regression model 192 GMM 312 534 heteroscedastic regression model 216 least squares as 590 nonnormality 771 robustness to unknown heteroscedasticity 226 root mean squared error 113 602 root n consistent 909 roots 825 834 row rank 815 row space 815 row vector 803 RPL random parameters logit model 728 729 734 rules for limiting distributions 907 908 matrix multiplication 806 807 for probability limits 904 seemingly unrelated regressions SUR model autocorrelation and heteroscedasticity 360 362 capital asset pricing model 351 357 feasible GLS 344 347 generalized least squares 341 343 identical regressors and 343 344 linkages in disturbances 342 maximum likelihood estimation 347 351 357 360 selection model estimation in 784 787 normality assumption 789 qualitative response models 790 regression in 782 784 treatment effects 787 789 semilog earnings equation 664 semilog model 12 116 123 semiparametric 50 164 192 700 702 semiparametric estimation binary choice models 452 453 690 704 706 estimation frameworks 425 extremum estimators and 461 fixed effects model 699 LAD estimation 448 450 partially linear regression 450 452 semiparametric model 799 serial correlation common factor model 278 279 disturbance processes 256 259 efficient estimator 271 273 estimation 273 277 examples 250 253 forecasting with autocorrelation 279 280 GMM estimator 268 least squares estimator 265 267 testing for autocorrelation 268 271 time series data 253 256 259 265 Shazam computer program 244n31 short rank 13 14 83 815 816 shuffling 920 SIC33 primary metals industry 102n5 948 signature of the matrix 827 significance level 893 significance of group effects 289 significance tests 175 177 simple to general approach 151 564 583

    regressors continued pseudoregressors 167 169 182 500 stochastic regressors 47 48 74 280 307 theorem 186 transformed 210 zero values 174 regularity conditions 473 474 rejection region 892 relationships cointegration relationships 658 conditional moments 865 867 deterministic relationships 2 7 8 earnings and education 9 10 equilibrium relationship 579 655 income and consumption 8 linear relationships 13 marginal moments 865 867 persistence 635 and random walk 633 between variables 665 reliability ratio 88 residual analysis of variance table example 34 detrending and 635 dummy variable model 316 estimating equilibrium error 655 misspecified models 252 population regression and 19 in regressions 29 two step MLE 512 residual based tests 229 230 residual maker 24 39 residual variance 866 867 response treatment and 117 restricted least squares 99 104 restrictions common factor restrictions 583 586 degrees of freedom and 110 489 593 678 on disturbance covariance matrix 390 of equal correlation across periods 693 nested models and 93 95 normalization and 383 significance tests for 175 177 testing 172 414 415 484 548 551 ridge regression estimator 58 risk set 798 robust covariance estimations 673 674 robust covariance matrix 314 318 518 521

    S
    sacrifice ratio 596 597 sample midrange 878 sample minimum 883 898 sample periodogram 627 628 sample selection 713 714 756 781 782 sample variance 887 888 918 sampling 878 915 921 922 sampling distribution 42 44 45 882 885 See also normal distribution sampling variance 46 47 128 885 SAS computer program 409n22 scalar addition 809 scalar matrix 803 scalar multiplication 805 807 809 scalar valued function 838 scale 878 scaling 112 113 728 scatter diagram 879 882 scedastic function 865 Schwartz criterion 160 565 589 644 score test 489 score vector 476 scoring method See method of scoring seasonal adjustment 649 second derivatives matrix Hessian 838 second order effects 12


    Simpson's rule 928 simulated annealing 935 simulated moments 931 933 simulation 693 simulation based estimation 426 simultaneous equations bias 379n3 396 simultaneous equations models discrete choice models 716 dynamic models 415 420 fundamental issues 378 385 Klein's Model I 411 413 single equation 396 404 specification tests 413 415 system methods of estimation 404 411 VARs and 586 588 597 single equation 396 404 414 singular systems 362 369 singular value decomposition 833 size distributions 854 893 skewness choice based sampling and 673 estimation 454 879 nonnormality 772 Slutsky theorem discrete choice models 668 distribution theory 903 904 generalized method of moments 527 538 generalized regression model 204 least squares 70 85 nonlinear regression models 184 serial correlation 265 simultaneous equations models 399 smoothing function 452 457 458 software packages 244n31 377 409n22 631 Solow's technological change data 949 spanning vectors 813 specification 564 565 664 768 773 specification analysis 148 161 582 583 specification error 783 specification tests binary choice models 679 683 conditional moments 505 508 distribution theory 896 GMM estimation 549 Hausman's specification test 80 83 301 303 for nonlinear regressions 178 180 panel data 323 324 simultaneous equations models 413 415 specificity 924 spectral analysis 624 628 631 See also frequency domain spectral decomposition 827 832 spectral density function 625 627 spherical disturbances 15 16 192 193 spline function 120 square matrix 803 816 817 840 square summable 619 stability of demand equations 658 of dynamic equation 573 576 dynamic models 417 418 impulse response functions 593 594 testing model 134 143 vector autoregression and 602 Standard and Poor's Index 240 standard deviation 647 848 878 standard deviation units 123 standard error bootstrapping method 113 comparison of estimators 705 delta method 128 172 173 175 776 estimation 885 GMM estimator and 540 Grunfeld investment model 330 LAD estimation 450 least squares 49 267 MCS estimator and 689 OLS estimation 323 Phillips curve example 569 probability model example 676 Taylor series 741 standard error of the regression 49 standard normal cumulative distribution function 926 standard normal distribution 850 standard normal vector 873 874 876 starting values 171 Stata computer program 244n31 377 742 state changes time effects as 283 state dependence 690 708 710 stationarity ARMA model 611 614 economic variables and 631 manipulating series 649 MLE 483 models with lagged variables 559 regression models and 649 serial correlation 256 258 261 stationary 559 stationary conditions 241 242 stationary process 74 609 631 statistic 882


    statistical inference 50 55 200 201 statistical properties 41 47 statistical tables 953 958 statistics estimation and inference 878 882 as estimators 882 885 sufficient statistics 530 steepest ascent method 937 stepwise model building 152 Stirling's approximation 928 stochastic elements 3 8 688 stochastic frontier model 429 501 505 stochastic regressors 47 48 74 280 307 stochastic volatility 238 stock market returns ARCH model and 238 economic analysis and 624 fast Fourier transform 631 long memory models 647 Stone's expenditure system 362 strike duration data 952 strong exogeneity 591 strong law of large numbers 901 902 strongly exogenous variables 382 strongly trended 632 635 strong stationarity 260 612n5 structural breaks gasoline consumption study 136 137 model constancy and 141 modeling for 130 134 testing model instability 660 unknown timing of 139 143 structural change 116 147 structural disturbances 382 384 structural equations 379 669 670 structural models 87 379n2 559 560 587 595 597 structural VARs 595 600 structures 386 390 Student's t distribution 954 submatrices 822 subspace 813 substitution 12 594 sufficient condition 840 sufficient statistics 530 summability 264 sum of squared residuals as criterion 31 Lagrange multiplier test 177 least squares regression and 299 testing procedures based on 159


    distribution theory 849 GMM estimation 543 inference 109 least squares 70 limiting distribution and 914 method of moments estimator 532 Newton's method 937 regression models 165 166 178 179 regularity conditions 474 second order 367 standard error 741 testable implications 93 94 testing See also hypothesis testing specification tests aggregation bias 341 for autocorrelation 268 271 based on LM statistic 177 178 causality 590 593 classical procedures 892 895 for cointegration 655 657 common factors 278 279 585 586 confidence intervals 895 896 consistent test 894 895 CUSUM test 135 137 for GARCH effects 244 245 GPH test 649 for heteroscedasticity 222 225 508 for homoscedasticity 506 maximum likelihood estimation 484 492 model stability 134 143 659 660 nonlinear restriction 108 110 nonnormal disturbances and 104 108 for overdispersion 743 744 overidentifying restrictions 414 415 Phillips Perron test 644 predictive test 137 139 for random effects 298 301 recursive residuals 135 137 restrictions 172 score test 489 significance tests 175 177 size of 893 for structural breaks 130 134 summary of procedures 271 unit root 636 637 639 643 vector autoregression 589 590 for zero correlation 712 test statistics cointegration 657 confidence interval test 491 deriving behavior of 141 Dickey Fuller tests 645 feasible GLS 347 heteroscedasticity 508 hypothesis testing 465 678 income elasticity 53 known distributions 155 likelihood function 494 long run models 659 marginal distributions of 55 nonlinear restrictions and 109 nonnested models 751 nonnormal disturbances and 105 reformulated 156 regression analysis and 51 structural change 143 theorems 486 487 489 Theil U statistic 113 three stage least squares 3SLS 405 407 414 threshold effect 120 time profile 120 time series data analysis of 284 autocorrelation and 192 250 covariance structures for 320 334 Durbin Watson test and 126 empirical studies and 283 ergodicity of functions theorem 541 heteroscedasticity in 215 homoscedasticity 192 macroeconometrics and 5 models with lagged variables 559 regression estimation 621 serial correlation 253 256 259 265 time series process cointegration 649 660 collinearity and 154 dependent observations and 73 74 distinguishing 5 estimation 878 homoscedasticity and 238 least squares and 566 modeling 76 111n8 nonstationary processes and unit roots 631 649 problems with 284 stationary stochastic processes 609 631 time trend cointegration 652 Dickey Fuller tests 644 investment equation and 21 22 as regression variable 27 time varying covariates 792 time window 254 tobit model

    sum of squares See also generalized sum of squares change and 30 39 as criterion function 537 example 171 minimizing 182 orthogonality condition and 164 165 Phillips curve example 569 570 sum rule 904 sums of values 807 808 superconsistent 572 656 superfluous variables 148 supply equation 378 supremum test 141 602 SUR model See seemingly unrelated regressions SUR model survival function 792 795 797 858 survival model 800 801 symmetric matrix characteristic roots vectors 830 decomposition 835 idempotent 832 matrix algebra 803 839 rank of 828 system methods of estimation 396 404 411 413 systems of equations matrix algebra 819 822 simultaneous equations models 378 381 singular systems 362 369

    T
    t distribution 851 853 954 t ratio least squares 83 MCS estimator and 689 MPC example 110 restricted investment equation 99 robust estimation and 201 significant effects 739n65 single linear restriction and 101 test statistic 51 White estimator and 220 t statistic 104 105 t test 106 632 Taylor series asymptotic normality 464 478 CES production function 129 162 classical model 13 computation and optimization 937 differentiation and 837 840 discrete choice models 687


    censored data 764 766 censoring application 774 780 LM test of normality 771 misspecification and 770 multiplicative heteroscedasticity in 768 total variation 31 trace 829 830 832 transformation achieving stationarity via 631 Box Cox transformation 173 175 Box Jenkins approach 620 continuous distributions 921 manipulating series via 649 matrix algebra 844 845 stabilizing transformation 908 transitions time effects as 283 translog example 162 163 451 452 systems of equations 366 369 translog model demand and production studies 12 13 inference 104 nonlinear cost function 126 production function example 103 499n17 transposition 804 807 trapezoid rule 928 treatment and response 117 treatment effect 787 789 trend stationary 634 636 639 643 645 triangular matrix 803 triangular system 383 395 397 trigamma function 928 truncated bivariate normal distribution 781 truncated distribution 757 760 truncated mean 759 truncated normal distribution 757 759 760 929 932 truncated random variable 757 truncated variance 759 truncation improving estimator with 627 628 model for counts 773 774 moments of truncated distributions 758 760 truncated distributions 757 truncated regression model 760 761 TSP computer program 244n31 377 409n22 919n1 942 Tukey window 628 Twinsburg Ohio 87 88 two stage least squares 2SLS 3SLS and 406 estimator 380 Klein's Model I 412 413 least squares 74 80 method of estimation 398 400 nonlinear regression models 183 nonlinear systems and 403 404 panel data 313 specification tests 414 two step estimation of credit scoring model 186 189 heteroscedasticity 227 228 231 limited models 784 nonlinear least squares 183 186 two step maximum likelihood estimation 508 512 two step nonlinear least squares 740n67 two variable regression model 38 39 46 47 type I error 133 152 892 893 type II error 892 893


    V
    VAR See vector autoregression VAR variable metric algorithm 939 variables See also lagged variables random variables absence of multicollinearity 163 added to regressions 30 bias caused by omission of 148 149 censored variables 762 763 changes in R2 34 deterministic relationships among 7 difficulties measuring 4 economic variables 619n11 631 endogenous variables 381 404 416 587 existence in theories 84 explained variable 7 FDI variables 700 irrelevant variables 150 151 jointly dependent variables 379 linear relationships and 13 lognormal variables 854 macroeconomic variables 575 596 matrix algebra and 23 measurement error and 85 86 nonlinearity in 122 130 predetermined variables 380 382 383 proxy variables 87 88 relationships between 665 superfluous variables 148 univariate time series and 609 VAR model for 596 variance analysis of 867 attenuation of 761 conditional variance 47 241 865 confidence intervals 892 decomposition of 866 of disturbances 15 16 least squares estimator 48 49 56 266 267 of the median 925 Poisson distribution 889 890 of random variables 847 tests of structural breaks 133 134 truncated variance 759 variance inflation factor VIF 57 variation 31 37 vector autoregression VAR analyzing series as 649 cointegration 654 655 error correction and 654 estimation 588 589

    U
    unbalanced panel 293 689 unbalanced sample 685 unbiasedness asymptotic unbiasedness 917 best linear unbiased BLU 193 best linear unbiased estimator BLUE 890 efficient unbiased estimator 886 888 890 establishing conditionally 47 finite sample results 41 least squares estimator 44 45 151 linear unbiased estimator 46 48 minimum variance unbiasedness 887 MVLUE 44 890 nonlinear restrictions and 109 as statistical property 41 460 unbiased estimator 886 unbiased test 894 uncentered moment 527 unidentified structure 385 uniform kernel 455 uniformly most powerful UMP 894 unit root 631 650 univariate autoregression 574 univariate time series 609 619 621 unordered choice models 719 upper triangular matrix 832


    linear regression model 494 model based tests 230 MSL estimation 514 municipal expenditures example 605 null hypothesis and 133 323 324 restrictions and 484 robust estimation and 200 201 significance tests for restrictions 175 177 tobit model 775 VAR testing 590 White estimator 220 Wald test application example 491 CAPM model 356 F statistic 593 functional form 134 GMM counterpart 549 551 GMM estimation 548 Grunfeld investment model 331 hypothesis testing 115 465 484 486 488 676 least squares 82 model based tests 230 noninvariance of 110 Poisson distribution 745 significance test for restrictions 176 specification analysis 158 testing for zero correlation 712 travel mode choice example 733 type I error and 133 usability of 139 White estimator 220 weak law of large numbers 900 weakly exogenous variables 382 659 660 weakly stationary 254 612 weak stationarity 260 websites BEA 282 948 economagic.com 948 Fair data download 774 952 NIST 833n12 Weibull model discrete choice models 667 675 695 771 limited models 794 798 800 801 weighted average 301 567 weighted endogenous maximum likelihood WESML estimator 673 weighted least squares WLS classical regression 240 GMM estimation 537 heteroscedasticity 216 224 227 log likelihood function 688 probit models 688 two step estimation 231 weighting functions 929 weighting matrix 205 207 295n12 312 400 401 409 539 544 weighting of observations 128 well behaved data 478 483 White estimator CAPM model and 356 generalized regression model 199 201 GMM estimation 546 heteroscedasticity 220 221 panel data 315 316 pseudo MLE 520 regression equations 373 serial correlation 267 simultaneous equations models 401 410 white noise Dickey Fuller tests and 643 differencing and 635 fractional integration 647 regression models and 649 serial correlation 257 stationary stochastic processes 609 611 testing for 622 time series and 635 White's test 222 224 324 330 windowing improving estimator with 627 628 Wishart distribution 445 Wishart prior density 444 within groups estimator 289 290 WLS See weighted least squares WLS Wold's theorem 593 619 620 Wu statistic 83 WZ with zeros model 750n75

    vector autoregression continued exogeneity 590 592 GMM 555 Granger causality 592 593 impulse response functions 593 595 lagged variables 574 586 605 in microeconomics 602 605 model forms 587 588 policy analysis 596 602 stability 602 structural VARs 595 596 testing 589 590 656 vector moving average VMA 596 597 vectors cointegrating vectors 650 652 653 659 column vector 803 gradient vector 838 length of 818 linear combinations 811 matrix algebra 825 834 normal vector 873 875 random vector 868 spanning vectors 813 vector multiplication 805 vector space basis for 813 column space 814 matrix algebra 809 810 velocity of money 658 VIF variance inflation factor 57 VMA vector moving average 596 597 volatility market 238 647

    W
    Wald criterion chi squared test and 302 parameter vectors 139 testing restrictions 96 415 Wald statistic as alternative test 328 329 assumption of normality and 110 conditional moment tests 507 as exponential family 530 feasible GLS 347 functional form 141 GMM estimation 548 550 551 Grunfeld investment model 331 hypothesis testing 177 327 741 inference 106 least squares 81 83 356 likelihood function and 494 likelihood test 491 limiting distribution of 107 108

    Y
    Yule Walker equations 266 616

    Z
    ZAP zero altered Poisson model 750n75 ZAP zero altered Poisson model 749 752 zero correlation 712 zero matrix 804 zero mean 14 zeros blocks of 357 360 589 ZIP zero inflated Poisson model 750n75 751 779 780