4 min read

Why this field is missing?

One of the daily challenge for someone who works with data is missing data. After a few years working with historical data (which means it is not captured for your analysis), I noticed that data was seldom missing at random and understanding how/why it is missing could be very important for the analysis (especially when your analysis would have significant impact and needs to be carefully validated).

Before I go into the details, I would say my experiences are based on insurance pricing, namely, how to come up with appropriate price for a given risk. For example, when you quote for a personal auto policy, the prcing model should adequately give some discount for positive traits and experiences (safe driver, no moving violation etc.) and surcharge for other less desirable traits (young male driver with a fancy sports car). As insurance pricing is heavily regulated in US at state level, the final pricing model needs to be filed to state regulators (for example, in Massachusetts, division of insurance) and sometimes available to public. Obviously, the final model needs to be reasonable even to people who is not familiar with statistics. The need for interpretability and obvious dollar impact of the models (if your model is less accurate compared to competitors, you will get adverse selected) would mean the data scientists need to understand all parts of their models and are able to explain them. Combined with the nature of our data(historical and often human input), it leads to many rabbit holes that I would try to summarize here:

  1. The field is not applicable for the record. For example, in property insurance, you can only insure the contents of the building but not the building itself(when you are leasing for example). Moreover, the contents might not even in a building (stored outside). As a result, the corresponding record would not have building information such as number of stories.

  2. The field is not available for the record. When I say the information is not available, I mean the information itself exists and applicable but we just don’t have it. In general, those missing can due to either real unavailability of the data to the data scientists or mistakes that can be corrected. An non-exhaustive list (would update when I notice new cases):
  • The insured/agent does not know. For example, one may not know the exact year in which his/her property was built.
  • The insured has an incentive not to tell the insurance company. For example, one may not want to tell insurance company that he/she has multiple at-fault accidents with previous insurance company.
  • The third party data vendor cannot provide matched record. There are at least two potential reasons for this type of missing: one is that the data is not collected by the vendor (a newly established business may not have a credit history yet). Another reason could be that our record key is different from the vendor’s (slightly different business name for example).
  • The options we provided in the user interface do not include the actual value. For example, the insured may have a vehicle registered in another country and thus the state/zip code field would be missing.
  • The insured/agent does not understand the options we provided clearly and thus not able to choose the right value. For example, for business insurance, we usually need to know what this business is about. There are a few different ways (SICS, NAICS) to characterize the business type but it can be difficult to match one business into one code(as there are thousands of them).
  • The insured/agent does not think the field is important and simply ignore it. This happens especially often when we ask for information about multiple locations/vehicles/drivers as it can be time consuming to provide all those information.
  • The process changed and the field is not longer available. Or this field is only captured after a certain date and not available before.
  • When your data is coming from multiple sources (there might be multiple quoting platforms for example), the data available in one source may not be available in another.
  • The data is missing due to import/export. Transform one data format to another may cause missing values when the the field is not missing actually in orginal source.
  • The data is missing due to data processing error. For example, when one re-categorized a field, one may make a mistake and accidentally assign a valid value to missing level.
  1. The field value is not valid. Technically, the value is not missing but it is obviously counterintuitive and contains no valid information so I include it as a form of missing data. For example, I have seen 2010 years old driver, 0 stories building, workers that was paid 1 cent annual salary and many more. Those are actually more difficult to process as there are no general patterns for those missing values.

I’ll use the next post to explain what I have tried when facing with all those missing data and consideration involved but at least the new website is no longer empty!

Tree not in a random forest

Tree not in a random forest