Extracting Death Count from Accident Reports Using Python
Today, I wrote a function in Python to identify the number of deaths in accident reports in the civil engineering industry. Here is my design approach:
- After reviewing the reports, I found that 80% of them record the number of deaths in a consistent pattern: "XX人死亡" ("XX people died"), where XX is written in either Chinese numerals or Arabic numerals.
- Therefore, I designed a regular expression, pattern_death = r'([\d\w]+)(人死亡|名工人死亡)', to identify this pattern.
- During testing with several investigation reports from Anhui Province, I found that this pattern also captures the text in front of the number, back to the nearest punctuation mark, so a match looks like "5 people died in the project" rather than just "5 people died." This happens because ([\d\w]+) matches any run of word characters, which includes Chinese characters as well as digits (\w already covers digits, so the \d is redundant).
- As a result, I designed a second regular expression, pattern_death_num = r'(\d+|[零一二两三四五六七八九十百千万]+)(人死亡|名工人死亡)', to pick out the number (Chinese or Arabic) from the sentences matched by the first pattern.
- Then I installed a third-party library called cn2an to convert Chinese numerals into Arabic numerals.
- I stored the recognized numbers in a dictionary and marked sentences that contain "death" but no recognized number as "to be confirmed."
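The steps above can be sketched roughly as follows. This is a minimal illustration, not the original function: the two patterns are quoted from the list, but `small_cn_to_int` is a hypothetical stand-in for cn2an that only handles numerals below 100, and the file names and sample sentences are invented.

```python
import re

# Patterns quoted from the post.
pattern_death = r'([\d\w]+)(人死亡|名工人死亡)'
pattern_death_num = r'(\d+|[零一二两三四五六七八九十百千万]+)(人死亡|名工人死亡)'

# Hypothetical stand-in for cn2an: only covers numerals from 0 to 99.
DIGITS = {'零': 0, '一': 1, '二': 2, '两': 2, '三': 3, '四': 4,
          '五': 5, '六': 6, '七': 7, '八': 8, '九': 9}


def small_cn_to_int(token):
    """Convert Arabic digits or a small Chinese numeral to int, else None."""
    if re.fullmatch(r'\d+', token):
        return int(token)
    if '十' in token:
        tens, _, ones = token.partition('十')
        tens_val = DIGITS.get(tens) if tens else 1   # '十二' means 12
        ones_val = DIGITS.get(ones) if ones else 0   # '二十' means 20
        if tens_val is None or ones_val is None:
            return None
        return tens_val * 10 + ones_val
    return DIGITS.get(token)


def extract_death_count(text):
    """Return an int, 'to be confirmed', or None if deaths are not mentioned."""
    # Stage 1: loose match; group 1 may include text before the number.
    loose = re.search(pattern_death, text)
    if loose:
        # Stage 2: isolate the numeral inside the span found in stage 1.
        strict = re.search(pattern_death_num, loose.group(0))
        if strict:
            value = small_cn_to_int(strict.group(1))
            if value is not None:
                return value
    return 'to be confirmed' if '死亡' in text else None


# Invented sample sentences keyed by hypothetical file names.
reports = {
    'report_a.txt': '该项目发生坍塌事故，造成5人死亡，2人受伤。',
    'report_b.txt': '事故造成十二人死亡。',
    'report_c.txt': '事故造成多人死亡。',
}
results = {name: extract_death_count(text) for name, text in reports.items()}
print(results)
# {'report_a.txt': 5, 'report_b.txt': 12, 'report_c.txt': 'to be confirmed'}
```

In the first sample, the loose pattern alone captures "造成5" as its first group, which is the over-capture described above; the second pattern narrows it to "5".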
It may sound effortless as I describe it here, but it took me a long time to get to this point, and much of that time went into designing the logic for recognizing conditions, avoiding errors, and storing the recognized data.
After numerous errors and iterations, the program is now functional, and this approach extracts death counts from 393-56 of the 393 accident reports, that is, 337 files.
You may wonder: the graph clearly shows 57 files marked as "to be confirmed," so shouldn't the number of extracted files be 393-57?
Indeed, that one missing file is the protagonist I want to talk about today.
The Workload Behind a One-Character Difference
This particular file consistently caused errors, and I had to test the files province by province to find it.
I discovered that the program did recognize "X people died" and extracted "X," but the problematic file wrote "两人死亡" instead of "二人死亡" (both mean "two people died"). In my testing, cn2an, the library that converts Chinese numerals into Arabic digits, recognized "二" but not "两".
Therefore, I had to add a new rule: if "两" is recognized, it is replaced with "二" before cn2an converts it into Arabic digits.
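A minimal sketch of that rule is shown below. To keep it dependency-free, `strict_convert` is a hypothetical stand-in that mimics a converter which knows "二" but rejects "两"; it is not cn2an's actual implementation, and the function names are my own.

```python
# Hypothetical stand-in for a converter that knows '二' but not '两'.
STRICT_DIGITS = {'一': 1, '二': 2, '三': 3, '四': 4, '五': 5,
                 '六': 6, '七': 7, '八': 8, '九': 9}


def strict_convert(numeral):
    """Convert a single Chinese digit; raise ValueError for anything else."""
    if numeral not in STRICT_DIGITS:
        raise ValueError(f'cannot convert {numeral!r}')
    return STRICT_DIGITS[numeral]


def normalize_liang(numeral):
    """Replace the colloquial '两' with '二' before conversion."""
    return numeral.replace('两', '二')


def to_arabic(numeral, convert=strict_convert):
    """Normalize first, then hand the numeral to the converter."""
    return convert(normalize_liang(numeral))


print(to_arabic('两'))  # 2
print(to_arabic('二'))  # 2
```

Because the converter is passed in as a parameter, the real library call could be substituted for the stand-in without touching the normalization step.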
It took me nearly half an hour to identify the cause of this error and write the statement that fixes it. A difference of just one character required extensive testing before the program could handle it.
As a non-computer science student, I am curious whether "spending a lot of time on error resolution" is a common occurrence in software development.
Returning to the topic: this is a relatively simple step in our data extraction process, yet covering every possibility still takes a considerable amount of work.
If we were to extract data like "accident nature" from the text, there might be various ways to describe it (which we are currently unsure of), and it would likely be more complex than dealing with numbers. Additionally, the context may not have a specific format. If that is the case, regular expressions alone may not be sufficient to achieve the goal.
If regular expressions cannot solve this problem, then based on what I currently know, we would need more complex NLP techniques, such as named entity recognition and the analysis of semantic and logical relationships, so that the machine can understand sentences and classify accident causes. For the program to understand anything, it would also need a pre-existing database of known information.
If we were to use NLP techniques to extract "accident causes" from all accident reports, the workload to create a program that adapts to all situations would undoubtedly be significant.
Why is it so challenging?
What is the reason behind the challenges we face when extracting data?
The answer is obvious: Most information in accident reports is not recorded in a standardized format. If there were a standardized format, if each accident report had a unified checklist, then each type of data would only require one regular expression to extract, and it would be compatible with all files.
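To illustrate the point, here is a small sketch that assumes a standardized checklist existed. The report layout and the field names (such as "死亡人数" for "number of deaths") are entirely hypothetical; real reports do not follow this format, which is exactly the problem.

```python
import re

# Hypothetical standardized report: one "field: value" line per item.
report = """\
事故性质: 生产安全责任事故
死亡人数: 2
受伤人数: 5
"""


def extract_field(text, field):
    """Return the value on the line that starts with `field`, or None."""
    # Accept both the ASCII colon and the full-width Chinese colon.
    pattern = rf'^{re.escape(field)}\s*[:：]\s*(.+)$'
    match = re.search(pattern, text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


print(extract_field(report, '死亡人数'))  # 2
print(extract_field(report, '事故性质'))  # 生产安全责任事故
```

With a fixed layout like this, the same one-line pattern works for every field and every file, and no special cases such as "两" versus "二" are needed.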
Unfortunately, in terms of the format of accident reports, there is no great unifying moment like the one when Qin Shi Huang unified the six states.
Now, ChatGPT-4 should be able to automatically extract the data you need, but it requires payment. Our teacher wants us to create a program for automatic recognition, which can be quite challenging, haha.
Qin Shi Huang's Unification of the Six States
Qin Shi Huang established the first unified centralized feudal state in Chinese history. He not only unified China but, more importantly, standardized the written language, currency, and transportation.
These unifications not only facilitated information transmission and standardization but also contributed to the inheritance and development of our Chinese culture.
Today, I encountered an error caused by the character "两" in my code and reflected on the workload required to extract "accident nature." Doesn't this illustrate that uniform standards quietly solve many problems before they ever arise?
I believe the answer is yes.