Aircraft Crash Analysis by Word Cloud
This article uses Natural Language Processing (NLP) to analyze the causes of airplane accidents between 1969 and 2009. We used the data set provided by data.world, a detailed database of airplane crashes that allows an in-depth analysis by anyone interested in the subject. As mentioned in the previous paper, the data starts in 1908, but we decided to analyze the modern era of flight in order to reflect the effectiveness of modern aerospace safety standards.
Wikipedia’s definition is: “Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The result is a computer capable of “understanding” the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.”
A tag cloud (also called a word cloud, or a weighted list in visual design) is a novelty visual representation of text data: the bigger a term appears, the greater its weight. A word cloud is a convenient way to explore text data visually in the form of tags, or words, where the importance of a word is indicated by its frequency.
Some domain knowledge is useful here. Wikipedia’s definition of an aviation accident is: “An aviation accident is defined by the Convention on International Civil Aviation Annex 13 as an occurrence associated with the operation of an aircraft, which takes place from the time any person boards the aircraft with the intention of flight until all such persons have disembarked, and in which a) a person is fatally or seriously injured, b) the aircraft sustains significant damage or structural failure, or c) the aircraft goes missing or becomes completely inaccessible.”
According to the Federal Aviation Administration (FAA), the top 10 leading causes of fatal general aviation accidents in 2001–2016 are:
1. Loss of Control Inflight,
2. Controlled Flight into Terrain,
3. System Component Failure — Powerplant,
4. Fuel Related,
5. Unknown or Undetermined,
6. System Component Failure — Non-Powerplant,
7. Unintended Flight in IMC (Instrument Meteorological Conditions),
8. Midair Collisions,
9. Low-Altitude Operations,
10. Other.
Another approach summarizes the reasons for airline disasters as follows:
1. Pilot Error — 50%.
2. Mechanical Failure — 20%.
3. Weather — 10%.
4. Sabotage — 10%.
5. Human Error (other) — 10%.
We used word clouds to address the following questions:
- What are the major causes of aviation accidents?
- Which operators have the highest accident rates?
- Is the time of the day a parameter?
- Which planes have the worst accident statistics?
The first step is to import the data with pandas and clean it where needed before starting the word cloud analysis.
The database records 5268 accidents, with some null values. The ‘Summary’ column has 390 NaNs, and these rows were dropped.
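As a minimal sketch of this cleaning step, assuming the report text lives in a column named `Summary`: the tiny in-memory frame below is a placeholder for the real file, and its values are invented for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the real data; in practice the frame would be
# loaded from the downloaded CSV with pd.read_csv(...).
raw = pd.DataFrame({
    "Date": ["09/17/1969", "07/12/1985", "08/06/2009"],
    "Summary": ["Crashed on approach in fog.", None, "Engine failure after takeoff."],
})

# Count missing values per column to see where the nulls are.
null_counts = raw.isna().sum()

# Drop only the rows whose Summary is missing, then renumber the index.
clean = raw.dropna(subset=["Summary"]).reset_index(drop=True)

print(null_counts["Summary"], len(raw), len(clean))
```

Restricting `dropna` to the `Summary` column keeps rows that are incomplete elsewhere but still have a usable report text.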
The next step was to create two new columns, ‘year’ and ‘hour’: ‘year’ was extracted from the ‘Date’ column and ‘hour’ from the ‘Time’ column.
After that, a new data frame was created containing only the columns needed for further analysis.
With the data frame ready, the next stage was to prepare the ‘summary’ column for the word cloud. The preparation function performs:
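The two steps above could look roughly like the following. This is a sketch under assumptions: dates are taken to be in MM/DD/YYYY form, times in HH:MM form with occasional stray characters, and the column names other than ‘Date’ and ‘Time’ are guesses. The sample rows are invented.

```python
import pandas as pd

# Invented sample rows standing in for the cleaned data frame.
df = pd.DataFrame({
    "Date": ["03/05/1950", "09/17/1969", "07/12/1985", "08/06/2009"],
    "Time": ["10:00", "17:18", None, "c 06:30"],
    "Operator": ["Operator A", "Operator B", "Operator A", "Operator C"],
    "Summary": ["text one", "text two", "text three", "text four"],
})

# Assumed date format: MM/DD/YYYY.
df["year"] = pd.to_datetime(df["Date"], format="%m/%d/%Y").dt.year

# Pull the hour out of "HH:MM"; missing or malformed times become NaN.
hour_text = df["Time"].str.extract(r"(\d{1,2}):\d{2}", expand=False)
df["hour"] = pd.to_numeric(hour_text, errors="coerce")

# Keep the study period and only the columns needed later.
keep = ["year", "hour", "Operator", "Summary"]
subset = df.loc[df["year"].between(1969, 2009), keep]

print(subset)
```

Extracting the hour with a regular expression rather than a strict time parser is a deliberate choice here: it tolerates entries that carry extra characters around the time.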
· word tokenizing,
· getting rid of special characters and punctuation,
· removing the stop words,
· joining all the words in the summary parts of the reports.
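The four steps listed above can be sketched with the standard library alone. This is not necessarily the function used for the article: the name `prepare_text` is hypothetical, the stop-word set is a tiny illustrative sample (a fuller list such as NLTK's English stop words would normally be used), and lowercasing is added so that stop words match regardless of case.

```python
import re

# Small illustrative stop-word set; replace with a complete list in practice.
STOP_WORDS = {
    "a", "an", "and", "the", "of", "in", "on", "to", "was", "were",
    "into", "after", "while", "at", "it", "its", "by", "for", "with",
}


def prepare_text(summaries):
    """Tokenize, strip punctuation, drop stop words, and join everything."""
    kept = []
    for summary in summaries:
        # Lowercase and keep alphabetic runs only; this tokenizes and
        # removes digits, punctuation, and special characters in one pass.
        tokens = re.findall(r"[a-z]+", summary.lower())
        kept.extend(t for t in tokens if t not in STOP_WORDS)
    return " ".join(kept)


text = prepare_text([
    "The aircraft crashed into a mountain while on approach.",
    "Engine failure after takeoff; the plane crashed.",
])
print(text)
```

The result is one long space-separated string, which is the input form most word cloud tools expect.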
Word Cloud — Report Summaries
A word cloud is a collection of words drawn in different sizes. The bigger and bolder a word appears, the more often it occurs in the text and, arguably, the more important it is.
Our approach in analyzing the accident report summaries is based on finding the most frequent words and then preparing the Word Cloud in order to get the results in a visual form.
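A rough sketch of this approach is shown here: count the words first, then hand the counts to a renderer. The helper names and the output file name are hypothetical, and the rendering part assumes the third-party `wordcloud` package is installed; it is imported lazily so the counting part runs without it.

```python
from collections import Counter


def word_frequencies(text):
    """Count how often each whitespace-separated word occurs."""
    return Counter(text.split())


def save_word_cloud(frequencies, path):
    """Render the frequencies as an image (requires the wordcloud package)."""
    from wordcloud import WordCloud  # lazy import: only needed for rendering

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)


# Invented sample text standing in for the joined report summaries.
freqs = word_frequencies("aircraft crashed approach aircraft crashed aircraft")
print(freqs.most_common(2))

# To write the image, uncomment the line below (hypothetical file name):
# save_word_cloud(freqs, "summary_cloud.png")
```

Keeping the counting separate from the drawing makes it easy to inspect the most frequent words as plain numbers before looking at the picture.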
The wordcloud above can be misleading because the most emphasized words are expected in every accident report and say nothing about the causes. We therefore listed the most frequent words in the reports and removed the top three.
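One way to express this filtering step is sketched below. The function name `drop_top_words` is hypothetical, and the sample text is invented; it is not a claim about which three words were actually removed from the reports.

```python
from collections import Counter


def drop_top_words(text, n=3):
    """Remove the n most frequent words from a space-joined text.

    Returns the filtered text and the sorted list of removed words.
    """
    words = text.split()
    top = {word for word, _ in Counter(words).most_common(n)}
    filtered = " ".join(w for w in words if w not in top)
    return filtered, sorted(top)


# Invented sample: three generic words dominate, the rest hint at causes.
sample = (
    "aircraft crashed plane fog aircraft crashed plane engine "
    "aircraft crashed plane approach fog"
)
filtered, removed = drop_top_words(sample)
print(removed)
print(filtered)
```

An alternative with the same effect would be to add the dominant words to the stop-word list, which keeps all filtering in one place.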
The new wordcloud below is a more accurate picture of the accident reports:
Word Cloud — Operators
Some operators have a long record of accidents. The wordcloud below gives an idea of the operators that appear most often in the accident records. One thing to keep in mind is the potentially high risk of military flights.
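For a column of names, it can help to count whole values instead of single words so that multi-word operators stay together. The sketch below uses invented placeholder names and assumes, purely for illustration, that military flights carry a "Military" prefix; the resulting counts could then be passed to `WordCloud.generate_from_frequencies` from the `wordcloud` package.

```python
from collections import Counter

# Placeholder values; real ones would come from the operator column.
operators = [
    "Operator A", "Military - Unit X", "Operator A",
    "Operator B", "Military - Unit X", "Operator A ",
]

# Count whole names (trimmed) so multi-word operators are not split up.
operator_counts = Counter(name.strip() for name in operators)

# Assumption: military flights are labelled with a "Military" prefix.
civil_counts = Counter({
    name: count
    for name, count in operator_counts.items()
    if not name.startswith("Military")
})

print(operator_counts.most_common())
print(civil_counts.most_common())
```

Comparing the two counters shows how much of the picture is driven by military entries.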
Word Cloud — Type
The wordcloud below shows the airplane manufacturers. The result may be considered misleading, as some of the companies have been in the market for almost a hundred years and thousands of their planes fly millions of hours annually.
Word Cloud — Location
The wordcloud below was prepared using the ‘Location’ column in the data frame. It should be kept in mind that Russian airplanes have a long history of accidents and that Russia, covering an expanse of over 6.6 million square miles, is the world’s largest country by landmass.
Word Cloud — Hour of the Day
The final wordcloud was prepared with the intention of investigating the riskiest time frame of the day. The figure below gives only a limited answer to this question.
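Because a word cloud is not well suited to numeric values, a plain ranked count can be a useful cross-check. The sketch below uses invented hour values, with `None` standing for a missing time; it only illustrates the counting and makes no statement about the real distribution.

```python
from collections import Counter

# Invented hour values; None marks rows where the time was missing.
hours = [6, 14, 6, 22, None, 14, 6]

# Count accidents per hour, ignoring missing values.
hour_counts = Counter(h for h in hours if h is not None)

# Rank by count (descending), breaking ties by the earlier hour.
ranked = sorted(hour_counts.items(), key=lambda item: (-item[1], item[0]))

# String labels such as "06:00" are what a word cloud tool would need as keys.
labels = {f"{hour:02d}:00": count for hour, count in hour_counts.items()}

print(ranked)
print(labels)
```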
In this article, we used word clouds as a data visualization technique to analyze aircraft accidents between 1969 and 2009. The analysis suggests that:
- Take-off, approach and landing are the most dangerous phases of flight.
- Many accidents occurred in the largest countries and territories.
- The riskiest hours for accidents are 9 a.m. and 7 p.m.
You can access this article and similar ones here.