SED: An Algorithm for Automatic Identification of Section and Subsection Headings in Text Documents


Bello Aliyu Muhammad, Rahat Iqbal, Anne James and Dianabasi Nkanta

Word processing applications such as Microsoft Word offer advanced features like the automatic table of contents (ToC). The ToC represents the headings of the sections and subsections within a document. Currently, there is no computational procedure that traverses a document and identifies its sections and subsections to extract the information needed for the ToC and for other text analytics purposes. All such applications rely on users to identify and highlight the text (headings and subheadings) that is to appear in the ToC. Text documents are organised into sections and subsections, each with a named heading or subheading. This paper presents a novel algorithm for automatically identifying the headings and subheadings of all sections within a text document. By leveraging this algorithm, the generation of the table of contents can be fully automated, so that users do not have to identify or select the headings and subheadings manually. The algorithm is simple, rule-based and unsupervised; because no training is involved, it saves a great deal of time. The algorithm has been tested on several documents (papers) and achieved an accuracy of over 82%. The algorithm also improves the computational capabilities of current natural language processing approaches. It is also useful for automating some tasks in systematic literature reviews, and would speed up the analysis and evaluation of natural language resources and text analytics in general.

Keywords: Natural language processing, big data, text mining, information retrieval, algorithm.
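
The abstract does not list the rules the algorithm applies, so the following is only a minimal sketch of what a rule-based, unsupervised heading detector might look like. It is not the SED algorithm itself. The rules (short line, no trailing punctuation, numbered or all-capitals), the names heading_level and extract_headings, and the MAX_WORDS cut-off are all assumptions made for illustration.

```python
import re

# Hypothetical rules, NOT the rules from the paper: a line is a heading
# candidate if it is short, does not end in punctuation, and is either
# numbered ("2.1 Related Work") or written entirely in capitals.
NUMBERED = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")
MAX_WORDS = 12  # assumed cut-off for the length of a heading


def heading_level(line):
    """Return an assumed nesting level (1 = section) or None if not a heading."""
    text = line.strip()
    if not text or len(text.split()) > MAX_WORDS or text[-1] in ".,;:?!":
        return None
    match = NUMBERED.match(text)
    if match:
        # Depth is read from the numbering: "2" -> 1, "2.1" -> 2
        return match.group(1).count(".") + 1
    if text.isupper():
        return 1
    return None


def extract_headings(document):
    """Collect (level, heading text) pairs in document order."""
    headings = []
    for line in document.splitlines():
        level = heading_level(line)
        if level is not None:
            headings.append((level, line.strip()))
    return headings
```

Under these assumed rules, a line such as "1.1 Motivation" would be reported as a level 2 heading, while an ordinary sentence ending in a full stop would be skipped. Short body lines that happen to start with a number would be misclassified, which is one reason a real rule set needs more evidence than this sketch uses.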



ABOUT THE AUTHORS

Bello Aliyu Muhammad
Holds a PhD in Text and Data Mining from Coventry University, UK. Has published a number of high-quality conference and journal papers. Research interests include machine learning, natural language processing and information retrieval.

Rahat Iqbal
A Reader in human-centred technology at Coventry University, UK. A particular focus of his research is balancing technological factors with human aspects, so as to explore the implications for better design of collaborative computing and information retrieval systems.

Anne James
The research interests of Professor Anne James are in the general area of creating distributed systems to meet new and unusual data and information challenges.

Dianabasi Nkanta
Research interests include “Artificial Intelligence for Control”, “Enterprise Systems Development”, “Advanced Computer Architecture” and “Software Engineering”.


About IJCSI

IJCSI is a refereed open access international journal for scientific papers dealing in all areas of computer science research...
