How ML algorithms can aid in teaching English writing

SmsLova Daria
Image credit: Photo by Sharon McCutcheon on Unsplash

Teaching writing in English can be demanding. Criteria are vague, and proofreading is costly in the time and energy that goes into writing comments. Most teachers can only dream of an assistant they could trust with grading, to say nothing of students, whose writing would benefit enormously from immediate feedback. While some research has been done in this field, few open-source materials are available.

This work in progress explores how we can integrate AI advancements into teaching, and which educational tasks could be delegated to technology to meet the challenges of the contemporary world. The project is open source and lives in my GitHub repository, which gathers ideas and practices from the NLP (Natural Language Processing) field that could be useful for applied linguistics tasks such as language teaching. I encourage you to watch this video, where I show and explain every detail of the result of this work.

Having studied the existing research, products with similar functionality, and the available open-source libraries, I outlined the following MVP for the project:

  • Find spelling and grammar mistakes in the input text, highlight them, suggest a correction and an explanation, and calculate the percentage of mistakes in the text
  • Specify the topics an English learner covers while studying, along with the level at which a particular topic is studied (as proposed by the British Council Core Inventory); predict the topic of a writing sample and assign a level in accordance with the topic's complexity
  • Define the minimum vocabulary learners should know at each level in order to calculate lexical density (the ratio of words from each level to the total number of words)
  • Collect or find a database of graded works to train a model that can predict a grade
  • Summarize the results and visualize the user's overall level of English
  • Create an interface that a teacher or student could use.
The home page of the finalized version of the writing assessment tool.
1. Checking spelling and grammar

There are various spell checkers available for Python, ranging from Norvig's classic algorithm to more sophisticated options: a contextual BERT model, the fast and lightweight JamSpell model, PyEnchant (a Python wrapper for the Enchant spell-checking library), TextBlob, and NeuSpell, a neural spelling-correction toolkit.

Below is a comparative table of the used spell checkers:

Despite the seemingly impressive results shown by some of these libraries, in most cases the suggested corrections were odd or the speed was insufficient.

Some corrections suggested by the BERT model.

On balance, my first choice was JamSpell, then PyEnchant, and finally Norvig's algorithm. However, at the deployment stage there were issues with building a wheel (JamSpell) and with installing additional libraries (PyEnchant), which would hinder the further distribution of the product.
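For illustration, a dictionary-based pass with PyEnchant could look roughly like the sketch below. This is not the project's spell_check.py, just a minimal example assuming the en_US dictionary is installed; it returns the same kind of mistake dictionary described later (word, correction, position, explanation):

import enchant

dictionary = enchant.Dict("en_US")

def spelling_suggestions(text: str) -> dict:
    # return {misspelled word: (best suggestion, position in text, explanation)}
    mistakes = {}
    for raw_word in text.split():
        word = raw_word.strip("?:!.,;")
        if word and not dictionary.check(word):
            suggestions = dictionary.suggest(word)
            correction = suggestions[0] if suggestions else ''
            mistakes[word] = (correction, text.find(word), 'Possible spelling mistake found')
    return mistakes

print(spelling_suggestions("I recieved your mesage yesterday."))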

Spelling libraries have obvious limitations: they ignore context, provide no explanations, and cannot pinpoint grammar errors. As a result, I added grammar-checking libraries to improve accuracy.

There is a limited number of open-source options available. The best one is the Python wrapper for LanguageTool, whose results are impressive, but the speed is terrible (approximately 7 seconds per sentence) because of the external connection to their server. There is also a limit on the number of requests per IP address per day, which makes it impossible to use at a larger scale, and you cannot extend the pool of detected mistakes manually. Consequently, I paired it with the Nlprule library, which is built on the same LanguageTool rules but does not call external servers. Overall, using the three libraries I was able to reach 96% accuracy at a speed of 0.87 seconds per sentence.
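As a rough sketch of the grammar side (not the project's grammar_check.py itself), the LanguageTool wrapper exposes matches with an offset, a message, and replacement candidates, which map naturally onto the mistake dictionary described below; the mapping shown here is my own illustration:

import language_tool_python

tool = language_tool_python.LanguageTool('en-US')

def grammar_suggestions(text: str) -> dict:
    # return {incorrect fragment: (correction, position in text, explanation)}
    mistakes = {}
    for match in tool.check(text):
        fragment = text[match.offset:match.offset + match.errorLength]
        correction = match.replacements[0] if match.replacements else ''
        mistakes[fragment] = (correction, match.offset, match.message)
    return mistakes

print(grammar_suggestions("She go to school every days."))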

The list of files for this task:

  • spell_check.py (the spell checkers I tried; they return a dictionary of mistakes: the word, the correction, its position in the text, and the explanation 'Possible spelling mistake found')
  • grammar_check.py (the LanguageTool and Nlprule checkers; they return a dictionary of mistakes: the word, the correction, its position in the text, and an explanation that is custom for each error)
  • write_results.py (the first function takes the misspelled words from the dictionary of mistakes and provides in-text annotation, highlighting the incorrect words in red; the second provides the corrections in green together with an explanation of each mistake)
  • grammar_spell.py (combines the grammar and spell checkers and the annotation of mistakes)

Here is the code snippet that combines the models for text evaluation, writes the results (highlighting mistakes, suggesting a correction and an explanation), and handles the exceptions raised when the LanguageTool server is not responding:

def example_check(text: str):
    suggestions_data = {}
    checkers = [
        spell_check_js,
        grammar_check,
        nlp_rule_check
    ]
    attempt_check_count = len(checkers)

    # run every checker, collecting suggestions and counting the ones that failed
    for checker in checkers:
        try:
            result = checker(text)
            suggestions_data.update(result.data)
        except Exception as error:
            logger.error("Something is going wrong...", exc_info=error)
            attempt_check_count -= 1

    if attempt_check_count:
        result = SuggestCorrection(
            data=suggestions_data,
            percentage_of_incorrect=_calculate_actual_percentage_of_incorrect(text, suggestions_data.keys())
        )
        write_results(text, result.percentage_of_incorrect, result.data)
    else:
        st.write("Sorry, temporarily unavailable")
An example of the checked text a user would see with spelling and grammar mistakes highlighted in red.
An example of suggested corrections and explanations a user would see.

2. Lexical and topic complexity

In order to measure how advanced a student's vocabulary is, we need to define the topic and its difficulty, as well as the ratio of words and phrases from each level to the whole body of the text, i.e., its lexical density.

Defining topic

In accordance with the CEFR (Common European Framework of Reference for Languages) level descriptors, the British Council Core Inventory, and the ACTFL (American Council on the Teaching of Foreign Languages) can-do statements, we can conclude that topics are divided by level: vocabulary for everyday situations (such as family, hobbies and pastimes, work and jobs, shopping, holidays, and so on) gradually evolves into more abstract topics (such as scientific developments, news, and current affairs) or examples of specialized language (technical and legal language, media); the full list and level correlation can be found in the British Council EAQUALS Core Inventory. In total, 15 topics are delineated. I created a dataset with the topics as tags and web-scraped sample texts, then built a scikit-learn model using a TF-IDF vectorizer for the texts and a MultiLabelBinarizer for the tags, which is used for topic prediction.

Topic_recognition_tfidf.py builds the model, and topic_recognition.py predicts one of the 15 topics for the input text.
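Topic_recognition_tfidf.py is not reproduced here, but the idea can be sketched with scikit-learn as below; the handful of training samples stands in for the web-scraped dataset, and predict_topic returns the comma-separated string that topic_level later splits:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# toy stand-in for the web-scraped dataset: texts tagged with one or more topics
train_texts = [
    "My mother and father both work, and my little sister is still at school.",
    "Last summer we travelled to the seaside for our holidays.",
    "I bought a new jacket and some shoes at the shopping centre.",
    "My brother goes shopping with our grandmother every Saturday.",
]
train_tags = [["family"], ["holidays"], ["shopping"], ["family", "shopping"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(train_tags)  # one binary column per topic tag

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("model", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
clf.fit(train_texts, y)

def predict_topic(text: str) -> str:
    pred = clf.predict([text])
    return ", ".join(mlb.inverse_transform(pred)[0])

print(predict_topic("We are going shopping for presents for the whole family."))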

This code snippet shows the assignment of the level using the recognized topic:

def topic_level(text):
    # interpret topic recognition results: map each predicted topic to a level and average them
    try:
        levels = []
        predictions = topic_recognition.predict_topic(text)
        topics = predictions.split(', ')
        for topic in topics:
            if topic == 'family':
                level = 1
            elif topic in ('hobbies and pasttimes', 'holidays', 'shopping', 'work and jobs'):
                level = 2
            elif topic in ('education', 'leisure activities'):
                level = 2.5
            elif topic in ('books and literature', 'arts', 'media', 'news, lifestyles and current affairs', 'film'):
                level = 4
            elif topic in ('scientific developments', 'technical and legal'):
                level = 5
            else:
                level = 0
            levels.append(level)
        level_topic = round_num(sum(levels) / len(levels))
        return level_topic
    except Exception:
        pass

Estimating lexical density

In our case, we take lexical density to be the ratio of words from each level in the written text, as a measurement of language complexity. Thus, we need to define a list of vocabulary units for each of the six levels. The A1-B1 lists were taken from Cambridge English, and the B2-C2 lists from Oxford University Press, toe.gr, and the English Vocabulary Profile. The word lists were then transformed into sets, and the intersection with the previous level was excluded.

Number of words/word units (phrasal verbs, phrases with different meanings, set expressions, e.g. get off on the wrong foot, write off, etc.) per level:

First, the raw input text is split into tokens and then lemmatized. We can use either NLTK or spaCy for this task:

import nltk
nltk.download('punkt')    # tokenizer data required by word_tokenize
nltk.download('wordnet')  # lemmatizer data
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

wordnet_lemmatizer = WordNetLemmatizer()
punctuations = "?:!.,;"

sentence_words = word_tokenize(text)
lemmas = []
for word in sentence_words:
    if word in punctuations:
        continue  # skip punctuation tokens instead of lemmatizing them
    lemmas.append(wordnet_lemmatizer.lemmatize(word, pos="v"))

The spaCy version requires fewer lines:

import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp(text)
lemmas = [token.lemma_.lower() for token in doc if not token.is_punct]

Then we read the vocabulary lists from files:

import os

dict_vocab = {}
for filename in os.listdir("vocabulary lists"):
    if filename.endswith("vocabulary list.txt"):
        with open("vocabulary lists/" + filename, 'r') as d:
            dict_vocab[filename[:2]] = d.read().split('\n')

# lower-case the C1 entries (str.lower returns a new string, so rebuild the list)
dict_vocab['C1'] = [word.lower() for word in dict_vocab['C1']]
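The code that follows relies on vocab_lists_sets, a list holding one set of words per level with the lower-level entries removed. A sketch of how it could be built from dict_vocab, assuming the two-character file prefixes are the level codes A1-C2:

# build one set per level and drop words already introduced at a lower level
level_order = ["A1", "A2", "B1", "B2", "C1", "C2"]
vocab_lists_sets = []
known_so_far = set()
for level in level_order:
    level_set = {word.lower().strip() for word in dict_vocab[level] if word.strip()}
    level_set -= known_so_far
    vocab_lists_sets.append(level_set)
    known_so_far |= level_set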

After excluding the intersections with the lower-level lists, we find the items each level set has in common with the input text and calculate the lexical density:

# count lexical density for each level (this code runs inside _lang_level)
set_essay = set(lemmas)
words_found = []
words_counts = []
for level_set in vocab_lists_sets:
    words_from_dict = level_set.intersection(set_essay)
    words_found.append(words_from_dict)

for found in words_found:
    words_counts.append(len(found))

total_count = len(lemmas)
lex_dens_list = []
for word_count in words_counts:
    lex_density = round((word_count / total_count) * 100, 2)
    lex_dens_list.append(lex_density)
return lex_dens_list, words_found

Then we calculate the level as the average of the two levels with the highest lexical density. This step is followed by writing the results: the percentage of words from each level, followed by the list of those words.

from operator import itemgetter

# calculate level using lexical density
def lex_dens_level(text):
    lex_dens_list, words_found = _lang_level(text)
    # sort densities in descending order, keeping the original level indices
    indices, L_sorted = zip(*sorted(enumerate(lex_dens_list), key=itemgetter(1), reverse=True))
    level_ld = (indices[0] + indices[1]) / 2
    return level_ld

def write_lex_density(text):
    level_ld = lex_dens_level(text)
    level_topic = topic_level(text)

    if level_topic > 0:
        level_av = round_num((level_ld + level_topic) / 2)
    else:
        level_av = round_num(level_ld)

    lex_dens_list, words_found = _lang_level(text)
    st.write('Based on your topic and vocabulary complexity, your level of English is {level_av}.'.format(level_av=level_av))
    for i in range(len(lex_dens_list)):
        st.write('Percentage of words from level {num} in your text is: {lex_density}% : {words}'.format(num=i + 1, lex_density=lex_dens_list[i], words=words_found[i]))
An example of predicted topic, overall level and lexical density feedback a user would see.

Lexical_complexity_level_count.py assigns a level in accordance with the predicted topic and calculates the lexical density.

3. Grade prediction

Grades are assigned in accordance with the requirements of the task; a grade indicates compliance with the topic and style, use of language, and mechanics. In order to implement such a prediction, I needed a large dataset (at least 1,000 works) of graded essays. My own dataset of 150 of my students' works did not provide an accurate prediction. Hence, I used the Hewlett Foundation dataset of 1,982 essays provided for an automated essay scoring competition, together with open-source solutions available on Kaggle. Each essay contains 150-550 words, has been checked by several specialists, and given a mark from 1 to 6. Grade prediction thus becomes a statistical problem, as the model deals with identifying the features of a "good", "medium-level", and "bad" essay.

In predict_grade.py we load the trained TensorFlow model saved in h5 format and get a prediction (from 0 to 100), which is then converted into a letter grade:

def grade_converter(preds):
    # map a 0-100 score to a letter grade
    if preds >= 93:
        grade = 'A+'
    elif preds >= 85:
        grade = 'A'
    elif preds >= 70:
        grade = 'B'
    elif preds >= 65:
        grade = 'C+'
    elif preds >= 60:
        grade = 'C'
    elif preds >= 55:
        grade = 'D+'
    elif preds >= 50:
        grade = 'D'
    else:
        grade = 'F'
    return grade
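For instance, assuming the model returned these raw scores, the conversion would look like this:

# hypothetical raw scores from the model and the letter grades they map to
for score in (96.0, 78.4, 61.2, 43.0):
    print(score, '->', grade_converter(score))
# 96.0 -> A+, 78.4 -> B, 61.2 -> C, 43.0 -> F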
An example of predicted grade a user would see.

However, my application does not offer any prompts to guide the writing, so we should exclude the topic and style from the evaluation. As a result, I believe that defining the level at which a student writes is a fairer assessment strategy in this case. It could be used as an entry placement test or as a diagnostic test at the end of a unit of study.

The level is calculated as the average of all the parameters used: topic and lexical complexity, grade prediction, and two other models trained on the same Hewlett Foundation dataset. I used this open-source solution and my slightly altered version of it (models_grade_prediction.py). The overall grade is computed in calculate_final_level.py.
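calculate_final_level.py is not shown in full; a minimal sketch of the averaging step, with hypothetical parameter names for the individual components (all assumed to be on the same numeric level scale), could look like this:

def calculate_final_level(lexis_topic_level, grade_level, model_a_level, model_b_level):
    # average whichever components could actually be computed; skip zeros/None
    components = [c for c in (lexis_topic_level, grade_level, model_a_level, model_b_level) if c]
    return round(sum(components) / len(components), 1) if components else 0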

4. Summary of results and visualization

To display the results together with the user's level, I created a data frame with the levels (A1-C2) and descriptions based on can-do statements, the competencies a language learner has at each of the six levels of English. The data is shown with an Altair chart, which Streamlit supports natively. The user's predicted level is marked in red, and the description of each level appears as a tooltip under the mouse cursor.
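A sketch of what that data frame might look like; the column names mirror the ones used in the chart snippet below, and the descriptions are abridged CEFR can-do statements:

import pandas as pd

levels = ["A1", "A2", "B1", "B2", "C1", "C2"]
descriptions = [
    "Can understand and use familiar everyday expressions and very basic phrases.",
    "Can communicate in simple, routine tasks on familiar topics.",
    "Can deal with most situations likely to arise while travelling.",
    "Can interact with a degree of fluency and spontaneity.",
    "Can use language flexibly and effectively for social, academic and professional purposes.",
    "Can understand with ease virtually everything heard or read.",
]
df = pd.DataFrame({
    "CEFR level": levels,
    "level": levels,
    "number level": [1, 2, 3, 4, 5, 6],
    "level description": descriptions,
})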

The code snippet that builds the graph using the data frame with each level description:

import altair as alt

graph = alt.Chart(df).mark_bar().encode(
    x='level',
    y='number level',
    column='CEFR level',
    color=alt.condition(
        alt.FieldEqualPredicate(field='number level', equal=grade_final),
        alt.value('red'),
        alt.value('silver')
    ),
    tooltip=['level description'],
).interactive()
An example of overall calculated level and its description a user would see.

To build the interface, we create Streamlit_app.py, which contains the text area, the file-upload area, buttons, and sliders. You can read more about building a Streamlit app in my article. Then you run it in the terminal:

streamlit run Streamlit_app.py
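For reference, a stripped-down sketch of what Streamlit_app.py might contain; example_check and write_lex_density are the functions shown earlier, while the widget labels and layout here are illustrative:

import streamlit as st

st.title("Writing assessment tool")

# user input: either typed text or an uploaded .txt file
text = st.text_area("Paste your text here")
uploaded = st.file_uploader("...or upload a .txt file", type="txt")
if uploaded is not None:
    text = uploaded.read().decode("utf-8")

if st.button("Check my writing") and text:
    example_check(text)       # spelling and grammar feedback
    write_lex_density(text)   # topic, level and lexical density feedback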

5. Conclusion

I believe that existing NLP tools and machine learning algorithms, combined with the methodology of teaching foreign languages, could make a breakthrough in the field.

In my view, this tool shows pretty accurate results. It would be useful for students who are interested in languages, and it could aid teachers in the evaluation process.

Potential areas for development:

  • improve the grade-prediction models
  • improve the grammar-assessment models
  • add an option to create custom word lists to check students' usage of target vocabulary
  • add sentiment analysis
  • add text summarization
  • add a collocation check (whether words combine naturally in a sentence)
  • add various prompts and tailor the task for a closer evaluation

References:

Taghipour, K., & Ng, H. T. (2016). A neural approach to automated essay scoring. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.18653/v1/d16-1193

Alikaniotis, D., Yannakoudakis, H., & Rei, M. (2016). Automatic text scoring using neural networks. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). https://doi.org/10.18653/v1/p16-1068

Zock, Michael. “Computational Linguistics and Its Use in Real World.” Proceedings of the 16th Conference on Computational Linguistics, 1996, doi:10.3115/993268.993348.

Kosslyn, S.M. (2017) The Science of Learning: Mechanisms and Principles. Building the International University. doi: 10.7551/mitpress/9780262037150.003.0011

“ACTFL.” www.actfl.org/resources/world-readiness-standards-learning-languages.

“The CEFR Levels.” Common European Framework of Reference for Languages (CEFR), www.coe.int/en/web/common-european-framework-reference-languages/level-descriptions.

British Council / EAQUALS Core Inventory for General English. (n.d.). Retrieved September 30, 2021, from https://www.eaquals.org/wp-content/uploads/EAQUALS_British_Council_Core_Curriculum_April2011.pdf.

Ramalingam, V. V., et al. “Automated Essay Grading Using Machine Learning Algorithm.” Journal of Physics: Conference Series, vol. 1000, 2018, p. 012030, doi:10.1088/1742-6596/1000/1/012030.

Leeson, William, et al. “Natural Language Processing (NLP) in Qualitative Public Health Research: A Proof of Concept Study.” International Journal of Qualitative Methods, vol. 18, 2019, p. 160940691988702., doi:10.1177/1609406919887021.

“Lexical Density.” Wikipedia, Wikimedia Foundation, 10 Nov. 2020, en.wikipedia.org/wiki/Lexical_density.

Wood, Peter. “QuickAssist: Reading and Learning Vocabulary Independently with the Help of CALL and NLP Technologies.” Second Language Teaching and Learning with Technology: Views of Emergent Researchers, 2011, pp. 29–43., doi:10.14705/rpnet.2011.000005.

Joseph, Samuel R. H., and Maria Uther. “Mobile Devices For Language Learning: Multimedia Approaches.” Research and Practice in Technology Enhanced Learning, vol. 04, no. 01, 2009, pp. 7–32., doi:10.1142/s179320680900060x.


SmsLova Daria

An English teacher who adores kids during the day and an aspiring ML engineer the rest of the time. My passions are languages, technology, NLP, and lifelong learning.