Name Classification (PyData)
March 13, 2023

1 Name classification: how to use ChatGPT and how it compares to machine learning language models

Can you classify a name as belonging to a person or a company? Some are easy: "Google" is a company and "Barack Obama" is a person. Some are trickier, like "John Deere". With a labelled dataset, we can train a machine learning model to classify names into entities, a task closely related to Named Entity Recognition (NER). NER proper is generally more challenging than classifying standalone names, as it also involves detecting entities inside longer text.

When I heard a friend was working on such a problem, I went straight to ChatGPT to look for answers. I soon realised that ChatGPT itself can do a great job of classifying names into entities with just a couple of examples (one-shot learning). Now, if we actually productionize that using ChatGPT's API, how does it compare to more traditional alternatives? In NLP, "traditional" might mean a model from just 5 years ago!

In this post, I explore four ways to classify names into person or company:

1. Baseline using word frequency and logistic regression: a typical baseline for text classification
2. FastAI LSTM fine-tuning (whole network): simple fine-tuning with a few lines of code
3. Hugging Face DistilBERT fine-tuning (head only): more involved neural network training using PyTorch
4. ChatGPT API one-shot learning: only prompt engineering and post-processing are needed

I use two public datasets available on Kaggle: the IMDb Dataset for people's names and the 7+ Million Company Dataset for companies. Those datasets are large, with almost 20 million names! The choice of datasets was inspired by the open-source business individual classifier by Matthew Jones, which achieves 95% accuracy on this name classification task.

For simplicity, I sample 1M names for training and 100k for testing, with a 50-50 balance between companies and people. Since we have balanced classes and ChatGPT cannot produce scores or probabilities (so we cannot use ROC AUC or average precision, definitely a big limitation of ChatGPT), I decided to use accuracy as the main metric.

[2]: import os
     import gc
     import time

     import pandas as pd
     import numpy as np
     import requests
     import plotly.express as px
     from tqdm.auto import tqdm
     import Levenshtein as lev
     from joblib import Parallel, delayed

     from sklearn.model_selection import train_test_split
     from sklearn.linear_model import LogisticRegression
     from sklearn.metrics import accuracy_score, f1_score
     from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

     from fastai.text.all import TextDataLoaders, text_classifier_learner, AWD_LSTM, accuracy

     import torch
     from torch.utils.data import DataLoader
     from torch.optim import AdamW
     from transformers import AutoTokenizer
     from transformers import AutoModelForSequenceClassification

     import openai

1.1 Datasets

First, we will download the datasets from Kaggle and do some basic preprocessing. To reproduce the results, you will need a Kaggle account and its command line installed locally.
You need to add your API key and username to a file kaggle.json , which is found in the directory defined by the environment variable called KAGGLE_CONFIG_DIR [14]: !kaggle datasets download -d peopledatalabssf/free-7-million-company-dataset Downloading free-7-million-company-dataset.zip to /notebooks 95%| | 265M/278M [00:03<00:00, 93.1MB/s] 100%|| 278M/278M [00:03<00:00, 80.1MB/s] [15]: !kaggle datasets download -d ashirwadsangwan/imdb-dataset Downloading imdb-dataset.zip to /notebooks 99%|| 1.05G/1.06G [00:06<00:00, 153MB/s] 100%|| 1.06G/1.06G [00:06<00:00, 163MB/s] [16]: !unzip -j free-7-million-company-dataset.zip Archive: free-7-million-company-dataset.zip inflating: companies_sorted.csv [17]: !unzip -j imdb-dataset.zip '*name*' 2 Archive: imdb-dataset.zip inflating: data.tsv [18]: !rm free-7-million-company-dataset.zip imdb-dataset.zip We do some preprocessing, inspired by the open-source repo we got the datasets inspiration from: We lower case the people dataset since the companies dataset is all lower case (otherwise I’d suggest keeping the original case, as that can be informative). We remove odd characters and unnecessary spaces. We remove empty and null rows. [19]: companies = pd.read_csv("companies_sorted.csv", usecols=["name"]) people = ( pd.read_csv("data.tsv", sep=" \t ", usecols=["primaryName"]) # Since the companies are all lower case, we do the same here to be fair .assign(name= lambda df: df.primaryName.str.lower()).drop("primaryName",␣ ↪ axis=1) ) df = pd.concat( (companies.assign(label="company"), people.assign(label="person")) ).sample(frac=1.0, random_state=42) invalid_letters_pattern = r"""[^a-z0-9\s\'\-\.\&]""" multiple_spaces_pattern = r"""\s+""" df["clean_name"] = ( df.name.str.lower() .str.replace(invalid_letters_pattern, " ", regex= True ) .str.replace(multiple_spaces_pattern, " ", regex= True ) .str.strip() ) df = df[ ~df.clean_name.isin(["", "nan", "null"]) & ~df.clean_name.isna() & ~df. ↪ label.isna() ][["clean_name", "label"]] df.head(10) [19]: name label 10103038 jinjin wang person 5566324 native waterscapes, inc. company 8387911 jeff killian person 6783284 lisa mareck person 9824680 pablo sánchez person 6051614 dvc sales company 3 6479728 orso balla person 4014268 two by three media company 2093936 house of light and design company 11914237 hamdy faried person [21]: df.label.value_counts() [21]: person 12344506 company 7173422 Name: label, dtype: int64 [22]: train_df = pd.concat( ( df[df.label == "company"].sample(n=1_100_000 // 2), df[df.label == "person"].sample(n=1_100_000 // 2), ) ) train_df, test_df = train_test_split(train_df, test_size=100_000,␣ ↪ random_state=42) # Saving the processed dataframes locally for quicker iterations train_df.to_csv("train_df.csv", index= False ) test_df.to_csv("test_df.csv", index= False ) # Freeing up the memory used by the dataframes del companies, people, df, train_df, test_df gc.collect() [22]: 22 [3]: # Just run from here if the datasets already exist locally train_df = pd.read_csv("train_df.csv") test_df = pd.read_csv("test_df.csv") train_df.shape, test_df.shape [3]: ((1000000, 2), (100000, 2)) Now, we have one single dataset for training with 500k people and 500k companies and one single test set with 50k people and 50k companies. 1.2 Exploratory data analysis Before we actually get to the fun part, let’s understand the data we have first. I have two hypotheses to explore: 1. Do we see a different distribution of words per class? 
We expect some words like "ltd" to be present only in companies and words like "john" to be over-represented in people's names.
2. Does sentence length vary by class? We expect a higher range for companies than for people, as company names go from just two characters like "EY" to mouthfuls like "National Railroad Passenger Corporation, Amtrak". Alternatively, we could look at the number of words per sentence, since most Western names are around 3 words. Anyway, beware the Falsehoods Programmers Believe About Names.

[5]: words_df = (
         train_df.assign(word=train_df.clean_name.str.split(" +"))
         .explode("word")
         .groupby(["word", "label"])
         .agg(count=("clean_name", "count"))
         .reset_index()
     )
     total_words = words_df["count"].sum()
     words_df = words_df.assign(freq=words_df["count"] / total_words)

     person_words = (
         words_df[words_df.label == "person"].sort_values("freq", ascending=False).head(25)
     )
     company_words = (
         words_df[words_df.label == "company"].sort_values("freq", ascending=False).head(25)
     )

[10]: fig = px.bar(
          person_words,
          x="word",
          y="freq",
          title="Frequency of the top 25 people words",
          height=400,
          width=1000,
      )
      fig.update_layout(yaxis_tickformat=".2%")
      fig.show()

[9]: fig = px.bar(
         company_words,
         x="word",
         y="freq",
         title="Frequency of the top 25 company words",
         color_discrete_sequence=px.colors.qualitative.Plotly[1:],
         height=400,
         width=1000,
     )
     fig.update_layout(yaxis_tickformat=".2%")
     fig.show()

We can see our hypothesis was right: some words are quite predictive of being a person or a company name. Note that there is no intersection between the top 25 words for people and for companies. This insight implies that a simple but effective baseline would be a model built on top of word counts, which is what we do next. However, we have a long tail of possible names, so we have to go beyond the most common ones.

Another way to see how the distributions differ is by sentence length:

[11]: fig = px.histogram(
          train_df.assign(sentence_len=train_df.clean_name.str.len()),
          x="sentence_len",
          color="label",
          opacity=0.5,
          height=400,
          width=800,
      )
      fig.show()

Company names tend to be longer on average and have a higher variance, but interestingly both distributions peak at 13 characters. We could use sentence length as a feature, but let's stick to word counts for now.

1.3 Baseline: Word frequency + Logistic regression

Let's start with a simple and traditional NLP baseline: word frequency and logistic regression. Alternatively, we could use Naive Bayes, but I prefer logistic regression for its greater generality and easier interpretation as a linear model.

Typically, we use TF-IDF instead of raw word counts for text classification. Since names are quite short and repetitive words (e.g. "john") are predictive, I don't believe it is useful here. Indeed, a quick test showed no improvement in accuracy from using TF-IDF. I also tried character n-grams, which increased preprocessing time with slightly worse results.

[39]: text_transformer = CountVectorizer(analyzer="word", max_features=10000)
      X_train = text_transformer.fit_transform(train_df["clean_name"])
      X_test = text_transformer.transform(test_df["clean_name"])

      logreg = LogisticRegression(C=0.1, max_iter=1000).fit(
          X_train, train_df.label == "person"
      )
      preds = logreg.predict(X_test)
      baseline_accuracy = accuracy_score(test_df.label == "person", preds)
      print(f"Baseline accuracy is {round(100 * baseline_accuracy, 2)}%")

Baseline accuracy is 89.49%

89.5% accuracy is not bad for a linear model!
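Since one reason for picking logistic regression was interpretability, a quick way to sanity-check the baseline is to look at the learned coefficients. The cell below is my own addition, not part of the original analysis: it reuses the fitted text_transformer and logreg from above and assumes a scikit-learn version recent enough to have get_feature_names_out. Positive coefficients push a name towards "person", negative ones towards "company", so I'd expect tokens like "inc" or "ltd" near the company end.

[ ]: # Sketch: inspect the most person-like and company-like words learned by the baseline
     coef_df = pd.DataFrame(
         {
             "word": text_transformer.get_feature_names_out(),
             "coef": logreg.coef_[0],  # positive = pushes towards "person"
         }
     )
     print("Most person-like words:")
     print(coef_df.sort_values("coef", ascending=False).head(10))
     print("Most company-like words:")
     print(coef_df.sort_values("coef").head(10))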
Remember, since the datasets are balanced, a baseline accuracy without any information would be 50%. Whether 89.5% is good or bad in an absolute sense depends on the actual application of the model. It also depends on the distribution of the names this model would actually see in production. The datasets we used are quite general, containing all kinds of people and company names; in a real application, the names could be more constrained (e.g. only coming from a particular country).

Now, let's see what mistakes the model makes (error analysis). It's always interesting to look at the examples where the model makes its worst mistakes. With a tabular dataset, it might be difficult to interpret what is going on but, for perceptual data a human can understand (text, image, sound), this leads to invaluable insights into the model.

[40]: test_df["proba_person"] = logreg.predict_proba(X_test)[:, 1]
      test_df["abs_error"] = np.where(
          test_df.label == "person", 1 - test_df.proba_person, test_df.proba_person
      )
      test_df.sort_values("abs_error", ascending=False)[
          ["clean_name", "label", "proba_person"]
      ].head(10)

[40]:                                            clean_name    label  proba_person
      60581                                   co co mangina   person      0.000206
      49398             buster benton and the sons of blues   person      0.000984
      6192                          best horizon consulting   person      0.001613
      83883   les enfants du centre de loisirs de chevreuse   person      0.002633
      84646                     manuel antonio nieto castro  company      0.997350
      32669                                    chris joseph  company      0.996298
      8545                          hub kapp and the wheels   person      0.004568
      77512                              michael simon p.a.  company      0.994109
      71392                         dylan ryan teleservices  company      0.993017
      64777         netherlands national field hockey team   person      0.007220

We can see that the mistakes are mostly understandable: we have many companies named just like people. How could the model know "chris joseph" is a company and not a person? The only way would be with information not available in the data I provided for its learning. We also see mislabelings in the people dataset: "netherlands national field hockey team" and "best horizon consulting" do not sound like people's names! This implies that a high-leverage activity here would be cleaning the people dataset. If you want to make the data cleaning process sound sexier, just call it data-centric AI (just kidding: data-centric AI is actually a good framework for real-life machine learning applications where, in almost all cases, data trumps modelling).

1.4 FastAI fine-tuning

For the first more complex machine learning model, let's start with FastAI due to its simple interface. Following the suggestion of this article, I use an AWD_LSTM model that was pre-trained as a language model predicting the next word, with Wikipedia as the training corpus. Then, I fine-tune the model on our classification problem.

FastAI fine-tuning works in the following way: in the first epoch, it only trains the head (the newly inserted neural network on top of the pre-trained language model); then, for all subsequent epochs, it trains the whole model together. FastAI uses many tricks to make the training more effective, all wrapped in a simple function call. While convenient, this makes understanding what is going on behind the scenes, and any customization, more difficult.

[3]: fastai_df = pd.concat((train_df.assign(valid= False ), test_df.
↪ assign(valid= True ))) dls = TextDataLoaders.from_df( fastai_df, text_col="clean_name", label_col="label", valid_col="valid" ) learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy) <IPython.core.display.HTML object> <IPython.core.display.HTML object> [4]: learn.fine_tune(5, 1e-2) <IPython.core.display.HTML object> <IPython.core.display.HTML object> <IPython.core.display.HTML object> <IPython.core.display.HTML object> Now we ended with 97.1% accuracy, almost 8 percentage points higher than our baseline! Not bad for a few lines of code and one hour of GPU time. Can we do better? Let’s try using a transformer. 1.5 Hugging Face DistilBERT fine-tuning Hugging Face offers hundreds of possible deep learning models for inference and fine-tuning. I chose DistilBERT due to time and GPU memory constraints. By default, Hugging Face trainer will fine-tune all the weights of the model, but now I just want to train the classification head, which is a two-layer fully-connected neural network (aka MLP). The reason is twofold: 1. We’re dealing with a simple problem and 2. I don’t want to leave the model training for too long to make reproducibility simpler and reduce GPU costs. I worked backwards from the previous results: Since FastAI took roughly one hour, I wanted to use the same GPU time budget here. To only train the classification head, I had to use the PyTorch interface, which allows for more flexibility. First, I download DistilBERTs tokenizer, apply it to our dataset, then download the model itself, mark all layers as requiring no gradient (i.e. not trainable), and then train 9 [13]: # Hugging Face PyTorch parameters batch_size = 32 num_epochs = 3 learning_rate = 3e-5 [5]: torch.cuda.empty_cache() [6]: tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased") Downloading tokenizer_config.json: 0%| | 0.00/28.0 [00:00<?, ?B/s] Downloading config.json: 0%| | 0.00/483 [00:00<?, ?B/s] Downloading vocab.txt: 0%| | 0.00/226k [00:00<?, ?B/s] Downloading tokenizer.json: 0%| | 0.00/455k [00:00<?, ?B/s] [7]: tokenized_train_df = tokenizer( text=train_df["clean_name"].tolist(), padding= True , truncation= True ) tokenized_test_df = tokenizer( text=test_df["clean_name"].tolist(), padding= True , truncation= True ) [8]: # Since the dataset created by the tokenizers include both the tokens and the ␣ ↪ attention masks, # we need to use a custom Dataset for PyTorch to feed the batches correctly class NamesDataset (torch.utils.data.Dataset): def __init__(self, encodings, labels): self.encodings = encodings self.labels = labels def __getitem__(self, idx): item = {key: torch.tensor(val[idx]) for key, val in self.encodings. ↪ items()} item["labels"] = torch.tensor(self.labels[idx]) return item def __len__(self): return len(self.labels) [9]: train_dataset = NamesDataset( tokenized_train_df, (train_df.label == "person").astype(int) ) test_dataset = NamesDataset(tokenized_test_df, (test_df.label == "person"). 
↪ astype(int)) 10 [11]: train_dataloader = DataLoader(train_dataset, shuffle= True ,␣ ↪ batch_size=batch_size) test_dataloader = DataLoader(test_dataset, batch_size=batch_size) [12]: model = AutoModelForSequenceClassification.from_pretrained( "distilbert-base-uncased", num_labels=2 ) Downloading pytorch_model.bin: 0%| | 0.00/256M [00:00<?, ?B/s] Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias'] - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.weight', 'classifier.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. [13]: for param in model.distilbert.parameters(): param.requires_grad = False [14]: device = torch.device("cuda") if torch.cuda.is_available() else torch. ↪ device("cpu") model.to(device) optimizer = AdamW(model.parameters(), lr=learning_rate) [15]: def eval_acc(eval_df): model.eval() acc = [] probas = [] with torch.no_grad(): for batch in tqdm(eval_df, desc="eval progress"): batch = {k: v.to(device) for k, v in batch.items()} pred = model(**batch) acc.append((pred["logits"][:, 1] > 0) == batch["labels"]) 11 probas.append(torch.nn.functional.softmax(pred["logits"], dim=1)[:,␣ ↪ 1]) acc = 100 * torch.cat(acc).cpu().numpy().mean() probas = torch.cat(probas).cpu().numpy() return acc, probas [16]: progress_bar = tqdm(range(len(train_dataloader) * num_epochs), desc="training␣ ↪ progress") for epoch in range(num_epochs): model.train() for batch in train_dataloader: batch = {k: v.to(device) for k, v in batch.items()} outputs = model(**batch) loss = outputs.loss loss.backward() optimizer.step() optimizer.zero_grad() progress_bar.update(1) test_acc, test_preds = eval_acc(test_dataloader) print(f"epoch { epoch } : test accuracy is { test_acc } %") training progress: 0%| | 0/93750 [00:00<?, ?it/s] eval progress: 0%| | 0/3125 [00:00<?, ?it/s] epoch 0: test accuracy is 96.527% eval progress: 0%| | 0/3125 [00:00<?, ?it/s] epoch 1: test accuracy is 96.776% eval progress: 0%| | 0/3125 [00:00<?, ?it/s] epoch 2: test accuracy is 96.854% We got 96.8% accuracy, essentially the same as FastAI LSTM model. This implies the extra complexity here was for nought. Of course, this problem is a simple one: If we had a more complex problem, I’m sure using a stronger pre-trained language model would give an edge relative to the simpler LSTM trained on Wikipedia. Also, by not fine-tuning the whole network, we miss out on the full power of the transformer. But this suggests that you shouldn’t write off FastAI without trying, which, as I show above, is quite simple. 
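As noted above, training only the classification head leaves some of the transformer's power on the table. If I wanted to spend a bit more GPU time, a natural middle ground would be to also unfreeze the last transformer block. The cell below is only a sketch of that idea, not something I ran for this post: it reuses the model, learning_rate and AdamW objects defined above, and the attribute names (distilbert.transformer.layer, pre_classifier, classifier) follow Hugging Face's DistilBERT implementation. The training loop itself would stay the same.

[ ]: # Sketch: unfreeze the last transformer block in addition to the classification head
     for param in model.distilbert.transformer.layer[-1].parameters():
         param.requires_grad = True

     # A common trick when partially fine-tuning: give the pre-trained block a
     # smaller learning rate than the freshly initialized head
     optimizer = AdamW(
         [
             {
                 "params": model.distilbert.transformer.layer[-1].parameters(),
                 "lr": learning_rate / 10,
             },
             {"params": model.pre_classifier.parameters(), "lr": learning_rate},
             {"params": model.classifier.parameters(), "lr": learning_rate},
         ]
     )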
Let's see which mistakes this model is making:

[18]: test_df["proba_person"] = test_preds
      test_df["abs_error"] = np.where(
          test_df.label == "person", 1 - test_df.proba_person, test_df.proba_person
      )
      test_df.sort_values("abs_error", ascending=False)[
          ["clean_name", "label", "proba_person"]
      ].head(10)

[18]:                                                  clean_name    label  proba_person
      6192                                best horizon consulting   person      0.000008
      47006   rolf schneebiegl & seine original schwarzwald-musi   person      0.000326
      58512                                           development   person      0.000363
      9404                                           xin yuan yao  company      0.999556
      59585                                           cheng hsong  company      0.999490
      46757                         compagnie lyonnaise du cin ma   person      0.000550
      38224                                         pawel siwczak  company      0.999389
      25983                                         sarah hussain  company      0.999311
      23870                                         manjeet singh  company      0.999295
      73909                                            glassworks   person      0.000776

Again, we see cases of clear mislabeling on the person side and some genuinely tough cases on the company side. Given the accuracy and the worst mistakes, we may be at the limit of what can be done for this dataset without cleaning it.

Now, the final question: can we get the same level of accuracy without any supervised training at all?

1.6 ChatGPT one-shot learning

We will use OpenAI's API to ask ChatGPT to do the name classification for us. First, we need to define the prompt very carefully, which is what is now called prompt engineering. There are some rules of thumb for prompt engineering, for example: always give concrete examples before asking ChatGPT to generalize to new ones. The ChatGPT API has three message roles:

* System: helps set the tone of the conversation and gives overall directions
* User: represents yourself; use it to state your task or need
* Assistant: represents ChatGPT; use it to give examples of valid or reasonable responses

You can mix and match all roles, but I suggest starting with the system one, having at least one round of task-response examples, then restating the task that will actually be completed by ChatGPT.

Here, I ask ChatGPT to classify 10 names into person or company. If I ask for more, say 100 names, there is a higher chance of failure (e.g. it sees a weird string and complains that there is nothing it can do for the whole batch). If there is still a failure, I do a backup query on each name individually. If ChatGPT fails to provide a clear answer on an individual name, I default to answering "company", since this class contains more problematic strings.

Finally, how can I extract the labels from ChatGPT's response? It might answer slightly differently, for example by fixing a misspelling or by using uppercase instead of lowercase (system prompt notwithstanding). In general, it answers in the same order, but can I rely on that completely for all 100k examples? To be safe, I do a simple string matching based on the Levenshtein distance to match the names I query with ChatGPT's responses.

To reproduce the code below, you need to have an OpenAI account and OPENAI_API_KEY set in your environment.

[20]: system_prompt = """
You are a named entity recognition expert.
You only answer in lowercase.
You only classify names as "company" or "person".
""" task_prompt = "Classify the following names into company or person:" examples_prompt = """google: company john smith: person openai: company sam altman: person""" base_prompt = [ {"role": "system", "content": system_prompt}, {"role": "user", "content": task_prompt}, {"role": "assistant", "content": examples_prompt}, {"role": "user", "content": task_prompt}, ] def get_chatgpt_preds(batch_df): """ Gets predictions for a whole batch of names using ChatGPT's API""" prompt = base_prompt.copy() prompt += [{"role": "user", "content": " \n ".join(batch_df.clean_name)}] openai.api_key = os.getenv("OPENAI_API_KEY") # Max tokens as 20000 is enough in practice for 10 names plus the prompt # Temperature is set to 0 to reduce ChatGPT's "creativity" # Model `gpt-3.5-turbo` is the latest ChatGPT model, which is 10x cheaper ␣ ↪ than GPT3 response = openai.ChatCompletion.create( 14 model="gpt-3.5-turbo", messages=prompt, max_tokens=2000, temperature=0 ) # Since we gave examples as "name: class", ChatGPT almost always follows ␣ ↪ this pattern in its answers text_result = response["choices"][0]["message"]["content"] clean_text = [ line.lower().split(":") for line in text_result.split(" \n ") if ":" in ␣ ↪ line ] # Fallback query: if I cannot find enough names on the response, I ask for ␣ ↪ each name separately # Without it, we'd have parsing failures once every 10 or 20 batches if len(clean_text) < len(batch_df): clean_text = [] for _, row in batch_df.iterrows(): prompt = base_prompt.copy() prompt += [{"role": "user", "content": row.clean_name}] response = openai.ChatCompletion.create( model="gpt-3.5-turbo", messages=prompt, max_tokens=2000,␣ ↪ temperature=0 ) row_response = response["choices"][0]["message"]["content"] if ":" in row_response: clean_text.append( [row_response.split(":")[0], row_response.split(":")[-1]] ) else : clean_text.append([row.clean_name, "company"]) # defaults to ␣ ↪ company # To ensure I'm matching the query and the corresponding answers correctly, # I find the closest sentences in the Levenshtein distance sense batch_df = batch_df.copy() batch_df = batch_df.merge(pd.DataFrame({"resp": clean_text}), how="cross") batch_df["resp_name"] = batch_df.resp.str[0].str.strip() batch_df["resp_pred"] = batch_df.resp.str[-1].str.strip() batch_df["dist"] = batch_df.apply( lambda row: lev.distance(row.clean_name, row.resp_name), axis=1 ) batch_df["rank"] = batch_df.groupby("clean_name")["dist"].rank( method="first", ascending= True ) batch_df = batch_df.query("rank==1.0")[["clean_name", "label", "resp_pred"]] 15 return batch_df [21]: chatgpt_num_workers = 32 chatgpt_batch_size = 10 split_size = len(test_df) // chatgpt_batch_size test_batches = np.array_split(test_df, split_size) chatgpt_preds = Parallel(n_jobs=chatgpt_num_workers, verbose=5)( delayed(get_chatgpt_preds)(batch_df) for batch_df in test_batches ) [Parallel(n_jobs=32)]: Using backend LokyBackend with 32 concurrent workers. 
[Parallel(n_jobs=32)]: Done 8 tasks | elapsed: 6.9s
[Parallel(n_jobs=32)]: Done 98 tasks | elapsed: 18.6s
[Parallel(n_jobs=32)]: Done 224 tasks | elapsed: 32.5s
[Parallel(n_jobs=32)]: Done 386 tasks | elapsed: 50.4s
[Parallel(n_jobs=32)]: Done 584 tasks | elapsed: 1.2min
[Parallel(n_jobs=32)]: Done 818 tasks | elapsed: 1.6min
[Parallel(n_jobs=32)]: Done 1088 tasks | elapsed: 2.2min
[Parallel(n_jobs=32)]: Done 1394 tasks | elapsed: 2.7min
[Parallel(n_jobs=32)]: Done 1736 tasks | elapsed: 3.3min
[Parallel(n_jobs=32)]: Done 2114 tasks | elapsed: 4.1min
[Parallel(n_jobs=32)]: Done 2528 tasks | elapsed: 4.8min
[Parallel(n_jobs=32)]: Done 2978 tasks | elapsed: 5.7min
[Parallel(n_jobs=32)]: Done 3464 tasks | elapsed: 6.6min
[Parallel(n_jobs=32)]: Done 3986 tasks | elapsed: 7.6min
[Parallel(n_jobs=32)]: Done 4544 tasks | elapsed: 8.6min
[Parallel(n_jobs=32)]: Done 5138 tasks | elapsed: 9.7min
[Parallel(n_jobs=32)]: Done 5768 tasks | elapsed: 10.9min
[Parallel(n_jobs=32)]: Done 6434 tasks | elapsed: 12.1min
[Parallel(n_jobs=32)]: Done 7136 tasks | elapsed: 13.4min
[Parallel(n_jobs=32)]: Done 7874 tasks | elapsed: 14.8min
[Parallel(n_jobs=32)]: Done 8648 tasks | elapsed: 16.2min
[Parallel(n_jobs=32)]: Done 9458 tasks | elapsed: 17.7min
[Parallel(n_jobs=32)]: Done 10000 out of 10000 | elapsed: 18.7min finished

[22]: chatgpt_preds = pd.concat(chatgpt_preds)
      chatgpt_accuracy = (chatgpt_preds.resp_pred == chatgpt_preds.label).sum() / len(
          chatgpt_preds
      )
      print(f"ChatGPT accuracy is {round(100 * chatgpt_accuracy, 2)}%")

ChatGPT accuracy is 97.52%

Incredible! ChatGPT managed to outperform complex neural networks trained for this specific task. One explanation is that it used its knowledge of the world to understand corner cases that the other models could not possibly have learned from the training set alone.

Let's see the raw responses:

[23]: chatgpt_preds.resp_pred.value_counts().head(20)

[23]: person                                                                       50955
      company                                                                      48913
      company or person (not enough context to determine)                             13
      it is not clear whether it is a company or a person.                            11
      neither (not a name)                                                             7
      not a name                                                                       6
      it is not clear whether this is a company or a person.                           5
      cannot be classified as either company or person                                 4
      company or person (not enough information to determine)                          4
      i'm sorry, i cannot classify this name as it does not appear to be a valid name. 3
      neither                                                                          3
      it is not clear whether it is a person or a company.                             2
      n/a (not a name)                                                                 2
      i am sorry, i cannot classify this name as it does not provide enough information to determine if it is a company or a person. 2
      this is not a name.                                                              2
      this is not a valid name.                                                        2
      person (assuming it's a misspelling of a person's name)                          2
      neither person nor company (not a name)                                          2
      person or company (without more context it is difficult to determine)            2
      place                                                                            2
      Name: resp_pred, dtype: int64

In the vast majority of cases, ChatGPT answers as I request: person or company. In very rare cases, it states that it doesn't know, that it isn't clear, or that it could be both. What are such examples in practice?
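Before looking at those examples, note that the accuracy above compares ChatGPT's raw answer to the label with exact string equality, so every one of these free-form responses counts as a miss. Given how rare they are, this barely moves the headline number, but a small post-processing step could fold them into the two valid classes. The helper below is my own sketch, not part of the pipeline above; it follows the earlier choice of defaulting to "company" when the answer is unclear.

[ ]: # Sketch: normalize free-form answers onto the two valid labels
     def normalize_pred(resp: str) -> str:
         resp = resp.lower()
         if "person" in resp and "company" not in resp:
             return "person"
         if "company" in resp and "person" not in resp:
             return "company"
         return "company"  # ambiguous or off-topic answers fall back to "company"

     chatgpt_preds["resp_pred_norm"] = chatgpt_preds.resp_pred.map(normalize_pred)
     normalized_accuracy = (chatgpt_preds.resp_pred_norm == chatgpt_preds.label).mean()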
[24]: pd.options.display.width = 0
      chatgpt_preds[~chatgpt_preds.resp_pred.isin(["person", "company"])].head(10)

[24]:                                  clean_name    label  resp_pred
      55                  alkj rskolen ringk bing  company  neither (not a valid name)
      55                                    81355   person  cannot be classified without more context
      22                    telepathic teddy bear   person  neither
      11                                   agebim  company  it is not clear whether it is a company or a person.
      44                           i quit smoking  company  neither company nor person
      33                      saint peters church  company  company (assuming it's a church organization)
      88    holy trinity lutheran church akron oh  company  company (assuming it's a church organization)
      55                              displayname  company  company or person (not enough context to determine)
      66                      ken katzen fine art  company  company or person (not enough context to determine)
      66                     columbus high school  company  company (assuming it's a school)

The names ChatGPT cannot classify are definitely tricky, like "81355" or "telepathic teddy bear". In some cases, like "saint peters church", it does get it right, just with some extra commentary in parentheses. All in all, I'd say ChatGPT did an amazing job and failed in a very human way.

1.7 Conclusion

We have explored 4 ways to classify names, from a simple logistic regression to a complex neural network transformer. In the end, a general API from ChatGPT outperformed them all without any task-specific supervised training.

Method                                          Accuracy (%)
Baseline                                        89.5
Benchmark (Matthew Jones's open-source model)   95
FastAI                                          97.1
Hugging Face                                    96.9
ChatGPT                                         97.5

There is a lot of hype around LLMs and ChatGPT, but I'd say they do deserve the attention they're getting. These models are transforming tasks that used to require deep machine learning knowledge into software plus prompt engineering problems. As a data scientist, I'm not worried about those models taking over my job, as predictive modelling is only a small aspect of what a data scientist does.