'subscript out of bounds' error with naive bayes predict? (same levels in train/test)






I'm trying to run naive bayes on my data, a large dataframe of 35 variables, some of which are factors:


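library(e1071)  # provides naiveBayes()
# Fit on the training set, then predict classes for the test set
nb1927 <- naiveBayes(ostpayer ~ ., data = trainoversample)
nb199pred <- predict(nb1927, testoversample, type = "class")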



I keep getting the error:


Error in `[.default`(object$tables[[v]], , nd + islogical[attribs[v]]) :
subscript out of bounds



Now, I know from searching that mismatched factor levels can cause this. HOWEVER, this same test set already went through logistic regression prediction with no issues after I dropped some levels. So it stands to reason the exact same test set would work for naive Bayes, yes?



I even ran:


sapply(trainoversample, levels)
sapply(testoversample, levels)



on both sets and then put the results through diffchecker.com (great website, by the way). It showed that my test set had FEWER levels than the train set did, because I'd dropped some for the logistic regression by coercing them into an "UNK" level for those variables.



So it can't possibly be the levels. I even ran the sapply command on the train set after droplevels() and put that through diffchecker; still nothing. So it's not the internal level dropping in naive Bayes doing it either.
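For reference, the level comparison described above can also be done directly in R rather than through a diff tool. The following is only a minimal sketch on toy data: the data frames `train` and `test` and their columns are hypothetical stand-ins, not the real sets.

```r
# Toy data frames standing in for the real train/test sets (hypothetical).
train <- data.frame(
  state  = factor(c("AK", "AL", "AR", "AK")),
  status = factor(c("BAD", "GOOD", "UNK", "GOOD"))
)
test <- data.frame(
  state  = factor(c("AK", "AL", "AK")),
  status = factor(c("GOOD", "GOOD", "NEW"))
)

# Factor columns that exist in both frames.
factor_cols <- names(train)[sapply(train, is.factor)]
factor_cols <- intersect(factor_cols, names(test))

# For each shared factor column, list levels present in test but not in train.
extra_levels <- lapply(
  setNames(factor_cols, factor_cols),
  function(col) setdiff(levels(test[[col]]), levels(train[[col]]))
)
# Keep only the columns where something unseen actually shows up.
extra_levels <- Filter(length, extra_levels)
print(extra_levels)

# Check whether the level vectors are exactly identical, order included.
same_levels <- vapply(
  factor_cols,
  function(col) identical(levels(train[[col]]), levels(test[[col]])),
  logical(1)
)
print(same_levels)
```

An empty `extra_levels` list would mean every test level was seen in training. `same_levels` is stricter: it also reports differences in level order or count, which can matter because factors are stored as integer codes.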





Any ideas?



I cannot post data or variable names, but here is the str() output for the test set in case it helps:


str(testoversample)
'data.frame': 405661 obs. of 35 variables:
$ 1 : int 1207532 1208246 1187313 1259718 1206948 1207319 1206577 1206725 1262913 1209568 ...
$ 2 : num 1668 1208 854 5225 347 ...
$ 3 : Date, format: "2017-04-13" "2017-04-19" "2017-02-13" "2017-11-14" ...
$ 4 : num 50 100 115 1204 30 ...
$ 5 : int 1 1 1 1 1 1 1 1 1 1 ...
$ 6 : Factor w/ 13 levels "1","2","3","4",..: 1 1 1 5 1 1 1 1 5 1 ...
$ 7 : int 0 0 0 0 0 0 0 0 0 0 ...
$ 8 : int 0 0 0 0 0 0 0 0 0 0 ...
$ 9 : Date, format: "2016-02-25" "2016-11-03" "2015-12-29" "2016-11-14" ...
$ 10 : int 0 0 0 0 0 0 0 0 0 0 ...
$ 11 : int 1 1 1 1 1 1 1 1 1 1 ...
$ 12 : num 50 100 115 1204 30 ...
$ 13 : int 284 242 224 313 225 176 318 221 108 244 ...
$ 35 : int 2773 3452 6042 3231 6104 2395 2575 6336 6392 2534 ...
$ 14 : int 1 1 1 1 1 1 1 1 1 1 ...
$ 15 : int 1 6 1 6 3 5 0 13 2 2 ...
$ 16 : int 0 0 0 0 0 0 0 0 0 0 ...
$ 17 : int 0 0 0 0 0 0 0 1 0 0 ...
$ 18 : int 15300 11140 0 9500 8300 1100 16600 500 0 2500 ...
$ 19 : int 13692 1474 0 6916 8981 1543 9687 3 0 1820 ...
$ 20 : int 0 0 0 0 0 0 0 1 0 1 ...
$ 21 : int 0 1 0 0 0 2 0 0 0 1 ...
$ 22 : int 3 1 0 1 3 2 2 0 2 0 ...
$ 23 : int 0 3 0 4 1 0 0 5 1 0 ...
$ 24 : Factor w/ 3 levels "BAD","GOOD","UNK": 2 2 2 2 2 2 2 2 2 2 ...
$ 25 : int 1 1 0 1 1 1 0 1 1 0 ...
$ 26 : Factor w/ 6 levels "CUZ","DFA","DNF",..: 4 4 4 4 4 4 4 4 4 4 ...
$ 27 : Factor w/ 50 levels "AK","AL","AR",..: 18 42 17 48 20 32 5 4 27 5 ...
$ 28 : Factor w/ 6 levels "Discharged","Dismissed",..: 3 3 3 3 3 3 3 1 3 3 ...
$ 29 : Factor w/ 3 levels "Dismissed","Other",..: 2 2 2 2 2 2 2 2 2 2 ...
$ 30 : Factor w/ 6 levels "Discharged","Dismissed",..: 3 3 3 3 3 3 3 3 3 3 ...
$ 31 : int 0 0 0 0 0 0 0 0 0 0 ...
$ 32 : Factor w/ 13 levels "Alternate","AlternateCell",..: 6 6 2 5 5 7 6 6 6 5 ...
$ 33 : int 0 0 0 0 0 0 0 0 0 0 ...
$ 34 : num 0 0 0 0 0 0 0 0 0 0 ...
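A side note on the column types in the str() output above: the frame mixes integers, numerics, factors and two `Date` columns. In base R, `is.numeric()` returns `FALSE` for a `Date`, so code that splits predictors into "numeric" and "everything else" may end up treating dates as categorical. Whether that is what triggers the error here is an assumption, not something this thread confirms. A minimal sketch on a toy data frame (all names and values hypothetical) for spotting such columns:

```r
# Toy frame mimicking the mix of column types in the str() output (hypothetical data).
df <- data.frame(
  id      = 1:3,
  amount  = c(50, 100, 115),
  opened  = as.Date(c("2017-04-13", "2017-04-19", "2017-02-13")),
  quality = factor(c("GOOD", "BAD", "UNK"))
)

# First class of every column, as a named character vector.
col_class <- vapply(df, function(x) class(x)[1], character(1))
print(col_class)

# Columns that are neither numeric, factor, nor logical.
is_plain <- vapply(
  df,
  function(x) is.numeric(x) || is.factor(x) || is.logical(x),
  logical(1)
)
odd <- names(df)[!is_plain]
print(odd)

# One possible workaround (an assumption, not a verified fix):
# store the date as days since 1970-01-01 so it is an ordinary number.
df$opened <- as.numeric(df$opened)
```

Running the same check on both the train and test sets would show whether any column falls outside the types the model is expected to handle.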





When asking for help, you should include a simple reproducible example with sample input and desired output that can be used to test and verify possible solutions. A str() is not helpful. You don't have to share your real data, just something that will reproduce the error.
– MrFlick
yesterday







Well I'm unsure how to properly make a dummy dataset for this but I'll do my best
– CapnShanty
yesterday





My "reproducible" example doesn't suffer from the same problem despite having all the same columns and all the same factor levels for train/test so idk
– CapnShanty
yesterday





Well that’s one step closer to finding the problem. Now you just need to figure out how your real data is different from the sample data.
– MrFlick
yesterday




1 Answer



So I, per @MrFlick's suggestion, created a reproducible example. This reproducible example worked, and I was thus more confused than I had been.



So I tried to predict my train set on a hunch, and it wouldn't even predict my train set.



I made a very small version of my test set to see if size was the problem. Nope.



I downloaded and installed a different naive Bayes package (instead of using e1071). Same issue.



On down the line, I tested everything I could possibly think of, and then I stumbled across the answer. I had made a train set and a test set for the reproducible example, and the test version had NA for all of its column names. So I tried running it flipped (using the repro-train as the data to be predicted, since it had the normal column names), and sure enough, it failed.
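A quick way to notice this kind of difference between two data frames is to compare their column names directly. The sketch below uses toy frames (hypothetical names and values) and reproduces a test set whose column names are all NA:

```r
# Toy train/test frames (hypothetical).
train <- data.frame(amount = c(50, 100), state = factor(c("AK", "AL")))
test  <- data.frame(amount = c(30, 115), state = factor(c("AL", "AK")))

# Reproduce a test set whose column names are all NA.
names(test) <- NA

# TRUE if any column name is missing.
missing_names <- anyNA(names(test))
print(missing_names)

# Train column names that have no exact match in the test set.
unmatched_train <- setdiff(names(train), names(test))
print(unmatched_train)

# Position of each train column in the test set (NA means "not found").
print(match(names(train), names(test)))
```

If `unmatched_train` is non-empty, anything that looks up predictors by name will not find those columns in the test set.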



I then set the column names for my test set to NA and ran it and voila it worked!



Why? God only knows. I suspect there must be a weird character code in some variable name (this is data from our database, who knows what weird crap they've done), but if you run into the same issue, try removing the column names.
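To test the "weird character" suspicion rather than guess, the column names themselves can be inspected. This is only a sketch: the names below are made up, with a non-breaking space and a trailing space planted deliberately.

```r
# Toy frame with deliberately awkward column names (hypothetical).
test <- data.frame(a = 1:2, b = 3:4, c = 5:6)
names(test) <- c("amount", "pay\u00a0date", "state ")

# Flag names containing anything other than ASCII letters, digits, dots or underscores.
suspicious <- names(test)[grepl("[^A-Za-z0-9._]", names(test))]
print(suspicious)

# Show the Unicode code points of each flagged name; 160 is a non-breaking
# space and 32 is an ordinary space.
code_points <- lapply(suspicious, utf8ToInt)
print(code_points)

# make.names() rewrites names into syntactically valid, unique R names.
clean <- make.names(names(test), unique = TRUE)
print(clean)
```

Before dropping the column names altogether, it may also be worth checking that the resulting predictions still vary from row to row, for example with `table()` on the predicted classes, to make sure the predictors are actually being used.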





