COVID-19-Brasil

Data analysis against Benford's Law

Given the uncertainty about the real COVID-19 numbers in Brazil, this analysis aims to find out whether the officially reported infections and deaths follow the well-known Benford distribution.

More information about Benford's Law: https://pt.wikipedia.org/wiki/Lei_de_Benford

At no point does this analysis aim to prove or disprove any irregularity in the data recorded by the country. It aims only to indicate whether the authorities need to examine more closely how this information is collected and validated.

We use the data analysis tools available in Python and its libraries, and anyone interested can reproduce the sequence described below. We use official numbers, broken down by state and date, from January to April 2020.

First, we import the libraries.

In [1]:
    %matplotlib inline
    import numpy as np
    import pandas as pd
    import benford as bf

Next, we load the data (source: https://covid.saude.gov.br/).

In [2]:
    covid = pd.read_csv('data/covid.csv', index_col='idx', parse_dates=True)

Let's take a look at the rows and columns we obtained.

01/05/20 09:59

In [3]:
    covid

Out[3]:
          regiao        estado  data        casosNovos  casosAcumulados  obitosNovos  obitosAcumulados
    idx
    1     Norte         RO      2020-01-30  0           0                0            0
    2     Norte         RO      2020-01-31  0           0                0            0
    3     Norte         RO      2020-02-01  0           0                0            0
    4     Norte         RO      2020-02-02  0           0                0            0
    5     Norte         RO      2020-02-03  0           0                0            0
    ...   ...           ...     ...         ...         ...              ...          ...
    2480  Centro-Oeste  DF      2020-04-26  53          1066             1            27
    2481  Centro-Oeste  DF      2020-04-27  80          1146             0            27
    2482  Centro-Oeste  DF      2020-04-28  67          1213             1            28
    2483  Centro-Oeste  DF      2020-04-29  62          1275             0            28
    2484  Centro-Oeste  DF      2020-04-30  81          1356             2            30

    2484 rows × 7 columns

The values we will compare are casosNovos (new cases) and obitosNovos (new deaths).
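For reference, Benford's Law predicts that a leading digit group d occurs with probability log10(1 + 1/d). The sketch below computes those proportions with the standard library only; the helper name `benford_expected` is hypothetical and is not part of the `benford` package. These are the values the library reports in its Expected column.

```python
import math


def benford_expected(digs=1):
    """Return {leading digits: expected proportion} under Benford's Law.

    digs=1 covers the digits 1..9, digs=2 covers the pairs 10..99.
    Hypothetical helper for illustration; not the benford package's API.
    """
    start, stop = 10 ** (digs - 1), 10 ** digs
    return {d: math.log10(1 + 1 / d) for d in range(start, stop)}


first_digit = benford_expected(1)
first_two_digits = benford_expected(2)

# Print the expected proportions for the first digit, one per line.
for digit, proportion in first_digit.items():
    print(digit, round(proportion, 6))
```

Each set of proportions sums to 1, so the observed frequencies of a sample can be compared against it directly.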
New cases reported daily, all states

We use the leading-digits rule to compare the data against Benford's Law. In the chart, BLUE bars mark digit frequencies WITHIN Benford's Law and YELLOW bars mark digit frequencies OUTSIDE it. We analyze with two methods: first digit and first two digits.

In [44]:
    f1d = bf.first_digits(covid.casosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 1126 registries.
    First_1_Dig
    1    0.296625
    2    0.206927
    3    0.141208
    4    0.098579
    5    0.070160
    6    0.064831
    7    0.045293
    8    0.041741
    9    0.034636
    Name: Found, dtype: float64

    Test performed on 1126 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found   Z_score
    First_1_Dig
    2            0.176091  0.206927  2.677430
    3            0.124939  0.141208  1.606001

Digits 2 and 3 showed a significant discrepancy (Z-score above 1.5) in the number of cases reported nationwide.

Now, the first-two-digits rule.

In [43]:
    f2d = bf.first_digits(covid.casosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 1126 registries.
    First_2_Dig
    10    0.103908
    11    0.026643
    12    0.027531
    13    0.023979
    14    0.023979
            ...
    95    0.001776
    96    0.000888
    97    0.000000
    98    0.002664
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 1126 registries.
    Discarded 0 records < 10 after preparation.
    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_2_Dig
    20           0.021189  0.095915  17.307789
    30           0.014240  0.061279  13.196401
    10           0.041393  0.103908  10.456229
    40           0.010724  0.042629  10.249563
    70           0.006160  0.029307   9.736247
    90           0.004799  0.023979   9.097349
    80           0.005395  0.024867   8.716322
    50           0.008600  0.031972   8.331923
    60           0.007179  0.023091   6.148186

When only the first digit is analyzed, the curve does not stray far from the expected one. With two digits, however, multiples of 10 clearly tend to appear at a frequency very different from what Benford's Law predicts, with Z-scores above 10 for 10, 20, 30 and 40.

Deaths reported daily, all states

We use the same rule to analyze the death records.

In [42]:
    f1d = bf.first_digits(covid.obitosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 589 registries.
    First_1_Dig
    1    0.421053
    2    0.220713
    3    0.147708
    4    0.071307
    5    0.049236
    6    0.030560
    7    0.023769
    8    0.020374
    9    0.015280
    Name: Found, dtype: float64

    Test performed on 589 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found   Z_score
    First_1_Dig
    1            0.301030  0.421053  6.305275
    2            0.176091  0.220713  2.789040
    3            0.124939  0.147708  1.608930

In this case, every digit deviated from the curve, with digits 1, 2 and 3 showing the largest discrepancies.

Now, the first-two-digits rule.

In [41]:
    f2d = bf.first_digits(covid.obitosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 589 registries.
    First_2_Dig
    10    0.331070
    11    0.005093
    12    0.018676
    13    0.008489
    14    0.006791
            ...
    95    0.000000
    96    0.000000
    97    0.000000
    98    0.000000
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 589 registries.
    Discarded 0 records < 10 after preparation.
    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_2_Dig
    10           0.041393  0.331070  35.189639
    20           0.021189  0.174873  25.755579
    30           0.014240  0.127334  22.992064
    40           0.010724  0.056027  10.474613
    50           0.008600  0.042445   8.672368
    60           0.007179  0.022071   4.037273
    70           0.006160  0.018676   3.618590
    80           0.005395  0.016978   3.556288
    90           0.004799  0.015280   3.382709

For reported deaths, the first-digit curve already differs considerably from what Benford's Law predicts, and with two digits the discrepancy in the same data is far more pronounced.

Breakdown by region

Let's analyze the same data, now split by geographic region.

Norte

In [8]:
    covid_Norte = covid[covid['regiao']=='Norte']

Cases

In [40]:
    f1d = bf.first_digits(covid_Norte.casosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 246 registries.
    First_1_Dig
    1    0.292683
    2    0.252033
    3    0.101626
    4    0.101626
    5    0.044715
    6    0.073171
    7    0.048780
    8    0.048780
    9    0.036585
    Name: Found, dtype: float64

    Test performed on 246 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found   Z_score
    First_1_Dig
    2            0.176091  0.252033  3.043371

In [39]:
    f2d = bf.first_digits(covid_Norte.casosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 246 registries.
    First_2_Dig
    10    0.134146
    11    0.032520
    12    0.016260
    13    0.024390
    14    0.012195
            ...
    95    0.008130
    96    0.000000
    97    0.000000
    98    0.004065
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 246 registries.
    Discarded 0 records < 10 after preparation.
    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_2_Dig
    20           0.021189  0.121951  10.752425
    40           0.010724  0.065041   7.961669
    10           0.041393  0.134146   7.143217
    70           0.006160  0.040650   6.506153
    60           0.007179  0.040650   5.840977
    80           0.005395  0.032520   5.372716
    30           0.014240  0.052846   4.841454
    90           0.004799  0.024390   3.985088
    50           0.008600  0.020325   1.646364
    69           0.006249  0.016260   1.588024

Deaths

In [38]:
    f1d = bf.first_digits(covid_Norte.obitosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 105 registries.
    First_1_Dig
    1    0.466667
    2    0.276190
    3    0.104762
    4    0.047619
    5    0.038095
    6    0.000000
    7    0.028571
    8    0.019048
    9    0.019048
    Name: Found, dtype: float64

    Test performed on 105 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found   Z_score
    First_1_Dig
    1            0.301030  0.466667  3.593755
    2            0.176091  0.276190  2.564774

In [45]:
    f2d = bf.first_digits(covid_Norte.obitosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 105 registries.
    First_2_Dig
    10    0.361905
    11    0.009524
    12    0.000000
    13    0.000000
    14    0.028571
            ...
    95    0.000000
    96    0.000000
    97    0.000000
    98    0.000000
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 105 registries.
    Discarded 0 records < 10 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_2_Dig
    10           0.041393  0.361905  16.242620
    20           0.021189  0.219048  13.739193
    30           0.014240  0.085714   5.769671
    50           0.008600  0.038095   2.744709
    40           0.010724  0.038095   2.249316
    90           0.004799  0.019048   1.406664

Nordeste

In [11]:
    covid_Nordeste = covid[covid['regiao']=='Nordeste']

Cases

In [46]:
    f1d = bf.first_digits(covid_Nordeste.casosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 357 registries.
    First_1_Dig
    1    0.327731
    2    0.196078
    3    0.162465
    4    0.081232
    5    0.081232
    6    0.044818
    7    0.030812
    8    0.050420
    9    0.025210
    Name: Found, dtype: float64

    Test performed on 357 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found   Z_score
    First_1_Dig
    3            0.124939  0.162465  2.064346

In [47]:
    f2d = bf.first_digits(covid_Nordeste.casosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 357 registries.
    First_2_Dig
    10    0.126050
    11    0.033613
    12    0.025210
    13    0.030812
    14    0.025210
            ...
    95    0.000000
    96    0.000000
    97    0.000000
    98    0.000000
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 357 registries.
    Discarded 0 records < 10 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found   Z_score
    First_2_Dig
    20           0.021189  0.092437  9.163786
    50           0.008600  0.050420  8.270778
    30           0.014240  0.067227  8.226552
    10           0.041393  0.126050  7.897209
    80           0.005395  0.030812  6.194778
    70           0.006160  0.022409  3.585465
    90           0.004799  0.016807  2.900098
    40           0.010724  0.025210  2.400464

Deaths

In [48]:
    f1d = bf.first_digits(covid_Nordeste.obitosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 217 registries.
    First_1_Dig
    1    0.433180
    2    0.207373
    3    0.147465
    4    0.064516
    5    0.050691
    6    0.036866
    7    0.018433
    8    0.018433
    9    0.023041
    Name: Found, dtype: float64

    Test performed on 217 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected    Found   Z_score
    First_1_Dig
    1             0.30103  0.43318  4.169874

In [49]:
    f2d = bf.first_digits(covid_Nordeste.obitosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 217 registries.
    First_2_Dig
    10    0.322581
    11    0.004608
    12    0.036866
    13    0.018433
    14    0.000000
            ...
    95    0.000000
    96    0.000000
    97    0.000000
    98    0.000000
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 217 registries.
    Discarded 0 records < 10 after preparation.
    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_2_Dig
    10           0.041393  0.322581  20.623910
    20           0.021189  0.161290  14.094887
    30           0.014240  0.129032  13.985809
    40           0.010724  0.059908   6.704731
    50           0.008600  0.046083   5.612170
    60           0.007179  0.032258   3.974108
    90           0.004799  0.023041   3.397428
    80           0.005395  0.018433   2.158585
    70           0.006160  0.018433   1.876766

Sudeste

In [14]:
    covid_Sudeste = covid[covid['regiao']=='Sudeste']

Cases

In [50]:
    f1d = bf.first_digits(covid_Sudeste.casosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 204 registries.
    First_1_Dig
    1    0.289216
    2    0.156863
    3    0.122549
    4    0.098039
    5    0.102941
    6    0.102941
    7    0.053922
    8    0.044118
    9    0.029412
    Name: Found, dtype: float64

    Test performed on 204 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found   Z_score
    First_1_Dig
    6            0.066947  0.102941  1.916921

In [51]:
    f2d = bf.first_digits(covid_Sudeste.casosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 204 registries.
    First_2_Dig
    10    0.083333
    11    0.024510
    12    0.039216
    13    0.014706
    14    0.039216
            ...
    95    0.000000
    96    0.004902
    97    0.000000
    98    0.004902
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 204 registries.
    Discarded 0 records < 10 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found   Z_score
    First_2_Dig
    30           0.014240  0.049020  3.897169
    10           0.041393  0.083333  2.831499
    20           0.021189  0.049020  2.517025
    80           0.005395  0.019608  2.293336
    52           0.008273  0.024510  2.173940
    60           0.007179  0.019608  1.688168
    56           0.007687  0.019608  1.548705
    82           0.005264  0.014706  1.379786

Deaths

In [52]:
    f1d = bf.first_digits(covid_Sudeste.obitosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 134 registries.
    First_1_Dig
    1    0.283582
    2    0.179104
    3    0.186567
    4    0.119403
    5    0.059701
    6    0.067164
    7    0.044776
    8    0.044776
    9    0.014925
    Name: Found, dtype: float64

    Test performed on 134 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found   Z_score
    First_1_Dig
    3            0.124939  0.186567  2.026942

In [53]:
    f2d = bf.first_digits(covid_Sudeste.obitosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 134 registries.
    First_2_Dig
    10    0.149254
    11    0.007463
    12    0.022388
    13    0.007463
    14    0.007463
            ...
    95    0.000000
    96    0.000000
    97    0.000000
    98    0.000000
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 134 registries.
    Discarded 0 records < 10 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_2_Dig
    30           0.014240  0.141791  12.097438
    10           0.041393  0.149254   6.051247
    40           0.010724  0.067164   5.923829
    20           0.021189  0.097015   5.794895
    60           0.007179  0.037313   3.620420
    80           0.005395  0.029851   3.275001
    50           0.008600  0.037313   3.131845
    70           0.006160  0.029851   2.952799
    41           0.010465  0.029851   1.780667

Sul

In [17]:
    covid_Sul = covid[covid['regiao']=='Sul']

Cases

In [54]:
    f1d = bf.first_digits(covid_Sul.casosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 142 registries.
    First_1_Dig
    1    0.246479
    2    0.218310
    3    0.183099
    4    0.105634
    5    0.042254
    6    0.063380
    7    0.056338
    8    0.021127
    9    0.063380
    Name: Found, dtype: float64

    Test performed on 142 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found   Z_score
    First_1_Dig
    3            0.124939  0.183099  1.969142

In [55]:
    f2d = bf.first_digits(covid_Sul.casosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 142 registries.
    First_2_Dig
    10    0.056338
    11    0.021127
    12    0.028169
    13    0.028169
    14    0.021127
            ...
    95    0.000000
    96    0.000000
    97    0.000000
    98    0.007042
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 142 registries.
    Discarded 0 records < 10 after preparation.
    The entries with the significant positive deviations are:

                 Expected     Found   Z_score
    First_2_Dig
    90           0.004799  0.049296  7.065546
    70           0.006160  0.035211  3.888058
    20           0.021189  0.063380  3.199699
    39           0.010995  0.028169  1.560099
    62           0.006949  0.021127  1.528718
    60           0.007179  0.021127  1.471806
    36           0.011899  0.028169  1.401035
    35           0.012234  0.028169  1.345602

Deaths

In [56]:
    f1d = bf.first_digits(covid_Sul.obitosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 78 registries.
    First_1_Dig
    1    0.384615
    2    0.230769
    3    0.217949
    4    0.064103
    5    0.076923
    6    0.012821
    7    0.012821
    8    0.000000
    9    0.000000
    Name: Found, dtype: float64

    Test performed on 78 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found   Z_score
    First_1_Dig
    3            0.124939  0.217949  2.313109
    1            0.301030  0.384615  1.485903

In [57]:
    f2d = bf.first_digits(covid_Sul.obitosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 78 registries.
    First_2_Dig
    10    0.384615
    11    0.000000
    12    0.000000
    13    0.000000
    14    0.000000
            ...
    95    0.000000
    96    0.000000
    97    0.000000
    98    0.000000
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 78 registries.
    Discarded 0 records < 10 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_2_Dig
    10           0.041393  0.384615  14.933213
    30           0.014240  0.217949  14.706958
    20           0.021189  0.230769  12.459440
    50           0.008600  0.076923   5.921731
    40           0.010724  0.064103   4.027347

Centro-Oeste

In [20]:
    covid_CentroOeste = covid[covid['regiao']=='Centro-Oeste']

Cases

In [58]:
    f1d = bf.first_digits(covid_CentroOeste.casosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 177 registries.
    First_1_Dig
    1    0.288136
    2    0.214689
    3    0.141243
    4    0.124294
    5    0.067797
    6    0.050847
    7    0.050847
    8    0.028249
    9    0.033898
    Name: Found, dtype: float64

    Test performed on 177 registries.
    Discarded 0 records < 1 after preparation.
    The entries with the significant positive deviations are:

    Empty DataFrame
    Columns: [Expected, Found, Z_score]
    Index: []

In [59]:
    f2d = bf.first_digits(covid_CentroOeste.casosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 177 registries.
    First_2_Dig
    10    0.079096
    11    0.011299
    12    0.033898
    13    0.016949
    14    0.022599
            ...
    95    0.000000
    96    0.000000
    97    0.000000
    98    0.000000
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 177 registries.
    Discarded 0 records < 10 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_2_Dig
    20           0.021189  0.146893  11.351542
    40           0.010724  0.090395   9.926079
    30           0.014240  0.101695   9.503024
    70           0.006160  0.045198   6.157247
    50           0.008600  0.050847   5.680048
    90           0.004799  0.033898   5.058207
    80           0.005395  0.022599   2.611514
    10           0.041393  0.079096   2.329498
    60           0.007179  0.022599   1.984927

Deaths

In [60]:
    f1d = bf.first_digits(covid_CentroOeste.obitosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 55 registries.
    First_1_Dig
    1    0.672727
    2    0.254545
    3    0.036364
    4    0.036364
    5    0.000000
    6    0.000000
    7    0.000000
    8    0.000000
    9    0.000000
    Name: Found, dtype: float64

    Test performed on 55 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found   Z_score
    First_1_Dig
    1            0.301030  0.672727  5.862497
    2            0.176091  0.254545  1.350525

In [61]:
    f2d = bf.first_digits(covid_CentroOeste.obitosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 55 registries.
    First_2_Dig
    10    0.672727
    11    0.000000
    12    0.000000
    13    0.000000
    14    0.000000
            ...
    95    0.000000
    96    0.000000
    97    0.000000
    98    0.000000
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 55 registries.
    Discarded 0 records < 10 after preparation.
    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_2_Dig
    10           0.041393  0.672727  23.166460
    20           0.021189  0.254545  11.548768

Data by state

The current sample of 2484 records is too small to analyze the states separately. At that granularity there is not enough data for satisfactory reliability, because the period covers little more than 90 days after the first confirmed COVID-19 cases.

Conclusion

The digit distribution of reported deaths deviates far more from what Benford's Law predicts than that of reported cases. This supports the need for a deeper and more granular analysis of the data sent to the Ministry of Health (Ministério da Saúde), to find out what may have caused these differences.
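The Z_score columns reported throughout can be checked by hand. The sketch below assumes the usual Z statistic for a single proportion with a continuity correction of 1/(2N). That formula is an assumption, not taken from the `benford` package's source, and the function name `benford_z_score` is hypothetical. Under that assumption it reproduces the nationwide first-digit value for digit 2 in new cases (Found 0.206927, N = 1126).

```python
import math


def benford_z_score(found, expected, n):
    """Z statistic for one digit's observed proportion versus Benford's Law.

    Assumed formula (not confirmed against the benford package):
        Z = (|found - expected| - 1/(2n)) / sqrt(expected * (1 - expected) / n)
    The continuity correction 1/(2n) is applied only when it is smaller
    than the absolute difference.
    """
    diff = abs(found - expected)
    correction = 1 / (2 * n)
    if correction < diff:
        diff -= correction
    standard_error = math.sqrt(expected * (1 - expected) / n)
    return diff / standard_error


# Digit 2 in casosNovos, all states: values taken from the output of In [44].
expected_2 = math.log10(1 + 1 / 2)
z_cases_digit_2 = benford_z_score(found=0.206927, expected=expected_2, n=1126)
print(round(z_cases_digit_2, 2))
```

Because the standard error shrinks with the sample size, the same gap between Found and Expected yields a larger Z for the nationwide series than for a single region.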