COVID-19-Brasil

Data analysis against Benford's Law

01/05/20 09:59

Given the uncertainty about the real COVID-19 numbers in Brazil, this analysis aims to find out whether the officially reported infections and deaths follow the well-known Benford distribution. More information on Benford's Law: https://pt.wikipedia.org/wiki/Lei_de_Benford

At no point does this analysis aim to prove or disprove any irregularity in the data recorded by the country. It aims only to indicate whether the authorities need to look more closely at how this information is collected and validated.

We will use the data analysis tools available in Python and its libraries, and anyone interested can reproduce the sequence described below. We are using official numbers, broken down by state and date, from January to April 2020.

First, we import the libraries.

In [1]:
    %matplotlib inline
    import numpy as np
    import pandas as pd
    import benford as bf

Now we load the data (source: https://covid.saude.gov.br/).

In [2]:
    covid = pd.read_csv('data/covid.csv', index_col='idx', parse_dates=True)

Let's take a look at the rows and columns we obtained.

In [3]:
    covid

Out[3]:
                regiao  estado        data  casosNovos  casosAcumulados  obitosNovos  obitosAcumulados
    idx
    1            Norte      RO  2020-01-30           0                0            0                 0
    2            Norte      RO  2020-01-31           0                0            0                 0
    3            Norte      RO  2020-02-01           0                0            0                 0
    4            Norte      RO  2020-02-02           0                0            0                 0
    5            Norte      RO  2020-02-03           0                0            0                 0
    ...            ...     ...         ...         ...              ...          ...               ...
    2480  Centro-Oeste      DF  2020-04-26          53             1066            1                27
    2481  Centro-Oeste      DF  2020-04-27          80             1146            0                27
    2482  Centro-Oeste      DF  2020-04-28          67             1213            1                28
    2483  Centro-Oeste      DF  2020-04-29          62             1275            0                28
    2484  Centro-Oeste      DF  2020-04-30          81             1356            2                30

    2484 rows × 7 columns

The values we will compare are casosNovos and obitosNovos.

New cases reported daily, all states

We will use the first-digits rule to test the data against Benford's Law. In the charts, BLUE bars mark digit frequencies WITHIN Benford's Law, and YELLOW bars mark digit frequencies OUTSIDE it. We will analyze using two methods: First Digit and First Two Digits.

In [44]:
    f1d = bf.first_digits(covid.casosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 1126 registries.
    First_1_Dig
    1    0.296625
    2    0.206927
    3    0.141208
    4    0.098579
    5    0.070160
    6    0.064831
    7    0.045293
    8    0.041741
    9    0.034636
    Name: Found, dtype: float64

    Test performed on 1126 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_1_Dig
    2            0.176091  0.206927   2.677430
    3            0.124939  0.141208   1.606001

Digits 2 and 3 showed a significant discrepancy (Z_score above 1.5) in the number of cases reported nationally.

Now, the first-two-digits rule.

In [43]:
    f2d = bf.first_digits(covid.casosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 1126 registries.
    First_2_Dig
    10    0.103908
    11    0.026643
    12    0.027531
    13    0.023979
    14    0.023979
    ...
    95    0.001776
    96    0.000888
    97    0.000000
    98    0.002664
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 1126 registries.
    Discarded 0 records < 10 after preparation.
    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_2_Dig
    20           0.021189  0.095915  17.307789
    30           0.014240  0.061279  13.196401
    10           0.041393  0.103908  10.456229
    40           0.010724  0.042629  10.249563
    70           0.006160  0.029307   9.736247
    90           0.004799  0.023979   9.097349
    80           0.005395  0.024867   8.716322
    50           0.008600  0.031972   8.331923
    60           0.007179  0.023091   6.148186

When only the first digit is analyzed, the curve does not stray far from the expected one. With two digits, however, there is a clear tendency for the multiples of 10 to appear far more often than Benford's Law predicts, with Z_score values above 10 for 10, 20, 30 and 40.

Deaths reported daily, all states

We will use the same rule to analyze the death records.

In [42]:
    f1d = bf.first_digits(covid.obitosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 589 registries.
    First_1_Dig
    1    0.421053
    2    0.220713
    3    0.147708
    4    0.071307
    5    0.049236
    6    0.030560
    7    0.023769
    8    0.020374
    9    0.015280
    Name: Found, dtype: float64

    Test performed on 589 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_1_Dig
    1            0.301030  0.421053   6.305275
    2            0.176091  0.220713   2.789040
    3            0.124939  0.147708   1.608930

In this case, all digits deviated from the curve, with digits 1, 2 and 3 showing the largest discrepancy.

Now, the first-two-digits rule.

In [41]:
    f2d = bf.first_digits(covid.obitosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 589 registries.
    First_2_Dig
    10    0.331070
    11    0.005093
    12    0.018676
    13    0.008489
    14    0.006791
    ...
    95    0.000000
    96    0.000000
    97    0.000000
    98    0.000000
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 589 registries.
    Discarded 0 records < 10 after preparation.
    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_2_Dig
    10           0.041393  0.331070  35.189639
    20           0.021189  0.174873  25.755579
    30           0.014240  0.127334  22.992064
    40           0.010724  0.056027  10.474613
    50           0.008600  0.042445   8.672368
    60           0.007179  0.022071   4.037273
    70           0.006160  0.018676   3.618590
    80           0.005395  0.016978   3.556288
    90           0.004799  0.015280   3.382709

For reported deaths, the first-digit curve already differs considerably from what Benford's Law predicts, and when the same data are analyzed with two digits the discrepancy is much more pronounced.

Breaking down by region

Let us analyze the same data, now separated by federative region.

Norte

In [8]:
    covid_Norte = covid[covid['regiao'] == 'Norte']

Cases

In [40]:
    f2d = bf.first_digits(covid_Norte.casosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 246 registries.
    First_1_Dig
    1    0.292683
    2    0.252033
    3    0.101626
    4    0.101626
    5    0.044715
    6    0.073171
    7    0.048780
    8    0.048780
    9    0.036585
    Name: Found, dtype: float64

    Test performed on 246 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_1_Dig
    2            0.176091  0.252033   3.043371

In [39]:
    f2d = bf.first_digits(covid_Norte.casosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 246 registries.
    First_2_Dig
    10    0.134146
    11    0.032520
    12    0.016260
    13    0.024390
    14    0.012195
    ...
    95    0.008130
    96    0.000000
    97    0.000000
    98    0.004065
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 246 registries.
    Discarded 0 records < 10 after preparation.
    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_2_Dig
    20           0.021189  0.121951  10.752425
    40           0.010724  0.065041   7.961669
    10           0.041393  0.134146   7.143217
    70           0.006160  0.040650   6.506153
    60           0.007179  0.040650   5.840977
    80           0.005395  0.032520   5.372716
    30           0.014240  0.052846   4.841454
    90           0.004799  0.024390   3.985088
    50           0.008600  0.020325   1.646364
    69           0.006249  0.016260   1.588024

Deaths

In [38]:
    f2d = bf.first_digits(covid_Norte.obitosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 105 registries.
    First_1_Dig
    1    0.466667
    2    0.276190
    3    0.104762
    4    0.047619
    5    0.038095
    6    0.000000
    7    0.028571
    8    0.019048
    9    0.019048
    Name: Found, dtype: float64

    Test performed on 105 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_1_Dig
    1            0.301030  0.466667   3.593755
    2            0.176091  0.276190   2.564774

In [45]:
    f2d = bf.first_digits(covid_Norte.obitosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 105 registries.
    First_2_Dig
    10    0.361905
    11    0.009524
    12    0.000000
    13    0.000000
    14    0.028571
    ...
    95    0.000000
    96    0.000000
    97    0.000000
    98    0.000000
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 105 registries.
    Discarded 0 records < 10 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_2_Dig
    10           0.041393  0.361905  16.242620
    20           0.021189  0.219048  13.739193
    30           0.014240  0.085714   5.769671
    50           0.008600  0.038095   2.744709
    40           0.010724  0.038095   2.249316
    90           0.004799  0.019048   1.406664

Nordeste

In [11]:
    covid_Nordeste = covid[covid['regiao'] == 'Nordeste']

Cases

In [46]:
    f2d = bf.first_digits(covid_Nordeste.casosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 357 registries.
    First_1_Dig
    1    0.327731
    2    0.196078
    3    0.162465
    4    0.081232
    5    0.081232
    6    0.044818
    7    0.030812
    8    0.050420
    9    0.025210
    Name: Found, dtype: float64

    Test performed on 357 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_1_Dig
    3            0.124939  0.162465   2.064346

In [47]:
    f2d = bf.first_digits(covid_Nordeste.casosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 357 registries.
    First_2_Dig
    10    0.126050
    11    0.033613
    12    0.025210
    13    0.030812
    14    0.025210
    ...
    95    0.000000
    96    0.000000
    97    0.000000
    98    0.000000
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 357 registries.
    Discarded 0 records < 10 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_2_Dig
    20           0.021189  0.092437   9.163786
    50           0.008600  0.050420   8.270778
    30           0.014240  0.067227   8.226552
    10           0.041393  0.126050   7.897209
    80           0.005395  0.030812   6.194778
    70           0.006160  0.022409   3.585465
    90           0.004799  0.016807   2.900098
    40           0.010724  0.025210   2.400464

Deaths

In [48]:
    f2d = bf.first_digits(covid_Nordeste.obitosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 217 registries.
    First_1_Dig
    1    0.433180
    2    0.207373
    3    0.147465
    4    0.064516
    5    0.050691
    6    0.036866
    7    0.018433
    8    0.018433
    9    0.023041
    Name: Found, dtype: float64

    Test performed on 217 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_1_Dig
    1             0.30103   0.43318   4.169874

In [49]:
    f2d = bf.first_digits(covid_Nordeste.obitosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 217 registries.
    First_2_Dig
    10    0.322581
    11    0.004608
    12    0.036866
    13    0.018433
    14    0.000000
    ...
    95    0.000000
    96    0.000000
    97    0.000000
    98    0.000000
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 217 registries.
    Discarded 0 records < 10 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_2_Dig
    10           0.041393  0.322581  20.623910
    20           0.021189  0.161290  14.094887
    30           0.014240  0.129032  13.985809
    40           0.010724  0.059908   6.704731
    50           0.008600  0.046083   5.612170
    60           0.007179  0.032258   3.974108
    90           0.004799  0.023041   3.397428
    80           0.005395  0.018433   2.158585
    70           0.006160  0.018433   1.876766

Sudeste

In [14]:
    covid_Sudeste = covid[covid['regiao'] == 'Sudeste']

Cases

In [50]:
    f2d = bf.first_digits(covid_Sudeste.casosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 204 registries.
    First_1_Dig
    1    0.289216
    2    0.156863
    3    0.122549
    4    0.098039
    5    0.102941
    6    0.102941
    7    0.053922
    8    0.044118
    9    0.029412
    Name: Found, dtype: float64

    Test performed on 204 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_1_Dig
    6            0.066947  0.102941   1.916921

In [51]:
    f2d = bf.first_digits(covid_Sudeste.casosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 204 registries.
    First_2_Dig
    10    0.083333
    11    0.024510
    12    0.039216
    13    0.014706
    14    0.039216
    ...
    95    0.000000
    96    0.004902
    97    0.000000
    98    0.004902
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 204 registries.
    Discarded 0 records < 10 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_2_Dig
    30           0.014240  0.049020   3.897169
    10           0.041393  0.083333   2.831499
    20           0.021189  0.049020   2.517025
    80           0.005395  0.019608   2.293336
    52           0.008273  0.024510   2.173940
    60           0.007179  0.019608   1.688168
    56           0.007687  0.019608   1.548705
    82           0.005264  0.014706   1.379786

Deaths

In [52]:
    f2d = bf.first_digits(covid_Sudeste.obitosNovos, digs=1, decimals=8, confidence=80)

    Initialized sequence with 134 registries.
    First_1_Dig
    1    0.283582
    2    0.179104
    3    0.186567
    4    0.119403
    5    0.059701
    6    0.067164
    7    0.044776
    8    0.044776
    9    0.014925
    Name: Found, dtype: float64

    Test performed on 134 registries.
    Discarded 0 records < 1 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_1_Dig
    3            0.124939  0.186567   2.026942

In [53]:
    f2d = bf.first_digits(covid_Sudeste.obitosNovos, digs=2, decimals=8, confidence=80)

    Initialized sequence with 134 registries.
    First_2_Dig
    10    0.149254
    11    0.007463
    12    0.022388
    13    0.007463
    14    0.007463
    ...
    95    0.000000
    96    0.000000
    97    0.000000
    98    0.000000
    99    0.000000
    Name: Found, Length: 90, dtype: float64

    Test performed on 134 registries.
    Discarded 0 records < 10 after preparation.

    The entries with the significant positive deviations are:

                 Expected     Found    Z_score
    First_2_Dig
    30           0.014240  0.141791  12.097438
    10           0.041393  0.149254   6.051247
    40           0.010724  0.067164   5.923829
    20           0.021189  0.097015   5.794895
    60           0.007179  0.037313   3.620420
    80           0.005395  0.029851   3.275001
    50           0.008600  0.037313   3.131845
    70           0.006160  0.029851   2.952799
    41           0.010465  0.029851   1.780667
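
The Expected and Z_score columns in the outputs above can be checked by hand. The block below is a minimal sketch, assuming the standard Benford proportion log10(1 + 1/d) and a Z statistic with a 1/(2N) continuity correction. The function names are illustrative and are not part of the benford package, and `scaled_leading_digits` is a hypothetical helper that only mimics a scale-then-truncate preparation step; whether the package prepares values exactly this way is an assumption.

```python
import math


def benford_expected(d):
    """Benford proportion for a leading-digit group d (1-9, or 10-99 for two digits)."""
    return math.log10(1 + 1 / d)


def z_score(found, expected, n):
    """Z statistic of an observed proportion, with a 1/(2n) continuity correction."""
    diff = abs(found - expected)
    correction = 1 / (2 * n)
    if diff > correction:  # skip the correction when it would exceed the difference
        diff -= correction
    return diff / math.sqrt(expected * (1 - expected) / n)


def scaled_leading_digits(value, digs, decimals=8):
    """Hypothetical helper: scale a positive integer by 10**decimals, keep the first `digs` digits."""
    return int(str(value * 10 ** decimals)[:digs])


# National new cases, first digit 2: Found = 0.206927 over 1126 records.
z_2 = z_score(0.206927, benford_expected(2), 1126)

# National new cases, first two digits 20: Found = 0.095915 over 1126 records.
z_20 = z_score(0.095915, benford_expected(20), 1126)

print(benford_expected(2), z_2)    # compare with Expected 0.176091 and Z_score 2.677430
print(benford_expected(20), z_20)  # compare with Expected 0.021189 and Z_score 17.307789

# Under the scale-then-truncate assumption, a daily count of 3 falls in the "30" group.
print(scaled_leading_digits(3, digs=2))
```

One observation that follows from the outputs themselves: each two-digit test ran on the same number of records as the matching one-digit test (for example 1126 and 1126 for national cases) and discarded none, even though the data contain single-digit daily counts. If those values are padded to two digits as the last line of the sketch suggests, that alone could inflate the multiples of 10, so it may be worth repeating the two-digit tests on values of 10 or more before drawing conclusions.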