Displaying the tags of a web page with indentation proportional to the depth of the element in the document's tree structure | Python

Solved (see solution)

20 replies

by Edson

| 52.2k xp | 224 posts

Develop the class MyHTMLParser as a subclass of HTMLParser that, when fed an HTML file, prints the names of the start and end tags in the order in which they appear in the document, with an indentation proportional to the depth of the element in the document's tree structure. Ignore HTML elements that do not require an end tag, such as p and br.

What I tried:

from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):

    infile = open("w3c.html")
    content = infile.read()
    infile.close()
    myparser = MyHTMLParser()
    myparser.feed(content)

    def handle_starttag(self, tag, attrs): # prints the value of the href attribute, if any
        for t in tag:
            print(t)

Besides the code not doing anything, I also get:

NameError: name 'MyHTMLParser' is not defined

Could you help me?


by YURI CAUÊ GOMES MARTINS

| 128.2k xp | 25 posts

21/01/2020

Hi Edson, how are you?

I ran some tests with the code you posted above, and I believe the following error:

NameError: name 'MyHTMLParser' is not defined

is happening because you are instantiating the class inside its own body, where the name MyHTMLParser is not bound yet. Try the following code to see if it works:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        for t in tag:
            print(t)


# The file is read and the parser is created only after the class body has ended.
infile = open("w3c.html")
content = infile.read()
infile.close()
myparser = MyHTMLParser()
myparser.feed(content)

If it doesn't work, update the thread so we can keep trying to solve the problem.

Note: I followed the syntax from the Python documentation: https://docs.python.org/3/library/html.parser.html
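
A minimal sketch of why the NameError appears, assuming nothing beyond plain Python. The class names Broken and Fixed are hypothetical and only serve the illustration: a class name is bound only once its whole body has run, so using it inside that body fails.

```python
# Hypothetical minimal reproduction; Broken and Fixed are illustrative names.
try:
    class Broken:
        # The name "Broken" is bound only after the whole class body has run,
        # so referring to it at this point raises NameError.
        instance = Broken()
except NameError as error:
    message = str(error)
    print(message)


class Fixed:
    pass


# Instantiating after the class statement has finished works as expected.
instance = Fixed()
print(type(instance).__name__)
```

This is the same situation as in the original attempt, where the file reading and the MyHTMLParser() call were indented inside the class body.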

by Edson

| 52.2k xp | 224 posts

21/01/2020

@YURI CAUÊ GOMES MARTINS Thank you!

Now the code really doesn't raise the error anymore. However, my logic seems to be wrong: the program doesn't do what is being asked. Could you please help me?

by YURI CAUÊ GOMES MARTINS

| 128.2k xp | 25 posts

21/01/2020

@Edson how's it going?

Could you tell me exactly what your program is doing, so I can understand it better?

by Edson

| 52.2k xp | 224 posts

21/01/2020

@YURI CAUÊ GOMES MARTINS: Here is the HTML file to be read:

<html>
<head>
<title>W3C Mission Summary</title>
</head>

<body>
<h1>W3C Mission</h1>
<p>
The W3C mission is to lead the World Wide Web to its full potential<br>
by developing protocols and guidelines that ensure the long-term growth of the Web.
</p>
<h2>Principles</h2>
<ul>
<li>Web for All</li>
<li>Web on Everything</li>
</ul>
See the complete <a href="http://www.w3.org/Consortium/mission.html">W3C Mission document</a>.
</body>
</html>

My program's output:

h
t
m
l
h
e
a
d
t
i
t
l
e
b
o
d
y
h
1
p
b
r
h
2
u
l
l
i
l
i
a

The output should be:

html start
    head start
        title start
        title end
    head end
    body start
        h1 start
        h1 end
        h2 start
        h2 end
        ul start
            li start
...
        a end
    body end
html end

by YURI CAUÊ GOMES MARTINS

| 128.2k xp | 25 posts

21/01/2020

Edson, I read around online and found a site describing that the open function used for your "infile" variable takes two parameters, that is:

infile = open(file name, access mode)

Since you want to read the file, you can pass 'r' as the second parameter (it is also the default when the mode is omitted):

infile = open("w3c.html", "r")

infile = open("w3c.html", "r")
content = infile.read()
infile.close()
myparser = MyHTMLParser()
myparser.feed(content)
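
A small sketch to check the point about the mode, assuming a throwaway file written to a temporary folder (the stand-in content and the variable names are made up for the illustration). It reads the same file with and without "r" and also uses a with block, which closes the file automatically.

```python
import os
import tempfile

# Write a small stand-in file so the sketch runs anywhere (the folder is temporary).
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "w3c.html")
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.write("<html><body><h1>W3C Mission</h1></body></html>")

    # No mode given: open() defaults to "r" (read, text mode).
    with open(path, encoding="utf-8") as infile:
        default_mode_content = infile.read()

    # Explicit "r": same behaviour.
    with open(path, "r", encoding="utf-8") as infile:
        explicit_mode_content = infile.read()

print(default_mode_content == explicit_mode_content)
```

Since both calls read the same text, adding "r" on its own is not expected to change what the parser prints.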

by Edson

| 52.2k xp | 224 posts

21/01/2020

@YURI CAUÊ GOMES MARTINS:

The program's output is still wrong...

by YURI CAUÊ GOMES MARTINS

| 128.2k xp | 25 posts

21/01/2020

Hi Edson, how are you? I made a change: instead of the "for" loop, I used a single "print" inside the handler. I'm not sure this is what you expect, but take a look:

from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):


    def handle_starttag(self, tag, attrs):  # called once for each start tag
        print(tag, "start")

infile = open("w3c.html", "r")
content = infile.read()
infile.close()
myparser = MyHTMLParser()
myparser.feed(content)

by YURI CAUÊ GOMES MARTINS

| 128.2k xp | 25 posts

21/01/2020

My terminal printed this:

html start
head start
title start
body start
h1 start
p start
br start
h2 start
ul start
li start
li start
a start
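
For reference, a tiny sketch of why the earlier version printed one letter per line. The variable names are made up for the illustration: the tag argument is a plain string, and a for loop over a string visits its characters.

```python
tag = "html"

# Iterating over a string yields its characters one at a time.
characters = [t for t in tag]
print(characters)

# Printing the string itself gives the whole tag name on one line.
line = "{} start".format(tag)
print(line)
```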

by Edson

| 52.2k xp | 224 posts

21/01/2020

@YURI CAUÊ GOMES MARTINS

Much better. What's still missing is the "indentation proportional to the depth of the element in the document's tree structure" part... I have no idea how to do that... The "end" lines for the tags are also missing.

by YURI CAUÊ GOMES MARTINS

| 128.2k xp | 25 posts

21/01/2020

Ouch, Edson, I only now noticed that the output you want has a "start" and an "end" for each tag. The big problem is that we only managed to list the tags; marking where each one starts and ends isn't implemented yet. We are close, but I don't know how to produce the expected output =(

by Edson

| 52.2k xp | 224 posts

21/01/2020

@YURI CAUÊ GOMES MARTINS Thank you! I'm trying over here too!

by YURI CAUÊ GOMES MARTINS

| 128.2k xp | 25 posts

21/01/2020

@Edson thank you!!! I'll keep trying here too. If you get it working, post it here and mark the thread as solved =)

by Edson

| 52.2k xp | 224 posts

21/01/2020

@YURI CAUÊ GOMES MARTINS

I'm reading: https://docs.python.org/3/library/html.parser.html

by Edson

| 52.2k xp | 224 posts

21/01/2020

@YURI CAUÊ GOMES MARTINS Now only the indentation part is missing:

from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):


    def handle_starttag(self, tag, attrs):  # called once for each start tag
        print(tag, "start")

    def handle_endtag(self, tag):
        print(tag, "end")



infile = open("w3c.html", "r")
content = infile.read()
infile.close()
myparser = MyHTMLParser()
myparser.feed(content)

Output:

html start
head start
title start
title end
head end
body start
h1 start
h1 end
p start
br start
p end
h2 start
h2 end
ul start
li start
li end
li start
li end
ul end
a start
a end
body end
html end

by YURI CAUÊ GOMES MARTINS

| 128.2k xp | 25 posts

21/01/2020

@Edson I got to the solution: the way to handle the tags dynamically is to add another method that handles the closing tag, see below:

from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):


    def handle_starttag(self, tag, attrs):  # prints each start tag and its attributes
        print("Start tag:", tag)
        for attr in attrs:
            print("attr:", attr)

    def handle_endtag(self, tag):  # prints each end tag
        print("End tag :", tag)



infile = open("w3c.html","r")
content = infile.read()
infile.close()
myparser = MyHTMLParser()
myparser.feed(content)

I followed the documentation at: https://docs.python.org/2/library/htmlparser.html

by YURI CAUÊ GOMES MARTINS

| 128.2k xp | 25 posts

21/01/2020

Note: the output isn't exactly as you described above, because it follows the nesting of the HTML document, that is, the html tag will be the last one to close, and so on...

I hope this helps!!!

by Edson

| 52.2k xp | 224 posts

23/01/2020

@YURI CAUÊ GOMES MARTINS

I found a solution on the internet, but I didn't understand any of it:

from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):
    'HTML doc parser that prints tags indented '

    def __init__(self):
        'initializes the parser and the initial indentation'
        HTMLParser.__init__(self)
        self.indent = 0            # initial indentation value

    def handle_starttag(self, tag, attrs):
        '''prints start tag with an indentation proportional
           to the depth of the tag's element in the document'''
        if tag not in {'br','p'}:
            print('{}{} start'.format(self.indent*' ', tag))
            self.indent += 4

    def handle_endtag(self, tag):
        '''prints end tag with an indentation proportional
           to the depth of the tag's element in the document'''
        if tag not in {'br','p'}:
            self.indent -= 4
            print('{}{} end'.format(self.indent*' ', tag))

Solution!

by Lucas Peixoto de Alencar Rocha

| 1057.1k xp | 1640 posts

Instructor

23/01/2020

Hi Edson and Yuri,

You arrived at a really nice solution. I'll try to help a bit with the logic of this last piece of code.

Basically, a variable indent is created to store the number of spaces (the indentation) to print before each tag. It starts at 0, because the outermost tag is printed with 0 spaces:

html start
html end

Then, whenever a new start tag is found, handle_starttag() first prints it preceded by the current number of spaces, using print('{}{} start'.format(self.indent*' ', tag)), and then adds 4 to indent so that everything nested inside that tag is indented by 4 more spaces:

html start
    head start
    head end
html end

And to undo the indentation correctly, whenever a tag is closed, handle_endtag() subtracts 4 from indent before printing the end tag, so the end tag lines up with its start tag.
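
The steps above can be sketched end to end as follows. This is a variant of the solution posted in the thread rather than the original code: the class name IndentedTagParser, the SKIPPED and STEP constants, the lines list and the shortened sample document are all assumptions made for the illustration. Lines are collected in a list and printed at the end, which makes the result easy to inspect.

```python
from html.parser import HTMLParser


class IndentedTagParser(HTMLParser):
    """Collects 'tag start' / 'tag end' lines indented by nesting depth."""

    SKIPPED = {"p", "br"}  # elements the exercise asks to ignore
    STEP = 4               # spaces added per nesting level

    def __init__(self):
        super().__init__()
        self.indent = 0    # current number of leading spaces
        self.lines = []    # output lines, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in self.SKIPPED:
            # Record the tag at the current depth, then go one level deeper.
            self.lines.append(" " * self.indent + tag + " start")
            self.indent += self.STEP

    def handle_endtag(self, tag):
        if tag not in self.SKIPPED:
            # Come back one level first so the end lines up with its start.
            self.indent -= self.STEP
            self.lines.append(" " * self.indent + tag + " end")


# Shortened stand-in for w3c.html (an assumption, not the full file).
sample = (
    "<html><head><title>T</title></head>"
    "<body><p>Hi<br>there</p><ul><li>A</li></ul></body></html>"
)

parser = IndentedTagParser()
parser.feed(sample)
parser.close()
print("\n".join(parser.lines))
```

Skipping br matters for the arithmetic: br has a start tag but never an end tag, so counting it would add 4 spaces that are never removed. With p and br ignored, every counted start tag in this sample has a matching end tag and indent finishes back at 0.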

by YURI CAUÊ GOMES MARTINS

| 128.2k xp | 25 posts

23/01/2020

Hi Lucas, thank you very much for the explanation!!!

by Edson

| 52.2k xp | 224 posts

23/01/2020

@Lucas Peixoto de Alencar Rocha: Thank you very much! @YURI CAUÊ GOMES MARTINS: you helped a lot too! Thank you!