Read HTML with BeautifulSoup and find specific data




I asked a similar question before, but I need something different from what I got in the previous answers.



I have some HTML data, shown below (only the part I need).



I already have the rcpNo value, but eleId ranges from 1 to 33, and offset and length follow no regular pattern. All three values are numbers, with varying numbers of digits.



I need to read rcpNo, eleId, offset, length and dtd.



(dtd is 'dart3.xsd' here, but I have only tried this one HTML page, so other pages may have a different dtd value. That is why I want to read it from the HTML.)


# This is part of the HTML
# viewDoc(rcpNo, dcmNo, eleId, offset, length, dtd)


treeNode1.appendChild(treeNode2);

treeNode2 = new Tree.TreeNode({
text: "4. The number of stocks",
id: "7",
cls: "text",
listeners: {
click: function() {viewDoc('20180515000480', '6177478', '7', '59749', '7130', 'dart3.xsd');}
}
});
cnt++;



Similar data is repeated, so here is another part of the HTML:


treeNode2 = new Tree.TreeNode({
text: "1. Summary information",
id: "12",
cls: "text",
listeners: {
click: function() {viewDoc('20180515000480', '6177478', '12', '189335', '18247', 'dart3.xsd');}
}
});
cnt++;

treeNode1.appendChild(treeNode2);

treeNode2 = new Tree.TreeNode({
text: "2. Linked finance state",
id: "13",
cls: "text",
listeners: {
click: function() {viewDoc('20180515000480', '6177478', '13', '207823', '76870', 'dart3.xsd');}
}
});
cnt++;

treeNode1.appendChild(treeNode2);

treeNode2 = new Tree.TreeNode({
text: "3. Comment for linked finance state",
id: "14",
cls: "text",
listeners: {
click: function() {viewDoc('20180515000480', '6177478', '14', '284697', '372938', 'dart3.xsd');}
}
});
cnt++;



As you can see above, text and id change regularly. I want to read all of the dcmNo, eleId, offset, length and dtd values, especially for a specific id and text.
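One way to tie each viewDoc call to its text and id is to match a whole TreeNode block at once. The following is only a sketch under assumptions: it assumes every node lists text, then id, then a viewDoc call in that order (as in the excerpt), and the names node_pattern and parse_nodes are made up for illustration. If a node had no viewDoc call, the lazy `.*?` would run on into the next node, so that case would need extra handling.

```python
import re

# Sample copied from the excerpt above; in practice this would be the page source.
html = '''
treeNode2 = new Tree.TreeNode({
text: "1. Summary information",
id: "12",
cls: "text",
listeners: {
click: function() {viewDoc('20180515000480', '6177478', '12', '189335', '18247', 'dart3.xsd');}
}
});
cnt++;

treeNode1.appendChild(treeNode2);

treeNode2 = new Tree.TreeNode({
text: "2. Linked finance state",
id: "13",
cls: "text",
listeners: {
click: function() {viewDoc('20180515000480', '6177478', '13', '207823', '76870', 'dart3.xsd');}
}
});
cnt++;
'''

# Hypothetical pattern: the text field, the id field, then the nearest viewDoc(...) call.
node_pattern = re.compile(
    r'text:\s*"(?P<text>[^"]*)",\s*'
    r'id:\s*"(?P<id>[^"]*)",'
    r'.*?viewDoc\((?P<args>[^)]*)\)',
    re.DOTALL,
)

def parse_nodes(source):
    """Return one dict per node with its text, id and the quoted viewDoc arguments."""
    nodes = []
    for m in node_pattern.finditer(source):
        # Pull every single-quoted argument out of the viewDoc(...) call.
        args = re.findall(r"'([^']*)'", m.group("args"))
        nodes.append({"text": m.group("text"), "id": m.group("id"), "args": args})
    return nodes

nodes = parse_nodes(html)
# Pick out the node with a specific id.
wanted = [n for n in nodes if n["id"] == "13"]
```

Filtering on n["text"] instead of n["id"] would work the same way.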



I tried the following:


>>> string = "{viewDoc('20180515000480', '6177478', '6', '58846', '899', 'dart3.xsd');}"
>>> pattern = re.compile(r"viewDoc\('\d+', '(\d+)', '(\d+)', '(\d+)', '(\d+)', '(\d+)' .+\)", re.MULTILINE | re.DOTALL)
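One likely reason this attempt never matches: viewDoc takes five numeric arguments followed by the dtd name, while the pattern asks for six quoted numbers in a row. The sketch below assumes the pattern was meant to use `\d+` and escaped parentheses; the group names and the variable names attempted and fixed are labels added here for illustration.

```python
import re

string = "{viewDoc('20180515000480', '6177478', '6', '58846', '899', 'dart3.xsd');}"

# The attempted pattern, read with its escapes restored: it demands six
# quoted numbers, but the call has only five before 'dart3.xsd'.
attempted = re.compile(
    r"viewDoc\('\d+', '(\d+)', '(\d+)', '(\d+)', '(\d+)', '(\d+)' .+\)",
    re.MULTILINE | re.DOTALL,
)

# Adjusted pattern: five numeric arguments, then any quoted dtd value.
fixed = re.compile(
    r"viewDoc\('(?P<rcpNo>\d+)', '(?P<dcmNo>\d+)', '(?P<eleId>\d+)', "
    r"'(?P<offset>\d+)', '(?P<length>\d+)', '(?P<dtd>[^']+)'\)"
)

attempted_match = attempted.search(string)  # no match on the sample string
fixed_match = fixed.search(string)          # matches; see fixed_match.groupdict()
```

Matching the dtd with `[^']+` rather than a literal 'dart3.xsd' keeps the pattern usable if another page carries a different dtd value.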



and with BeautifulSoup:


>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.find_all(string = pattern)



This command matches the whole HTML, so I cannot pick out the data :(



It doesn't work as intended: it finds the first text in the HTML, which is not what I need to read.



I'm not good at Python regular expressions yet, but I have to solve this problem as soon as possible.



Please help me read the specific data from the HTML...



Thank you for your help.



Best regards



Edit



This is how I get the HTML from the URL:


from bs4 import BeautifulSoup
import pandas as pd
import requests
import re

# API_KEY and company_code are defined elsewhere.
# Parentheses let the concatenation continue across lines.
url = ("http://dart.fss.or.kr/api/search.json?auth=" + API_KEY
       + "&crp_cd=" + company_code + "&page_set=100"
       + "&start_dt=19990101&bsn_tp=A001&bsn_tp=A002&bsn_tp=A003")

json_data = requests.get(url).json()
reports = json_data['list']  # renamed from "list" to avoid shadowing the built-in

data = pd.DataFrame.from_dict(reports)

print(data['rcp_no'][0])

url2 = "http://dart.fss.or.kr/dsaf001/main.do?rcpNo=" + data['rcp_no'][0]

temp = requests.get(url2)

html = temp.text

soup = BeautifulSoup(html, "html.parser")



The HTML example above is part of the output of print(soup).
As I said, the same format is repeated many times in the HTML, and I want to read specific lines. For example, if I can find the lines below, then I want to get the data from them:


# viewDoc(rcpNo, dcmNo, eleId, offset, length, dtd)

viewDoc('20180515000480', '6177478', '7', '59749', '7130', 'dart3.xsd')

viewDoc('20180515000480', '6177478', '13', '207823', '76870', 'dart3.xsd')



as lists such as ['6177478', '7', '59749', '7130', 'dart3.xsd'] and ['6177478', '13', '207823', '76870', 'dart3.xsd'], that is, the number and text data (dcmNo, eleId, offset, length and dtd).
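The desired lists can be produced by running a regular expression over the raw page text (temp.text) rather than through soup.find_all, since the calls sit inside JavaScript and not in HTML tags. This is a minimal sketch: the short html string stands in for the real page, VIEWDOC and rows are illustrative names, and the pattern assumes every argument is single-quoted as in the excerpt.

```python
import re

# Stand-in for temp.text from the request above.
html = """
click: function() {viewDoc('20180515000480', '6177478', '7', '59749', '7130', 'dart3.xsd');}
click: function() {viewDoc('20180515000480', '6177478', '13', '207823', '76870', 'dart3.xsd');}
"""

# Five numeric arguments followed by a quoted dtd; \s* tolerates spacing changes.
VIEWDOC = re.compile(
    r"viewDoc\(\s*'(\d+)'\s*,\s*'(\d+)'\s*,\s*'(\d+)'\s*,"
    r"\s*'(\d+)'\s*,\s*'(\d+)'\s*,\s*'([^']+)'\s*\)"
)

# findall yields one tuple of six groups per call; drop the leading rcpNo.
rows = [list(groups[1:]) for groups in VIEWDOC.findall(html)]
```

Each entry in rows is then [dcmNo, eleId, offset, length, dtd] for one viewDoc call, in page order.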





I don't understand. Are you parsing JavaScript through BeautifulSoup? Where's the HTML code?
– Andrej Kesely
6 hours ago





I get the HTML with temp = requests.get(url); html = temp.text; soup = BeautifulSoup(html, "html.parser"), and what I posted is part of the print(soup) result.
– Gangil Seo
5 hours ago





Can you edit your question and post a sample of the HTML (or the URL) there, along with the result you want?
– Andrej Kesely
5 hours ago





I added more to the question, but I'm not sure my edit gives you enough of the code to help.. :(
– Gangil Seo
4 hours ago








