Read HTML with BeautifulSoup and find specific data

I asked a similar question before, but I need something different from what I got there.

I have some HTML data, shown below (only the part I need).

I already have the rcpNo value, but eleId ranges from 1 to 33, and offset and length follow no regular pattern. All three are numeric, with varying numbers of digits.

I need to read rcpNo, eleId, offset, length and dtd.

(dtd is 'dart3.xsd' here, but I have only tried this one HTML page, so other pages may have a different dtd value. That is why I want to read it from the HTML itself.)

    # This is part of the HTML
    # viewDoc(rcpNo, dcmNo, eleId, offset, length, dtd)
    treeNode1.appendChild(treeNode2);
    treeNode2 = new Tree.TreeNode({
        text: "4. The number of stocks",
        id: "7",
        cls: "text",
        listeners: {
            click: function() {viewDoc('20180515000480', '6177478', '7', '59749', '7130', 'dart3.xsd');}
        }
    });
    cnt++;

Similar data is repeated, so I have included only part of the HTML:

    treeNode2 = new Tree.TreeNode({
        text: "1. Summary information",
        id: "12",
        cls: "text",
        listeners: {
            click: function() {viewDoc('20180515000480', '6177478', '12', '189335', '18247', 'dart3.xsd');}
        }
    });
    cnt++;
    treeNode1.appendChild(treeNode2);
    treeNode2 = new Tree.TreeNode({
        text: "2. Linked finance state",
        id: "13",
        cls: "text",
        listeners: {
            click: function() {viewDoc('20180515000480', '6177478', '13', '207823', '76870', 'dart3.xsd');}
        }
    });
    cnt++;
    treeNode1.appendChild(treeNode2);
    treeNode2 = new Tree.TreeNode({
        text: "3. Comment for linked finance state",
        id: "14",
        cls: "text",
        listeners: {
            click: function() {viewDoc('20180515000480', '6177478', '14', '284697', '372938', 'dart3.xsd');}
        }
    });
    cnt++;

As you can see above, text and id change regularly. I want to read all of the dcmNo, eleId, offset, length and dtd values, especially for a specific id and text.

I tried the following:

    >>> string = "{viewDoc('20180515000480', '6177478', '6', '58846', '899', 'dart3.xsd');}"
    >>> pattern = re.compile(r"viewDoc\('\d+', '(\d+)', '(\d+)', '(\d+)', '(\d+)', '(\d+)' .+\)", re.MULTILINE | re.DOTALL)

and with BeautifulSoup:

    >>> soup = BeautifulSoup(html, 'html.parser')
    >>> soup.find_all(string=pattern)

This command returns the entire HTML, so I cannot pick out the data I need :(

In other words, it doesn't work: it finds the first block of text in the HTML, which I don't need to read.
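One possible reason the attempt returns whole blocks of text: when `find_all(string=...)` is given a compiled pattern, BeautifulSoup returns each text node that the pattern matches, not the captured groups. A minimal sketch of an alternative, assuming every `viewDoc(...)` call has six single-quoted arguments, is to run the regular expression over the raw page text with `findall`. The `sample` string and the `VIEWDOC` name are illustrative only; in practice the input would be the `html` string.

```python
import re

# Illustrative input copied from the snippet in the question; in practice
# this would be the full page text (the `html` string).
sample = """
treeNode2 = new Tree.TreeNode({
    text: "4. The number of stocks", id: "7", cls: "text",
    listeners: { click: function() {viewDoc('20180515000480', '6177478', '7', '59749', '7130', 'dart3.xsd');} }
});
cnt++;
"""

# Six single-quoted arguments. The first (rcpNo) is matched but not captured;
# dcmNo, eleId, offset and length are digits; dtd is any run of non-quote
# characters, so values other than 'dart3.xsd' are captured as well.
VIEWDOC = re.compile(
    r"viewDoc\(\s*'\d+'\s*,\s*'(\d+)'\s*,\s*'(\d+)'\s*,"
    r"\s*'(\d+)'\s*,\s*'(\d+)'\s*,\s*'([^']+)'\s*\)"
)

# findall returns one tuple of captured groups per viewDoc(...) call.
rows = VIEWDOC.findall(sample)
print(rows)
# [('6177478', '7', '59749', '7130', 'dart3.xsd')]
```

Note that the last group uses `[^']+` rather than `\d+`, because the dtd argument is not numeric.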

I'm not good at Python regular expressions yet, but I have to solve this problem as soon as possible.

Please help me read the specific data from the HTML.

Thank you for your help.

Best regards

Edit

This is how I get the HTML from the URL:

    from bs4 import BeautifulSoup
    import pandas as pd  # needed for pd.DataFrame below
    import requests
    import re

    # API_KEY and company_code are defined elsewhere (not shown)
    url = ("http://dart.fss.or.kr/api/search.json?auth=" + API_KEY
           + "&crp_cd=" + company_code
           + "&page_set=100"
           + "&start_dt=19990101&bsn_tp=A001&bsn_tp=A002&bsn_tp=A003")
    json_data = requests.get(url).json()
    filings = json_data['list']  # renamed from `list` to avoid shadowing the built-in
    data = pd.DataFrame.from_dict(filings)
    print(data['rcp_no'][0])

    # Fetch the viewer page for the first receipt number
    url2 = "http://dart.fss.or.kr/dsaf001/main.do?rcpNo=" + data['rcp_no'][0]
    temp = requests.get(url2)
    html = temp.text
    soup = BeautifulSoup(html, "html.parser")

The HTML example above is part of the output of print(soup).
As I said, the HTML contains many entries in the same format, and I want to read specific lines. For example, if I can find the lines below, I want to extract their data:

    # viewDoc(rcpNo, dcmNo, eleId, offset, length, dtd)
    viewDoc('20180515000480', '6177478', '7', '59749', '7130', 'dart3.xsd')
    viewDoc('20180515000480', '6177478', '13', '207823', '76870', 'dart3.xsd')

The result I want looks like ['6177478', '7', '59749', '7130', 'dart3.xsd'], ['6177478', '13', '207823', '76870', 'dart3.xsd'], that is, the number and text data (dcmNo, eleId, offset, length and dtd).
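The desired output could be sketched roughly as follows, pairing each node's text and id with the arguments of its `viewDoc(...)` call so that specific entries can be selected. This is only a sketch under assumptions: it assumes each `TreeNode` literal lists `text`, then `id`, then the click handler, as in the snippets shown, and that no `}` appears between the id and the `viewDoc` call. The names `NODE`, `find_nodes` and `wanted` are hypothetical.

```python
import re

# Illustrative input copied from the snippets in the question; in practice
# this would be the full page text (the `html` string).
sample = '''
treeNode2 = new Tree.TreeNode({ text: "4. The number of stocks", id: "7", cls: "text", listeners: { click: function() {viewDoc('20180515000480', '6177478', '7', '59749', '7130', 'dart3.xsd');} } }); cnt++;
treeNode1.appendChild(treeNode2);
treeNode2 = new Tree.TreeNode({ text: "2. Linked finance state", id: "13", cls: "text", listeners: { click: function() {viewDoc('20180515000480', '6177478', '13', '207823', '76870', 'dart3.xsd');} } }); cnt++;
'''

NODE = re.compile(
    r'text:\s*"(?P<text>[^"]*)"\s*,\s*id:\s*"(?P<id>\d+)"'  # node label and id
    r"[^}]*?"  # stay inside the same object literal (assumption: no '}' here)
    r"viewDoc\(\s*'(?P<rcpNo>\d+)'\s*,\s*'(?P<dcmNo>\d+)'\s*,\s*'(?P<eleId>\d+)'\s*,"
    r"\s*'(?P<offset>\d+)'\s*,\s*'(?P<length>\d+)'\s*,\s*'(?P<dtd>[^']+)'\s*\)"
)


def find_nodes(page_text):
    """Return one dict per tree node, keyed by the named groups above."""
    return [m.groupdict() for m in NODE.finditer(page_text)]


nodes = find_nodes(sample)

# Keep only the entries of interest, here selected by id.
wanted = [n for n in nodes if n["id"] in {"7", "13"}]
for n in wanted:
    print([n["dcmNo"], n["eleId"], n["offset"], n["length"], n["dtd"]])
# ['6177478', '7', '59749', '7130', 'dart3.xsd']
# ['6177478', '13', '207823', '76870', 'dart3.xsd']
```

Selecting by label instead of id would be a matter of filtering on `n["text"]`.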

I don't understand. Are you parsing JavaScript through BeautifulSoup? Where's the HTML code?
– Andrej Kesely
6 hours ago

I fetch the HTML with temp = requests.get(url), html = temp.text, soup = BeautifulSoup(html, "html.parser"), and what I posted is part of the print(soup) output.
– Gangil Seo
5 hours ago

Can you edit your question and post a sample of the HTML (or the URL) there, along with the result you want?
– Andrej Kesely
5 hours ago

I added an edit, but I'm not sure it is enough to supply the code you need in order to help :(
– Gangil Seo
4 hours ago
